Background & Summary

Walking is a fundamental yet challenging feature of human movement, allowing individuals to freely explore their environment. The challenge increases when irregular surface topographies are encountered. Unfortunately, gait research is conducted almost exclusively in laboratories with ideal surfaces, making it difficult to extrapolate how irregular surfaces affect movement. Past projects1,2,3,4, including work from our group5,6, have implemented irregular surfaces in laboratories, providing insights into biomechanical adaptations and clinical applications; however, the data associated with these studies are not publicly available. Few irregular surface gait datasets are freely available in public repositories. Most recently, Laschowski et al.7 published a dataset of wearable camera images during human movement in different environments. While these data are relevant for image classification and other engineering applications8, they do not provide any biomechanical gait data related to the participants. In 2020, Luo et al.9 released a dataset comprising 30 healthy young participants walking over several outdoor surfaces while fitted with inertial measurement unit (IMU) sensors on their lower limbs. Our group has used that dataset to develop IMU-based machine learning classification algorithms10,11 and to assess biomechanical adaptations to irregular surfaces12. It has also been widely cited and used by others (for example, see13,14). Nonetheless, the dataset is rather small, which limits its applications, especially for deep learning projects.

Here, we present a new irregular surface dataset comprising 134 participants walking over surfaces with 4 levels of irregularity while fitted with IMU sensors on the trunk and right lower limb (foot, shank, and thigh). The data were collected over three days in the congress hall of the North American Congress on Biomechanics (NACOB) conference in 2022. This data descriptor describes the methods used to collect and process this publicly available dataset. Moreover, we conduct a technical validation showing that the data contain sufficient information for machine learning classifiers and biomechanical kinematic analyses, and we share associated usage notes and code, allowing others to reproduce our work.

Methods

Participants

A total of 162 participants were recruited from attendees of the NACOB conference, held August 22–25, 2022, in Ottawa, Canada. Inclusion criteria were purposely broad; only participants who reported being unable to safely walk over irregular surfaces were excluded. Any gait pathologies or relevant impairments were noted and retained in the metadata of this data descriptor. Participants signed a consent form prior to participation, allowing the inclusion of their anonymized data in a public repository. The form was available at our conference kiosk, giving participants ample time to review its content prior to participation. The University of Montreal ethics committee (Comité d’éthique de la recherche clinique of the Université de Montréal) approved the study under project number 2022-1557. The data underwent rigorous inspection to identify inconsistencies, errors, and missing information. Trials in which all 4 sensors malfunctioned (4 participants), the video failed to record (3 participants), or the synchronization video was missing (21 participants) were discarded. Trials in which only 1, 2, or 3 sensors malfunctioned were kept to provide as much raw data as possible. The final dataset comprises 134 participants. Anthropometric data for these remaining participants are presented in Table 1.

Table 1 Participant anthropometrics.

Experimental protocol

First, anthropometric data (age, height, weight, limb length) were measured (Table 1). Second, participants were fitted with wireless IMUs (Dot, Xsens Tech. B.V., Enschede, Netherlands) at the lower back and right thigh, shank, and foot. Sensor data were collected at 120 Hz. These sensors were selected given Xsens’ reputation for highly accurate and reliable gait outputs15. Moreover, the Dot system includes an open software development kit (SDK), allowing the use of bespoke smartphone data collection applications. Finally, participants were instructed to walk back and forth at their own pace on a circuit of five different surface types: 1 = irregular (low), 2 = artificial grass, 3 = artificial paving stone, 4 = irregular (high), 5 = flat industrial carpet (Figs. 1, 2), for a total duration of 2 minutes. The irregular surfaces (low and high) were manufactured by Terrasensa (Otto Bock HealthCare GmbH, Duderstadt, Germany) and are primarily used in physiotherapy clinics for gait rehabilitation. Both are shock-absorbing, with an average height of 5 centimeters for the high irregular surface and 2 centimeters for the low irregular surface. The remaining surfaces were acquired from local hardware stores. The average height of the artificial grass blades was 2 centimeters, and the paving stone surface had only minor height variations (approximately 0.2 centimeters). The surfaces were selected because they were relatively easy to transport to and install at the conference venue and, based on previous research, should drive subtle but important biomechanical gait adaptations compared to the standard surface6. Video recordings (GoPro Inc., San Mateo, USA) were collected at 120 Hz for each trial. The video recordings allowed for post-hoc partitioning of the trials by surface and thus permitted quicker data collection (a single continuous recording vs multiple starts and stops). Since participants walked at their own pace, the time spent on each surface varied across participants. On average, 16.6% of the time was spent on surface 1, 25.0% on surface 2, 24.1% on surface 3, 22.8% on surface 4, and 11.5% on surface 5.

Fig. 1
figure 1

Walking circuit. Participants walked back and forth following this circuit. The turns were performed on the carpet surface.

Fig. 2
figure 2

Walking surfaces. Surfaces labeled as 1: irregular (low), 2: artificial grass, 3: artificial paving stone, 4: irregular (high), 5: flat industrial carpet (conference hall flooring).

Video annotation

To annotate the recorded videos for each trial, we developed an open-source graphical tool (https://github.com/oussema-dev/video_annotation). The annotation process involved marking the start and end of surface contact. The initiation of contact was identified when the heel of the right foot first touched the surface, and the termination was marked when the toe concluded its last contact. As a result, the 180° turning maneuvers required for participants to travel from one surface to the next were excluded from the dataset. Two annotators performed this task in parallel for the same videos, and a third annotator (O.J.) cross-validated their annotations following these rules: (1) trials with annotation discrepancies exceeding 120 frames between annotators (equivalent to one second) were re-annotated, and the remaining annotations for the rest of the trial were aligned with the annotator closest to the corrected segment; (2) trials with annotation discrepancies of less than 120 frames were tagged following the annotations of either annotator 1 or annotator 2 (chosen at random).
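The following is a minimal sketch of this cross-validation rule, assuming each annotator's output is a list of (start_frame, end_frame) pairs per surface segment; the variable names and data structure are illustrative and do not reflect the annotation tool's actual interface.

```python
import random

MAX_DISCREPANCY = 120  # frames; equivalent to one second at 120 Hz


def cross_validate(annotations_1, annotations_2):
    """Return validated annotations, or None if the trial needs re-annotation."""
    for (start_1, end_1), (start_2, end_2) in zip(annotations_1, annotations_2):
        if (abs(start_1 - start_2) > MAX_DISCREPANCY
                or abs(end_1 - end_2) > MAX_DISCREPANCY):
            return None  # discrepancy exceeds one second: re-annotate the trial
    # All discrepancies are within 120 frames: keep one annotator at random.
    return random.choice([annotations_1, annotations_2])
```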

Data synchronization

The next step involved determining the difference in start times between the IMU data collection and the video recordings in order to offset the video annotations accordingly. To do so, we recorded the timestamps at which the video and the IMU sensors started recording. The IMU sensors were triggered using an iPhone, and the GoPro was triggered using a laptop; both device clocks were synchronized to the same time server. Typically, the sensors started recording before the GoPro, so the sensors’ timestamp preceded the video’s timestamp. Subtracting the two timestamps yielded the offset, which allowed us to align the video and sensor data accurately. On average, the offset was 719 frames (nearly 6 seconds). The sensor data of each trial were then trimmed according to the determined offset, ensuring precise synchronization between the video and sensor data.
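A minimal sketch of this step follows, assuming the two start times are available as datetime strings; the timestamps and variable names are illustrative only.

```python
from datetime import datetime

FS = 120  # both the IMUs and the GoPro recorded at 120 Hz

# Hypothetical start times; the sensors typically started first.
imu_start = datetime.fromisoformat("2022-08-22 10:15:02.341000")
video_start = datetime.fromisoformat("2022-08-22 10:15:08.333000")

# Offset between the two recordings, expressed in frames.
offset_frames = round((video_start - imu_start).total_seconds() * FS)

# Trimming the leading IMU samples recorded before the video started lets
# frame-based video annotations index directly into the sensor data, e.g.:
# imu_data = imu_data.iloc[offset_frames:].reset_index(drop=True)
```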

Data formatting

The formatting phase encompassed the concatenation of sensor data with their respective annotations and anthropometric data for each participant. Missing anthropometric data for the 5 participants (3.7%) who did not provide this information during data collection were imputed with NaN values. For some trials, an additional step involved aligning the data by adding missing sensor data columns (imputed with NaN values) to ensure uniform column dimensions across all trials. Specifically, 5 participants (3.7%) lacked foot sensor data, 7 (5.2%) lacked shank sensor data, 10 (7.5%) lacked thigh sensor data, and 7 (5.2%) lacked trunk sensor data. Figure 3 illustrates these steps diagrammatically.
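Below is a minimal sketch of the column-alignment step, assuming pandas DataFrames; the IMU column names used here are illustrative stand-ins for the 24 columns described in Table 2.

```python
import pandas as pd

SENSORS = ["foot", "shank", "thigh", "trunk"]
CHANNELS = ["acc_x", "acc_y", "acc_z", "gyr_x", "gyr_y", "gyr_z"]
IMU_COLUMNS = [f"{s}_{c}" for s in SENSORS for c in CHANNELS]  # 24 columns


def align_columns(trial: pd.DataFrame) -> pd.DataFrame:
    """Ensure every trial shares the same 24 IMU columns; columns missing
    because of a malfunctioning sensor are added and imputed with NaN."""
    extras = [c for c in trial.columns if c not in IMU_COLUMNS]
    return trial.reindex(columns=IMU_COLUMNS + extras)
```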

Fig. 3
figure 3

Steps performed for the processing of the raw data files. Missing anthropometric data were imputed with NaN values and merged with surface annotations and raw data files. The align data step added missing sensor data columns if necessary (imputed with NaN values), ensuring uniform column dimensions across all trials.

Data Records

All published raw data are fully anonymized and available in the raw_data folder on figshare16. Each participant’s trial data file is named P_x.csv, where x denotes the participant number. The first 24 columns are IMU data, column 25 is the participant ID, columns 26 to 32 are the anthropometric data, and the last column, “class”, is the surface being walked on. Table 2 describes the data columns in each file. We also provide the concatenated_data.csv file in the data folder. This file combines all participant data, making it easier to run machine learning tasks directly rather than processing each participant file individually; it preserves the same column structure as the raw data files. To summarize the data for each surface visually, we plotted radar charts of the mean, max, and min resultant (magnitude) vectors of the accelerometer and gyroscope signals for the different sensor placements (foot, shank, thigh, and trunk) (Fig. 4).
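As an example of working with these files, the sketch below loads one participant's data and computes the resultant accelerometer magnitude for the foot sensor; the column names are assumptions, and Table 2 should be consulted for the actual headers.

```python
import numpy as np
import pandas as pd

trial = pd.read_csv("raw_data/P_1.csv")

# Resultant (magnitude) acceleration of the foot sensor, sample by sample.
acc = trial[["foot_acc_x", "foot_acc_y", "foot_acc_z"]].to_numpy()
acc_magnitude = np.linalg.norm(acc, axis=1)

# Per-surface summaries of this kind underlie the radar charts in Fig. 4.
print(trial.groupby("class").size())
print(acc_magnitude.mean(), acc_magnitude.max(), acc_magnitude.min())
```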

Table 2 Data columns description.
Fig. 4
figure 4

Mean (a), max (b) and min (c) magnitude vectors for the foot, shank, thigh, and trunk sensors, across all surfaces.

Technical Validation

For technical validation, we used the concatenated_data.csv file (generated using the provided concatenation script; see the usage notes) to extract 8 statistical features (mean, minimum, maximum, standard deviation, interquartile range, median absolute deviation, area under the curve, and signed area under the curve) for each IMU data column, using Python functions implemented within the pandas17, scipy18, and numpy19 packages. This process resulted in 192 features (24 × 8). For the purposes of this demonstration, we developed a binary classification model: irregular (surfaces 1 and 4) vs flat (surface 5). The dataset was then used to train an XGBoost model. This architecture was chosen for its ability to model complex relationships, its native handling of missing data, and its robust performance20. The model was coupled with an inter-subject (subject-wise) splitting approach to partition the dataset into training and testing sets, ensuring that all trials from the same participant resided within a single set21. Prior to fitting the XGBoost model, we used filter methods for feature reduction, discarding highly correlated features, features with invariant values across the dataset, and features with variance below 0.1. The resulting model yielded an accuracy score of 95.8% on the test set.
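The sketch below illustrates this pipeline at a reduced scale: statistical features per IMU column, a subject-wise split, and an XGBoost classifier. The participant ID and class column names are assumptions, the feature reduction step is omitted, and segmentation is simplified to one segment per participant and surface, so this is not the exact published pipeline (see xgb_statistical_features.py for that).

```python
import numpy as np
import pandas as pd
from scipy.integrate import trapezoid
from sklearn.model_selection import GroupShuffleSplit
from xgboost import XGBClassifier

data = pd.read_csv("data/concatenated_data.csv")
data = data[data["class"].isin([1, 4, 5])]  # irregular (1, 4) vs flat (5)
imu_cols = data.columns[:24]  # the 24 IMU data columns


def statistical_features(segment: pd.DataFrame) -> pd.Series:
    """The 8 statistical features for each IMU column (NaN-aware)."""
    out = {}
    for col in imu_cols:
        x = segment[col].to_numpy(dtype=float)
        out[f"{col}_mean"] = np.nanmean(x)
        out[f"{col}_min"] = np.nanmin(x)
        out[f"{col}_max"] = np.nanmax(x)
        out[f"{col}_std"] = np.nanstd(x)
        out[f"{col}_iqr"] = np.nanpercentile(x, 75) - np.nanpercentile(x, 25)
        out[f"{col}_mad"] = np.nanmedian(np.abs(x - np.nanmedian(x)))
        out[f"{col}_auc"] = trapezoid(np.abs(np.nan_to_num(x)))
        out[f"{col}_sauc"] = trapezoid(np.nan_to_num(x))
    return pd.Series(out)


# One 192-dimensional feature vector per participant-surface segment;
# "id" is an assumed name for the participant ID column.
X = data.groupby(["id", "class"]).apply(statistical_features)
y = (X.index.get_level_values("class") != 5).astype(int)  # 1 = irregular
groups = X.index.get_level_values("id")

# Subject-wise split: all segments of a participant stay in one set.
train, test = next(GroupShuffleSplit(n_splits=1, test_size=0.2,
                                     random_state=0).split(X, y, groups))

model = XGBClassifier().fit(X.iloc[train], y[train])
print(f"test accuracy: {model.score(X.iloc[test], y[test]):.3f}")
```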

Usage Notes

The raw data are stored within the raw_data folder and can be imported using the read_data.py Python script. To streamline dataset manipulation, we provide three additional Python scripts. A concatenation script (concatenate.py) combines the raw data files vertically and generates a CSV file named concatenated_data.csv. A class management script (fuse_classes.py) allows users to exclude specific surfaces from the dataset and/or fuse surfaces into the same class; it modifies the concatenated_data.csv file. Finally, the calculate_statistical_features.py script extracts statistical features per signal and surface segment from the concatenated data; the output is saved as a CSV file named statistical_features.csv. The resulting data files are stored within the data folder. The code related to training the feature-based model (xgb_statistical_features.py) is provided in the repository.
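For orientation, the snippet below reproduces in pandas the kind of class management that fuse_classes.py performs; the script's actual interface may differ, so this is only an illustrative equivalent.

```python
import pandas as pd

data = pd.read_csv("data/concatenated_data.csv")

# Exclude the paving stone surface and fuse both irregular surfaces into
# a single class (surface labels follow Fig. 2).
data = data[data["class"] != 3]
data["class"] = data["class"].replace({4: 1})

data.to_csv("data/concatenated_data.csv", index=False)
```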