Introduction

The upper limb serves many essential roles during daily life—eating lunch, opening a door, and embracing another. Each year 800,000 people experience a new stroke in the US alone, which affects upper limb ability in the majority of cases1,2.

Rehabilitation has been described as a cycle predicated on assessment3. Assessment of the post-stroke upper limb has cultivated the development of many validated methods, which span across the important domains of the International Classification of Functioning, Disability and Health4. According to a centralized database, over 20 methods are documented for assessment of the post-stroke upper limb and a majority involve observation-based assessment5. While human observation is relatively simple, inherent drawbacks of this approach include subjectivity due to human evaluators and poor resolution for subtle change due to coarse ordinal scales6. While a complement of instrument-based methods for assessment is present in research literature, clinical practice continues to rely heavily on observation-based assessments.

Kinematic assessment involves quantitative study of the body, limbs and joints during movement. Over the last two decades, researchers have adapted advancing technologies to study upper limb movement7. With potential for increased objectivity, these advances offer exciting complementary tools to existing assessment methods. Perhaps unsurprisingly, the rehabilitation research community is now leveraging kinematic assessment to better understand recovery trends for the upper limb among stroke survivors. This has led to well-documented protocols, established normative data, and clinically validated kinematic metrics8,9.

The application of kinematic assessment in research is benefiting our understanding of post-stroke rehabilitation. Compensatory movement and restitution of pre-stroke movement have been recognized as important and unique phenomena; the former is illustrated by a stroke survivor’s truncal lean during reaching tasks to offset reduced arm extension at the elbow10. While this behavior may be observable, it is not well captured by observation-based assessments, which typically employ ordinal scales and often emphasize task completion over task quality. However, using kinematic measures, behaviors can be quantified on a granular scale to determine the status of an individual’s motor recovery along a continuous spectrum between compensation and restitution11,12,13.

Advances in kinematic assessment are gaining momentum and recently culminated in consensus statements recommending kinematic metrics for standardized use in stroke research14. However, the authors concede that several barriers exist to these recommendations. Namely, kinematic assessment relies traditionally on marker-based optoelectronic systems that are considered time-consuming and impractical for broad use15.

To potentially reduce these barriers, computer vision (CV) has been considered as a means towards markerless motion capture16,17,18,19,20. CV is a subfield of artificial intelligence that seeks to extract useful information from a digital image21. For markerless motion capture applications, the fundamental ingredients include ordinary cameras (e.g. RGB cameras in smartphones) and human pose estimation, a machine learning solution capable of estimating anatomic keypoints of the human body from a sequence of two-dimensional images (i.e. video). A variety of machine learning solutions are now available for human pose estimation including OpenPose22, MediaPipe23, PoseNet24, and MoveNet25. When using two or more cameras, the two-dimensional human pose estimates can then be lifted to three-dimensional space, which completes a CV pipeline capable of capturing kinematics of the human body and limbs without need for markers.

Numerous studies support the potential application of CV as an assistive tool in medicine26. For example, CV has been used to estimate risk of musculoskeletal disorders by assessing the ergonomics of posture27. To support in-home rehabilitation, another study introduced a telehealth system based on CV28. To support applications in human kinematic assessment, several studies have explored the concurrent validity of CV systems versus traditional marker-based motion capture systems for locating joint positions, which have revealed errors on the scale of 20–30 mm16,17. In recent works, locomotion studies with healthy subjects have demonstrated the potential of CV for predicting important kinematic-based clinical metrics such as walking speed and gait deviation index29.

With regards to stroke rehabilitation, prior studies have applied CV to evaluate upper limb movement. Many early implementations leveraged a popular off-the-shelf camera containing an integrated depth sensor (i.e. Microsoft Kinect RGB-D) along with its associated software (i.e. Microsoft Kinect Skeleton Tracking algorithm)30,31,32,33,34. While one of these studies considers using CV to achieve a well-recognized upper limb assessment in stroke rehabilitation research33, none of these studies consider CV to achieve the kinematic metrics recommended for standardized use in stroke rehabilitation research14.

In more recent implementations of CV, studies have utilized common hardware (e.g. handheld RGB cameras) for assessment during stroke rehabilitation. For example, by applying human pose estimation (i.e. AlphaPose) to extract movement data of the hands, body, and face from camera footage, investigators then used this data to train a deep neural network to estimate the possibility of stroke35. In another study, movements of the body were again extracted from RGB camera footage using a similar human pose estimation (i.e. MediaPipe), and the investigators trained a binary classification model to detect compensatory upper limb movements among stroke survivors36. In a similar study using the same human pose estimation and a single RGB camera, investigators created virtual alternatives for common tests of distal upper limb dexterity (e.g. Box and Block Test)37. In a recent feasibility study using two RGB cameras, a CV system was utilized for 3D motion tracking of fine hand motor skills38. This study provided initial evidence for accurate tracking in both the coronal and sagittal planes, and it demonstrated feasibility of object tracking during manipulation (e.g. moving small block with chopsticks).

Despite promising validation efforts, there remain limited studies to support use of CV in rehabilitation populations39. Our objective is to move forward validation efforts of CV for post-stroke upper limb assessment. Leveraging an open-source CV solution, common RGB cameras, and collaboration between medicine and engineering, we present a pilot study of a markerless motion capture system developed in-house for capturing kinematic metrics recommended for standardized assessment of the post-stroke upper limb14, and our primary goal is to investigate feasibility in a neurotypical population and obtain preliminary evidence on accuracy.

Methods

Each participant performed the activity five times during a single session located at a clinical research laboratory inside a freestanding inpatient rehabilitation facility. The study was advertised via flyers posted across campus of a nearby major academic institution. In total, 10 participants were recruited from a sample of convenience, and all participants completed the data collection. Inclusion criteria were age between 18 and 80 years old, self-assessed absence of neurologic and musculoskeletal conditions affecting the upper limb, and ability to perform a simple reaching activity (e.g. drinking water from a cup). Participant demographics are shown in Table 1. As a primary outcome, this pilot study examined feasibility of CV to obtain upper-limb kinematic metrics during the drinking task, which have been recommended for standardized use in post-stroke rehabilitation research14. As a secondary outcome, this pilot study sought to collect preliminary data about the accuracy of CV in this application as well as collect estimates on the intraclass correlation within repeated task performances by participants. The protocol for data collection was reviewed and approved by the University of Kentucky Institutional Review Board (IRB #63176), and informed written consent was obtained from all participants prior to data collection.

Table 1 Participant demographics

Drinking task kinematics

The drinking task activity has been previously protocolized and involves the following five phases of movement: reaching to grasp a cup of water, forward transport of cup to mouth, drinking, back transport of cup to table, and return to starting position11. The drinking task activity is depicted in Fig. 1. For the start position, participants were seated with arms at side, elbows bent and hands located in a pronated position such that the wrist crease coincided with the edge of the tabletop. Using a height-adjustable stool, the seat height was set such that elbows, hips and knees were flexed at approximately 90°. The cup was filled with 100 mL of water, and its starting location was midline in front of the participant in a 100 mm by 100 mm bounding box with centroid located 300 mm from the edge of the tabletop. Limited instruction was provided to participants so as to foster self-selected movements and reduce potential distraction to the participant’s usual behavior. Each participant was permitted up to two practice repetitions before the formal data acquisition. Based on motion capture data obtained during this drinking task, several kinematic metrics can be obtained. We focus primarily on three kinematic metrics: movement time (MT), number of movement units (NMU), and trunk displacement (TD). In prior work utilizing principal component analysis techniques, these kinematic metrics have been shown to substantially capture the variance in the drinking task activity, to have high discriminant ability, and to be valid compared to current clinical assessments for individuals with post-stroke upper limb impairment40,41.

Fig. 1

Drinking task activity. Progression of the drinking task activity through five phases as demonstrated via still images extracted from a single trial for one participant.

Computer vision system

To capture kinematics of the upper limb, our study employed a CV system developed in-house. The system can be described according to three factors: cameras, human pose estimation, and three-dimensional key point determination (“lifting”). A dual-camera approach was employed in our system, and each camera relied on a digital image sensor similar to that found in common smartphones. Each camera (Blackfly S, FLIR Systems, Oregon, USA) was fitted with an adjustable lens (Fujinon Varifocal Lens, Fujifilm, Tokyo, Japan). Camera resolution was 1280 × 1024 pixels and the capture rate was 60 frames per second. Cameras were located in front of the subject and oriented with a downward oblique perspective so as to optimize visibility of joint locations throughout the task. The distance between subject and cameras was about 1.2 m, and the cameras were spaced about 0.8 m apart. Figure 2 shows the laboratory setup for simultaneous collection of upper limb kinematic data using a CV system and a marker-based motion capture system. As this simultaneous collection was crucial to our validation approach, subjects wore sleeveless shirts throughout the study to accommodate the marker-based motion capture system.

Fig. 2

Laboratory setup for data collection. Featuring a 5-camera motion capture system and a dual camera computer vision system, simultaneous recordings of the drinking task activity were captured.

A calibration step was used to extract the extrinsic and intrinsic parameters of each camera: the former describe where the physical camera is in space and the latter describe how pixels of the digital camera image map to the real world. Human pose estimation involves the identification of key points of the human body by applying advanced machine learning solutions to a two-dimensional digital image. Several of these machine learning solutions are now available in open-source format22,24,25. We have applied one of these to detect key upper limb landmark locations including bilateral shoulders, elbows, and wrists23. Lastly, to arrive at three-dimensional coordinates of key points, a lifting procedure was performed, which was achieved in our multi-camera setup using a direct linear transformation method42. Additional detailed explanation about our CV system is beyond the scope of the current manuscript, which aims to explore feasibility of this system for clinically relevant kinematic assessment.
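
To illustrate the lifting step, the following minimal sketch triangulates one key point from two calibrated views. It is not the authors' implementation: it uses the inhomogeneous least-squares variant of linear triangulation, solved through the normal equations, and the function names (`project`, `triangulate`) and the 3 × 4 camera matrices passed in are hypothetical placeholders for the output of a calibration step.

```python
def project(P, X):
    """Project 3D point X with a 3x4 camera matrix P; returns pixel (u, v)."""
    h = [sum(P[r][k] * c for k, c in enumerate((*X, 1.0))) for r in range(3)]
    return h[0] / h[2], h[1] / h[2]


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(M, y):
    """Solve the 3x3 linear system M x = y with Cramer's rule."""
    d = det3(M)
    solution = []
    for c in range(3):
        Mc = [[y[i] if j == c else M[i][j] for j in range(3)] for i in range(3)]
        solution.append(det3(Mc) / d)
    return solution


def triangulate(projections, pixels):
    """Least-squares 3D point from two or more (camera matrix, pixel) pairs.

    Each view contributes two linear equations in (x, y, z):
        (u * P[2] - P[0]) . [x, y, z, 1] = 0
        (v * P[2] - P[1]) . [x, y, z, 1] = 0
    """
    rows, rhs = [], []
    for P, (u, v) in zip(projections, pixels):
        for coord, axis in ((u, 0), (v, 1)):
            r = [coord * P[2][k] - P[axis][k] for k in range(4)]
            rows.append(r[:3])
            rhs.append(-r[3])
    # Normal equations (A^T A) X = A^T b for the stacked system A X = b.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    y = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve3(M, y)
```

With noise-free pixel coordinates the sketch recovers the original point; with real pose estimates, each view's detection error propagates into the triangulated position, which is one reason calibration quality and camera spacing matter.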

Marker-based motion capture system

To collect preliminary data on accuracy, kinematic assessment of the upper limb was captured using a conventional marker-based motion capture system (Qualisys AB, Gothenburg, Sweden), which was implemented simultaneously with the CV system. This marker-based system comprised five optoelectronic cameras with infrared sensors to track reflective markers applied to the participants using double-sided tape. The marker setup duplicated those described in prior literature, which have been developed for clinical use and involve 10 reflective markers placed on landmarks of limbs and thorax as well as 2 reflective markers on the cup11. Landmarks included, from distal to proximal: 3rd metacarpophalangeal joint (bilateral), ulnar styloid (bilateral), lateral epicondyle (bilateral), midway of the acromion (bilateral), sternal notch, and midline between the eyes in line with supraorbital ridge.

Post processing

For both CV and marker-based motion capture systems, the raw data comprised a time series of three-dimensional data corresponding to the position of the participant’s body and limbs during the drinking task. The origin and three-dimensional coordinate axes were defined as shown in Fig. 2. For the marker-based motion capture system, this raw position data was post-processed in a Matlab environment (Matlab 2022b, Natick, Massachusetts, USA) using a 2nd order, zero-phase lag Butterworth low-pass filter with 6 Hz cutoff frequency11,43. For the CV system, this raw position data was post-processed in a Matlab environment using a bespoke filter optimized for each kinematic metric. A variety of filters for biomechanical signal filtering have been previously described in literature including Butterworth, Kalman, Moving Average, and Savitzky-Golay44,45. For the NMU metric and MT metric, a Kalman filter was applied in accordance with published implementation steps46. Specifically, the state vector was composed of position and velocity for each axis (i.e., state_x = [p_x, v_x], where p_x and v_x represent the position and velocity values on the x-axis), the control vector was set as [1.5, 1.5], and both the process and the measurement noise were set to be zero-mean Gaussian with covariance of 1000^2*[1/60, (1/60)^(3/2); (1/60)^(3/2), (1/60)^2] and 60^2*[1, 1; 1, 1], respectively. For the TD metric, a Moving Average filter was applied to the raw position data with a 40-count window size. Temporal synchronization of the systems was performed using movement initiation of the upper limb in the y-axis (parallel to tabletop edge) as the synchronizing event. To align reference frames of each system, a transformation was applied based on an optimized rotation matrix calculated per the Kabsch method47.
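
As an illustration of the simplest of these filters, the sketch below applies a moving average with a 40-sample window to one coordinate of a position time series. The text does not specify how the window was aligned or how the edges were handled, so the centered window that shrinks near the ends of the signal is an assumption, and `moving_average` is a hypothetical name rather than the authors' routine.

```python
def moving_average(samples, window=40):
    """Moving average over one coordinate of a position time series.

    Assumption: the window is centered on each sample and truncated
    at the start and end of the signal instead of being zero-padded.
    """
    n = len(samples)
    # Prefix sums let each window mean be computed in constant time.
    prefix = [0.0]
    for s in samples:
        prefix.append(prefix[-1] + s)
    half = window // 2
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + window - half)  # exclusive upper bound
        smoothed.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return smoothed
```

At 60 frames per second a 40-sample window spans roughly two thirds of a second, which suppresses frame-to-frame jitter but would also blur fast movements; that trade-off is more tolerable for slow trunk motion than for the hand velocity profile.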

Calculation of the drinking task kinematic metrics (MT, NMU, TD) was completed using Matlab routines custom built according to prior literature descriptions11. For the NMU metric, this measure of movement smoothness is based on the definition of a movement unit. For the drinking task activity, a movement unit is a local minimum in the hand velocity profile followed by a local maximum, which represents an instance of hand acceleration and deceleration. For the MT metric, this measure is based on detecting movement start and stop times, which have previously been defined as when hand velocity exceeds or falls below 2% of peak velocity, respectively. In CV, this detection of movement start and stop can be adversely impacted by signal artefact known as “jitter”48, and to reduce this artefact in data from our CV system, the velocity threshold for start/stop detection was increased from 2 to 5.5%. For the TD metric, this is based on the truncal lean a participant demonstrates while performing the drinking task, which is measurable based on tracking position of the participant’s chest.
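
The threshold-based definitions above can be sketched as follows. This is a simplified illustration, not the authors' Matlab routines: the function names are hypothetical, the speed is computed by a plain finite difference, and the `min_rise` parameter is an assumed stand-in for any amplitude criterion a full protocol might apply when counting movement units.

```python
import math


def speed_profile(positions, fs=60.0):
    """Tangential speed from successive 3D positions sampled at fs Hz."""
    return [math.dist(a, b) * fs for a, b in zip(positions, positions[1:])]


def movement_window(speed, fraction=0.02):
    """First and last sample indices whose speed exceeds fraction * peak."""
    threshold = fraction * max(speed)
    above = [i for i, s in enumerate(speed) if s > threshold]
    return above[0], above[-1]


def movement_time(speed, fs=60.0, fraction=0.02):
    """Duration in seconds between movement start and stop."""
    start, stop = movement_window(speed, fraction)
    return (stop - start + 1) / fs


def count_movement_units(speed, min_rise=0.0):
    """Count local-minimum-to-local-maximum rises in the speed profile.

    min_rise is a hypothetical amplitude criterion; with the default
    of 0.0 every rise from a trough to the next peak is counted.
    """
    units = 0
    trough = speed[0]  # treat the first sample as the initial trough
    for prev, cur, nxt in zip(speed, speed[1:], speed[2:]):
        if cur <= prev and cur < nxt:  # local minimum
            trough = cur
        elif cur > prev and cur >= nxt and cur - trough > min_rise:
            units += 1  # local maximum following the trough
    return units
```

Under these assumptions, `fraction=0.02` corresponds to the threshold described for the marker-based data and `fraction=0.055` to the raised threshold used for the CV data; in practice the unit count would be taken over the samples inside the movement window.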

To facilitate the comparative analysis, the kinematic metrics calculated for each system were based on anatomic consistency between surface markers and key points. For example, the NMU metric has been defined according to a hand velocity profile created using a surface marker adhered to the third metacarpophalangeal joint. As the CV system in this study did not identify finger key points, the NMU metric for the marker-based motion capture and CV system was calculated as a function of the ulnar styloid marker and wrist key point, respectively. Similarly, the MT metric was also based on use of the ulnar styloid marker and wrist key point. Lastly, as the CV system did not identify a sternal key point, a midpoint between shoulders was used to determine the TD metric for both systems.
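
The shoulder-midpoint surrogate for the chest can be sketched as below. The exact TD formula is not given in the text, so taking the maximum distance of the midpoint from its starting position is an assumption, and `trunk_displacement` is a hypothetical name.

```python
import math


def midpoint(a, b):
    """Midpoint of two 3D points."""
    return tuple((x + y) / 2 for x, y in zip(a, b))


def trunk_displacement(left_shoulder, right_shoulder):
    """Maximum displacement of the shoulder midpoint from its start.

    Inputs are time-ordered sequences of 3D shoulder positions.
    Assumption: TD is the largest Euclidean distance of the midpoint
    from its position in the first frame.
    """
    mids = [midpoint(l, r) for l, r in zip(left_shoulder, right_shoulder)]
    return max(math.dist(mids[0], m) for m in mids)
```

Using the same surrogate point for both systems keeps the comparison anatomically consistent, even though the surface markers sit on the acromion and the CV key points approximate the joint centers.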

Statistical analysis

To evaluate accuracy of the joint position measured by the CV system, the three-dimensional joint position data, as defined in the global coordinate space, was compared against that of the synchronized marker-based motion capture system using a root mean square error (RMSE). As the drinking task activity involved movement of the dominant right upper limb for all participants, the RMSE data was grouped according to the major key points of the right upper limb, i.e. shoulder, elbow, and wrist. To compare RMSE among these major key points, comparisons were calculated using a difference of least square means, and a p-value less than 0.05 was considered statistically significant.
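
A minimal sketch of the per-joint RMSE is given below. The text does not state whether the error was computed per axis or on the three-dimensional distance, so the use of the Euclidean distance between paired samples is an assumption, and `position_rmse` is a hypothetical name.

```python
import math


def position_rmse(estimated, reference):
    """RMSE between two time-synchronized 3D trajectories.

    Assumption: the per-frame error is the Euclidean distance between
    the CV key point and the corresponding marker position.
    """
    squared = [math.dist(e, r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))
```

The function would be called once per joint and trial, with both trajectories already aligned in time and expressed in the same reference frame.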

To evaluate accuracy of the kinematic metrics of the drinking task, the metrics obtained from the CV system were compared against those obtained from the marker-based motion capture system. The kinematic metrics between the two systems were modeled and tested using repeated measures analysis of variance using a compound symmetry correlation structure. In addition, as a measure of accuracy, the error for each trial was calculated as the CV system value minus the marker-based motion capture system value. For mean comparisons, p-values less than 0.05 were considered significant, and given the preliminary nature of this study, no correction for multiple comparisons was implemented.

To visually compare the CV system versus the marker-based motion capture system, Bland–Altman plots were constructed for each kinematic metric, which included a full complement of 95% confidence intervals with necessary correction methods due to multiple observations per individual49. In addition, to improve sample size estimation in future studies, the intraclass correlation coefficient was calculated from the repeated measures analyses for each kinematic metric.
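
For orientation, the basic Bland–Altman quantities can be sketched as follows. This simplified version treats every paired trial as independent and therefore omits the correction for multiple observations per individual that the analysis describes; `bland_altman` is a hypothetical name.

```python
import statistics


def bland_altman(cv_values, reference_values):
    """Return (bias, lower limit, upper limit) of agreement.

    Simplification: paired trials are treated as independent, so the
    limits are bias +/- 1.96 sample standard deviations of the
    differences, without any repeated-measures correction.
    """
    diffs = [c - r for c, r in zip(cv_values, reference_values)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

A bias whose confidence interval contains zero indicates no systematic offset between the systems, while the width of the limits of agreement reflects the random error of individual trials.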

Results

Feasibility

A total of 10 participants were recruited and all participants successfully completed the study protocol. Each data collection session was completed during a morning or afternoon based on convenience to participants. The duration of each session was less than two hours, which included calibration of equipment by research personnel and performance of 5 trials of the drinking task activity by the participant. There were no adverse events during sessions. The raw data from the CV system and marker-based motion capture system was successfully post-processed for all trials across all participants. The desired kinematic metrics of the drinking task activity were achieved from each system, and an illustration of these metrics is shown for a single participant trial in Fig. 3.

Fig. 3

Data acquisition. For each participant, time-series of 3D joint position data during the drinking task was acquired using both the computer vision system (CV) and a gold-standard marker-based motion capture system (MB-MoCap) (panel A). By post-processing this joint position data, relevant kinematic metrics could be obtained including a metric to quantify movement quality (panel B) and metrics to quantify movement compensation (panel C and panel D).

Accuracy

The accuracy of joint position was determined for the right shoulder, right elbow, and right wrist. Based on all trials across all participants, the average RMSE for the right shoulder was 52.3 ± 12.0 mm. The average RMSE for the right elbow was 80.2 ± 14.8 mm, and the average RMSE for the right wrist was 60.9 ± 10.6 mm. Comparing between these joint locations, the right elbow RMSE was significantly higher than both the right shoulder and the right wrist (p’s = 0.0002 & 0.0035, respectively). No statistically significant difference was observed between RMSE of the right shoulder and right wrist, p = 0.11.

The accuracy of kinematic metrics obtained from the CV system was determined by comparing measures for the right upper limb with the same measures obtained from the synchronized marker-based motion capture system. Comparison is illustrated in Fig. 4. For the NMU metric obtained by CV and marker-based motion capture, the mean values were 4.42 and 4.54 units, respectively. The mean values for the TD metric were 33.63 mm and 30.19 mm, respectively. Finally, the mean values for the MT metric were 6.78 s and 6.63 s, respectively. For all kinematic measures, none of the mean values were significantly different between the CV and marker-based motion capture systems (p’s > 0.23). Across all participants and trials, the mean errors for the NMU metric, TD metric, and MT metric were −0.12 units (95% CI −0.38, 0.14), 3.4 mm (95% CI −0.12, 7.01), and 0.15 s (95% CI −0.06, 0.36), respectively.

Fig. 4

Comparing methods for measuring post-stroke upper limb kinematics. Comparison of mean kinematic metrics reveals no significant difference between the computer vision system (CV) and marker-based motion capture system (MB-MoCap).

Using Bland–Altman analysis, no significant bias was found for any of the kinematic metrics based on presence of the line of equality within the mean difference confidence intervals for each metric (see Fig. 5). For the NMU metric, the limits of agreement ranged from −1.93 (95% CI [−2.38, −1.48]) to 1.69 (95% CI [1.24, 2.14]). For the TD metric, the limits of agreement ranged from −21.66 mm (95% CI [−27.96, −15.35]) to 28.54 mm (95% CI [22.24, 34.85]). For the MT metric, the limits of agreement ranged from −1.33 s (95% CI [−1.70, −0.96]) to 1.63 s (95% CI [1.26, 2.00]). The intraclass correlation coefficients (ICCs) varied across the kinematic measures with the CV system having lower ICCs except for NMU. For NMU, the ICCs were relatively small (0.21 for the CV system and 0.11 for the marker-based motion capture system). However, for TD, the ICCs were quite consistent at 0.36 and 0.40, and for the MT, the marker-based motion capture system had a much higher ICC of 0.63 while the CV system was almost half at 0.34.

Fig. 5

Bland–Altman analysis. For each kinematic metric, Bland–Altman plots provide a comparison of potential bias in computer vision system, which is represented by comparison of the mean difference line (solid red line) to the line of equality (solid blue line). Additionally, the random error of the data is illustrated by the limits of agreement (dotted green lines) and confidence intervals for both mean difference and limits of agreement are shown in shaded bands (red band and green band, respectively). Due to considerable overlapping of data points when plotting NMU (left), a “jitter plot” option has been applied to reveal the individual data points.

Discussion

This pilot study investigated the application of CV technology for measuring post-stroke kinematic metrics of the upper limb that have been recommended for standardized use14. Our primary objective was to determine feasibility of this approach among a sample of adult neurotypical participants, and our secondary objective was to assess accuracy of this approach. As evidence of feasibility, the data collection protocol was well-tolerated by all participants, and joint position data was successfully extracted by a CV system for all trials attempted. Furthermore, for all trials across all participants, three kinematic metrics for the post-stroke upper limb were successfully obtained. Based on comparison with a synchronized, gold-standard marker-based motion capture system, preliminary evidence suggests no significant difference between the kinematic measurements by CV and those by the gold-standard system.

As with any assessment during stroke rehabilitation, the clinical uptake of kinematic assessment depends on a balance between acceptability and accuracy50. Conventional marker-based approaches have often served as a gold standard approach to acquire kinematics51, but this approach presents obvious challenges to clinical acceptability including high expense and burdensome marker placement15. Electromagnetic systems represent a potentially portable option with adequate accuracy to measure large dynamic movements in a single reference plane52. These systems do require specialized hardware susceptible to electromagnetic interference, and lower sampling rates may be problematic to high frequency human movements53. Wearable sensors offer a relatively affordable and highly mobile solution capable of measuring upper limb joint angles with an RMSE less than 7°54,55. While price and miniaturization continue to progress, wearable sensors inherently require placement of and maintenance of physical devices on an individual, which is not a trivial issue when considering neurodivergent populations or when considering small anatomic landmarks such as the hand joints during reach-to-grasp movements.

By eliminating the need for physically worn devices, CV technology offers inherent benefits to acceptability and studies have explored potential tradeoffs in accuracy. Several studies have examined the accuracy of CV systems that combine cameras and depth sensors in a single device (i.e. RGB-D cameras). Using a single such device to measure the lower limb compared to a marker-based system, the average absolute difference in hip and knee flexion among healthy participants was 4.3 and 1.4°, respectively56. In a similar study comparing an RGB-D camera versus an electrogoniometer, the sagittal-plane hip and knee angles for healthy participants revealed an RMSD of 1.76 and 2.04°, respectively57. When measuring the upper limbs during a lifting and truncal lean activity, an RGB-D camera was found to have an RMSE of 27 and 47 mm, respectively, as compared to a gold standard. In a study measuring joints of the hand, a single healthy participant performed hand spreading and pincer grip activities; compared to a marker-based system, the RGB-D camera demonstrated an average absolute deviation of 2.4, 4.8, and 4.8° at the MCP, PIP, and DIP, respectively58. In a recent preliminary study using a single device during drinking task activity, the RMSE for elbow flexion measured in a single healthy participant was 16.9°59.

With advances in human pose estimation, kinematic assessment can be done with more basic cameras akin to those found in consumer webcams and smartphone cameras. Studies have explored the accuracy of CV systems that use such modest cameras. When evaluating treadmill walking in healthy participants with multiple monochrome cameras, kinematic metrics in sagittal and frontal planes were comparable to that of a marker-based system60. In a similar study of overground walking, the joint locations of the upper and lower limbs were measured with a root mean square difference (RMSD) up to 24 mm and 36 mm, respectively, and the RMSD for joint angles ranged from a minimum 2.6° (hip flexion/extension) to maximum 13.2° (knee internal/external rotation)16. In studies of more diverse activities (jumping, ball throwing), a collection of modest cameras successfully measured joint locations with a mean absolute error ranging from 25 to 67 mm (upper limbs) and from 25 to 42 mm (lower limbs)17. In a related study measuring joint angles during functional mobility activities (stepping down, run/cut), an 8-camera setup measured ankle and knee flexion in healthy participants with RMSD < 6°19.

In our study of a modest dual-camera setup, errors in joint position were comparable to the aforementioned studies with the exception of slightly increased error in elbow joint position. There is a likely explanation for the increased error in our study: the definition of joint position. For our experimental CV approach, the joint position is based on a joint center approximated by human pose estimation solutions. That is, joint position of the shoulder is based on an estimated center of the glenohumeral joint. For our gold standard approach, we implemented a well-protocolized, clinically oriented setup for marker-based motion capture, which relies on a limited number of reflective surface markers adhered to the participant’s skin. In this case, joint position of the shoulder is based on a superficial marker placed near the midpoint of the lateral acromion. The surface marker position was used in our validation setup as a surrogate for joint center position, which inherently introduces an offset into our error calculation. While it is possible to use multiple surface markers and biomechanical modeling software to model the joint center position61, this was not applicable to our validation setup.

For joint position of the elbow, our CV approach revealed less accuracy compared to joint position of the shoulder and wrist. This is consistent with prior studies in which participants performed other activities of the upper limb such as arm swing during walking16. This suggests a potentially important phenomenon specific to the elbow, which may be activity-agnostic. A plausible explanation is labeling error in the training data for human pose estimation solutions48. Regardless of the cause, knowledge about joint-specific trends in error may be helpful as future kinematic metrics are developed. Metrics that minimize reliance on more error-prone joint positions may be prioritized. For example, MT and NMU depend solely on the wrist joint position and are independent of the elbow joint position.

Regarding errors in the kinematic metrics (NMU, TD, MT) of the drinking task, the authors are unaware of previous literature that has compared a CV approach to a gold standard approach. As mentioned above, the definition of joint position may contribute error to the drinking task kinematics. In addition, surface markers on the skin introduce known artifacts due to movement of the skin over underlying bony structures, and these skin artifacts are known to contribute errors to marker-based kinematic metrics62. Thus, skin artifacts in our gold-standard comparator may contribute to our error calculations.

While direct comparison to other studies is limited, the errors in kinematic metrics can be clinically interpreted based on foundational studies of the drinking task. Namely, in a series of marker-based motion capture studies by Alt Murphy, the discriminant properties and clinical correlates of drinking task kinematics have been determined40,43. Considering metrics of compensation, participants with mild-moderate stroke demonstrated a TD approximately 50 mm more than a cohort of healthy controls, which far exceeds the mean difference and RMSE determined in our study. Similarly, duration of the drinking task (MT) for individuals affected by stroke is approximately 4.9 s more than for individuals without stroke, which again far exceeds the mean difference and RMSE for MT as measured by our CV approach. Based on correlations with the Action Research Arm Test, real clinical improvement in TD, MT, and NMU has been quantified as changes of 20–50 mm, 2.5–5 s, and 3–7 units, respectively, which exceed the error of the CV approach in the present study43.

Previous studies with CV have suggested potential bias when compared to marker-based approaches60. This pilot study, however, revealed no significant bias in the Bland–Altman analysis. Of note, the difference for the MT metric tended to increase with the magnitude of the mean value, which suggests heteroskedasticity. Although between-session reliability could not be assessed in the current pilot study, prior studies in gait kinematics suggest that CV may be more reliable between sessions than marker-based motion capture63. The lower reliability of marker-based approaches has historically been attributed to variations in marker placement, which may depend on the anatomic knowledge and experience of the assessor.
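To make the agreement analysis discussed above concrete, the following is a minimal sketch of a Bland–Altman calculation and not the analysis code used in this study. It computes the bias (mean difference), the 95% limits of agreement, and two simple trend checks: the slope of the differences against the pairwise means, and the slope of the absolute residuals against the means as a rough indicator of heteroskedasticity. The function names, the 1.96 multiplier, the two trend checks, and the example MT values are all illustrative assumptions.

```python
from statistics import mean, stdev


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def bland_altman(cv_values, ref_values):
    """Agreement between a CV-derived metric and a reference metric.

    Hypothetical helper: returns the bias, the 95% limits of agreement,
    and two slopes used as simple trend checks.
    """
    diffs = [c - r for c, r in zip(cv_values, ref_values)]
    means = [(c + r) / 2.0 for c, r in zip(cv_values, ref_values)]

    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences

    # Trend of the differences across the measurement range.
    trend_slope = ols_slope(means, diffs)
    intercept = bias - trend_slope * mean(means)

    # Trend of the scatter: do absolute residuals grow with magnitude?
    abs_resid = [abs(d - (intercept + trend_slope * m))
                 for d, m in zip(diffs, means)]
    spread_slope = ols_slope(means, abs_resid)

    return {
        "bias": bias,
        "loa": (bias - 1.96 * sd, bias + 1.96 * sd),
        "trend_slope": trend_slope,
        "spread_slope": spread_slope,
    }


# Made-up movement times in seconds, for illustration only.
mt_marker = [5.1, 5.8, 6.4, 7.0, 7.9, 8.6]
mt_cv = [5.2, 5.8, 6.6, 7.1, 8.3, 8.9]
print(bland_altman(mt_cv, mt_marker))
```

A slope near zero in both checks would be consistent with differences that are stable across the measurement range. A clearly positive slope would warrant closer inspection, for example with log-transformed values or regression-based limits of agreement.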

Future opportunities

There are several opportunities to build upon the data acquisition protocol of our preliminary validation study. As mentioned, a marker-based approach is often considered a gold standard for kinematic assessment, and by increasing the number of markers, biomechanical modeling can be used to compensate for the drawbacks of surface markers. Additionally, line-of-sight is a common challenge to both marker-based and CV approaches. A marker-based system becomes ineffective when markers are hidden from camera view, such as might happen when loose clothing shifts or when limbs rotate from view, e.g. a dorsal wrist marker that escapes sight when the forearm supinates. While a CV approach is able to handle visibility issues that might arise from loose-fitting garments, its accuracy is likely to benefit from form-fitting clothing. For research purposes, line-of-sight issues can be mitigated in both approaches by increasing the number of cameras, by improving calibration, and by employing cameras with higher resolution and frame rate. However, evaluators should be mindful of the balance among expense, complexity, accuracy, and acceptability. Regarding synchronization of the validation setup, a manual approach was utilized in our study, and automated synchronization would likely improve the efficiency of data collection.

Beyond data acquisition, there are also several opportunities to advance the signal processing methods of our validation study. We utilized a single solution for human pose estimation, and by exploring alternative solutions, the accuracy and performance of the CV system may be enhanced. For signal filtering, several options exist, as demonstrated by the bespoke filters we applied for the different kinematic metrics. One option described in the literature is rigid body filtering, which employs biomechanical modeling with scaled virtual skeletons and may mitigate signal artifacts (e.g. “jitter”) by leveraging the existence of anatomic constraints during data post-processing18.

Limitations

There are important limitations to this pilot study. Our data acquisition focused on a small sample of neurotypical participants. While important groundwork is laid, the generalizability of our findings to individuals with a history of stroke is consequently limited. For example, group-level accuracy estimates of the CV system would be expected to differ in a sample of stroke survivors, given the wider movement variation between trials (e.g. wider variety of movement unit patterns, wider extent of trunk displacement). To improve generalizability, future studies will benefit from increased sample size, inclusion of neurodivergent populations, and consideration of more diverse demographics (e.g. age, laterality of hand dominance). In our pilot study, we considered only a single session of data acquisition with each participant, which limits our analysis of repeatability. For our gold standard approach, we chose a reduced marker setup (e.g. single marker on sternal notch) to replicate prior study protocols on the drinking task, but this limits the available kinematic information (e.g. trunk rotation is not measured) and limits more sophisticated biomechanical modeling (e.g. inverse kinematics). Likewise, our CV system identifies only a limited number of key points and excludes many other landmarks such as the digits of the hand. Although the digits were less important for the kinematic metrics of interest in our pilot study (e.g. NMU, MT, TD), these landmarks are an important future target given that the manipulation of objects is prevalent throughout daily life. Fortunately, these landmarks can be obtained depending on the machine learning solution utilized for human pose estimation. Lastly, the drinking task is a single reach-to-grasp activity, which represents a limited view of a person’s activities of daily living. Future studies may consider a wider spectrum of upper limb activities of daily life as well as activities that are personalized according to each individual’s interests and values.

Conclusion

Based on a pilot study in neurotypical participants, computer vision is a feasible method for measuring kinematic metrics that have been recommended for standardized use in rehabilitation research involving the post-stroke upper limb. Future research is needed to investigate the validity of this technology in people affected by stroke.