Introduction

Musculoskeletal disorders rank second among contributors to years lived with disability worldwide [34]. Within the spectrum of musculoskeletal disorders, neck pain is a very common condition with a high age-standardized lifetime prevalence of 66.7% [5] and 12-month prevalence rates varying from 20 to 40% [9].

People with neck pain commonly present with altered physical function including neck muscle weakness [14, 15, 37]. Neck muscle strength training is known to be an effective intervention for patients with neck pain [1, 8], and an association exists between the extent of the reduction in neck pain and disability and an increase in neck strength following neck strengthening in people with chronic neck pain (CNP) [36]. The measurement of neck strength is therefore relevant to determine the presence of neck muscle weakness and to monitor strength changes over time; like other objective markers, it serves as an important indicator throughout the course of rehabilitation [16].

Numerous methods have been used to evaluate neck strength, including manual muscle testing [24], hand-held dynamometry [32], strain-gauge dynamometry [12], isometric [35] and isokinetic tests [2], and specialized equipment such as the multi-cervical unit [4]. It is imperative that clinicians use performance-based outcome measures (PBOM) that meet certain benchmarks for measurement properties to ensure the highest clinical accuracy [7]. The COSMIN initiative (COnsensus-based Standards for the selection of health Measurement INstruments) has standardized the terminology and taxonomy of relevant measurement properties for instrument evaluation through a consensus-based approach [21,22,23]; these properties are reliability, validity, and responsiveness.

A systematic review conducted by de Koning et al. [6] evaluated clinimetric properties of tests of neck muscle functioning in patients with neck pain. However, it primarily focused on the measurement properties of measures for neck muscle endurance. The review highlighted the lack of portable neck strength assessment tools that can examine neck strength in a reliable manner. More recently, Selistre et al. [29] conducted a systematic review exploring clinical tests utilized to measure neck muscle strength or endurance in participants with non-specific CNP or asymptomatic participants. However, the authors only included tests that could be performed within a maximum of 5 min and involved equipment with a maximum cost of €1000, which limited the number of tests considered. Thus, the review was not able to provide an overview of all methods currently tested for their measurement properties for the assessment of neck strength in people with CNP; it is relevant to understand how the measurement properties (e.g., reliability) of low-cost approaches compare to those of a ‘gold-standard’ (i.e., isokinetic dynamometer). Additionally, the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach was not adopted to examine the overall quality of evidence regarding the measurement properties, yet this is an important process to appreciate the trustworthiness of summarized results.

Thus, in the current systematic review, we aimed to appraise the psychometric properties of various neck strength outcome measures (without limits on the duration of testing or cost of the equipment) and establish their appropriateness for the evaluation of neck strength in patients with chronic neck pain based on their measurement properties. This rigorous systematic review applied the COSMIN Risk of Bias checklist, and the study results were rated against the COSMIN criteria for good measurement properties. Additionally, the GRADE approach was used to draw conclusions on the overall strength of the evidence.

Methodology

The reporting of this systematic review adheres to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist [13]. The review was designed based on the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) methodology [19,20,21,22,23]. A registered summary of this protocol is available on PROSPERO (CRD42021233290).

Eligibility criteria

Inclusion criteria

For studies to be included in this systematic review, they were required to meet the following eligibility criteria: (1) Target population: studies with adult participants (>18 years) who were experiencing CNP of either non-traumatic or traumatic origin; (2) Outcome measure: studies investigating PBOM of neck strength (manual, mechanical, and functional techniques); (3) Setting: studies that evaluated the measurement properties of PBOM of neck strength in a laboratory, clinical, or field-based environment; and (4) Measurement properties: studies that evaluated one or more clinimetric properties of PBOM based on the COSMIN taxonomy (e.g., reliability, validity, and responsiveness) [21,22,23].

Exclusion criteria

Studies were excluded according to the following criteria: (1) Language: studies published in a language other than English, owing to limited translation capacity; (2) Article type: conference abstracts, articles without full-text availability, or systematic review articles; and (3) Study demographic: studies that evaluated only asymptomatic participants.

Information sources and search strategy

A comprehensive literature search was conducted using medical subject headings and free text, and relevant keywords were identified during scoping searches. MEDLINE (OVID interface), CINAHL, SPORTDiscus (EBSCO interface), EMBASE (OVID interface), and Web of Science were electronically searched from inception until 21 June 2021 to maximize literature coverage, as per Cochrane Collaboration recommendations [10]. To identify additional literature, the reference lists of relevant articles were hand searched. Gray literature and conference papers were searched to reduce potential publication bias.

The search strategy was established with the MEDLINE database, and changes and adaptations were made when undergoing search processes in other databases. The search strategy used in MEDLINE (OVID interface) is reported in Additional file 1: Appendix 1. Specific search terms included keywords and Medical Subject Headings (MeSH) terms related to the neck region, muscle strength, and psychometric properties, e.g., reliability, validity, and responsiveness. Terms describing demographics of interest were also included. In addition, relevant search filters constructed by COSMIN for the purpose of identifying appropriate studies on measurement properties were used [27].

Study selection

The first reviewer [JT] performed an extensive electronic search of the aforementioned databases. All search results were recorded and exported to EndNote Version X9 (Clarivate Analytics) for abstract and full-text storage. This enabled duplicate records to be identified and removed.

Based on the eligibility criteria established, two reviewers [JT, DA] independently screened study titles and abstracts and assigned studies to one of three categories: “include,” “exclude,” or “unsure” [17]. Each reviewer then independently read the full texts of studies categorised as “unsure” and assessed them against the eligibility criteria [10]. The study authors were contacted via email if additional information was needed. Any disagreement regarding study eligibility was resolved either by consensus or by involving the third reviewer [DF]. The rationale for the exclusion of studies is reported in Fig. 1.

Fig. 1 PRISMA flow diagram summarizing the number of articles included at each stage of the review

Data collection process and data items

A standardized form was used to extract relevant data from each included study. The data collection form was piloted to ensure that all relevant information was captured. Both reviewers [JT, DA] independently used the standardized form to extract relevant data. The third reviewer [DF] was available to discuss any disagreements regarding extracted data until agreement was reached. Additional file 1: Appendix 2 outlines the extraction form used for the included studies.

Risk of bias in individual studies

The COSMIN Risk of Bias checklist was used to assess the risk of bias in the included studies; the original COSMIN tool has demonstrated a high level of inter-rater agreement [18,19,20,21,22,23]. The checklist comprises standards for design requirements and preferred statistical methods of studies on measurement properties, with ten COSMIN boxes covering benchmarks for PROM development and for nine aspects of measurement properties (reliability, validity, and responsiveness) [27]. The two reviewers individually rated each outcome measure as very good, adequate, doubtful, or inadequate quality [27]. Disagreements were resolved between the reviewers, and the third reviewer was available to intervene if required to reach consensus.

Data synthesis

The characteristics of the included studies were heterogeneous (study demographics, methodological design, outcome measures, and statistical design). As a result, it was not possible to carry out a meta-analysis, and a narrative synthesis was conducted instead. The narrative synthesis was completed in accordance with the COSMIN guidelines for systematic reviews [27]. Results of the included studies per measurement property, per outcome measure, and per test direction were quantitatively pooled and evaluated against the COSMIN criteria for good measurement properties to establish whether the measurement property was sufficient (+), insufficient (−), inconsistent (±), or indeterminate (?) [27].
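The rating step described above can be illustrated with a minimal sketch. It assumes the commonly cited COSMIN reliability criterion (an ICC or weighted kappa of at least 0.70 is sufficient, a lower value is insufficient, and an unreported value is indeterminate) and a 75% agreement rule for combining results across studies. Both thresholds, the function names, and the example values are assumptions made for illustration; they are not taken from the included studies, and the ratings reported in this review also depended on factors that this sketch ignores.

```python
from typing import Optional, Sequence

SUFFICIENT, INSUFFICIENT, INCONSISTENT, INDETERMINATE = "+", "−", "±", "?"

ICC_THRESHOLD = 0.70      # assumed cut-off for a sufficient reliability result
AGREEMENT_SHARE = 0.75    # assumed share of results needed for an overall rating


def rate_reliability(icc: Optional[float]) -> str:
    """Rate a single reliability result; None means no ICC/kappa was reported."""
    if icc is None:
        return INDETERMINATE
    return SUFFICIENT if icc >= ICC_THRESHOLD else INSUFFICIENT


def summarize(ratings: Sequence[str]) -> str:
    """Combine single-study ratings into one overall rating (simplified rule)."""
    rated = [r for r in ratings if r in (SUFFICIENT, INSUFFICIENT)]
    if not rated:
        return INDETERMINATE
    share_sufficient = rated.count(SUFFICIENT) / len(rated)
    if share_sufficient >= AGREEMENT_SHARE:
        return SUFFICIENT
    if share_sufficient <= 1 - AGREEMENT_SHARE:
        return INSUFFICIENT
    return INCONSISTENT


# Hypothetical ICC values for one measure and one test direction.
hypothetical_iccs = [0.82, 0.65, None]
single_ratings = [rate_reliability(v) for v in hypothetical_iccs]
overall = summarize(single_ratings)
```

In this hypothetical case one result is sufficient, one is insufficient, and one is indeterminate, so the overall rating is inconsistent.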

Quality of the evidence

A modified GRADE approach was adopted to examine the quality of evidence and the trustworthiness of summarized results [19, 20]. The grading of the quality of evidence was listed as high, moderate, low, or very low evidence. Following the COSMIN recommendation, four determinants of quality of evidence were used: (1) risk of bias (methodological quality of the studies), (2) inconsistency (unexplained inconsistency of results across studies), (3) imprecision (total sample size of the available studies), and (4) indirectness (evidence from different populations than the population of interest in the review). The fifth factor on the GRADE approach, publication bias, was not taken into account due to the lack of registries for studies on measurement properties [19, 20].
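The downgrading logic can be sketched as follows. The sketch assumes that the evidence starts at high quality and is lowered by one level per downgrade step across the four factors, and that imprecision is downgraded by one level for a total sample size of 50 to 100 and by two levels for a total below 50. These step sizes and thresholds follow commonly cited COSMIN guidance but are assumptions here, as are the function names; the number of steps for the other three factors is left to reviewer judgment and passed in directly.

```python
LEVELS = ["high", "moderate", "low", "very low"]


def imprecision_downgrade(total_n: int) -> int:
    """Downgrade steps for imprecision (assumed thresholds: <50 and 50-100)."""
    if total_n < 50:
        return 2
    if total_n <= 100:
        return 1
    return 0


def grade_quality(risk_of_bias: int = 0, inconsistency: int = 0,
                  imprecision: int = 0, indirectness: int = 0) -> str:
    """Start at 'high' and move down one level per downgrade step."""
    steps = risk_of_bias + inconsistency + imprecision + indirectness
    return LEVELS[min(steps, len(LEVELS) - 1)]


# Hypothetical example: one downgrade step for risk of bias and a total
# sample of 40 participants.
example = grade_quality(risk_of_bias=1, imprecision=imprecision_downgrade(40))
```

Under these assumptions, a single downgrade for risk of bias combined with a total sample below 50 is already enough to reach very low quality.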

Results

Study selection

Figure 1 summarizes the articles included at each stage of the review. A total of 794 articles were identified following searches on electronic databases. After duplicate studies were removed, 580 articles were screened at title and abstract stage, with 39 assessed at full-text stage. Finally, a total of 9 studies were included in this review.
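The counts above imply the number of records removed at each stage. The short calculation below derives them, assuming that no records from other sources (for example, hand searching) entered the flow between stages; the variable names are illustrative.

```python
# Counts reported in the text.
identified = 794   # records from electronic databases
screened = 580     # records screened at title and abstract stage
full_text = 39     # articles assessed at full-text stage
included = 9       # studies included in the review

# Derived counts (assumption: no records added from other sources).
duplicates_removed = identified - screened
excluded_at_screening = screened - full_text
excluded_at_full_text = full_text - included
```

On this assumption, 214 duplicates were removed, 541 records were excluded at title and abstract stage, and 30 articles were excluded at full-text stage.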

Study characteristics

Tables 1 and 2 present the study characteristics and results of the included 9 studies. One study [26] specifically investigated a population with Whiplash-Associated Disorder (WAD). The remaining 8 studies carried out investigations on chronic neck pain including one study which had a mixed patient group of WAD and non-specific chronic neck pain. All studies investigated reliability [3, 4, 11, 25, 26, 28, 30, 33, 37], one investigated validity [4], but no studies evaluated responsiveness. Neck strength measures evaluated were a handheld dynamometer (HHD) [3, 30], isometric dynamometer [25], strain gauge dynamometer (SGD) [11, 37], modified sphygmomanometer dynamometer (MSD) [33], multi-cervical unit (MCU) [4, 26], and multifunctional measurement unit [28]. The measurement procedures for the individual studies are presented in Additional file 1: Appendix 3.

Table 1 Study characteristics of included studies
Table 2 Summary of results of the included studies

Risk of bias and overall quality of evidence

Table 3 summarizes the risk of bias for individual studies categorised per neck strength outcome measure and measurement property. Overall, the risk of bias was rated as doubtful or inadequate for most reliability studies, with only one study [30] rated as adequate. The study evaluating validity [4] was rated doubtful. The overall quality of evidence was rated low or very low for the measurement properties of all neck strength measures.

Table 3 Summary of risk of bias, criteria for good measurement properties, and overall quality of evidence (GRADE)

Synthesis of results

Validity

None of the studies included in this review evaluated content validity or criterion validity, and only one study focused on construct validity [4]. Because there is no “gold standard” for measuring isometric neck strength, direct comparison could not be used to establish validity. Instead, a contrast-group comparison was used to compare mean isometric neck strength between people with and without neck pain. The risk of bias was rated as doubtful, and the result was rated as indeterminate against the COSMIN criteria for good measurement properties. Overall, this study yielded very low quality of evidence for the construct validity of isometric neck strength.

Reliability and measurement error

Handheld dynamometer

One study evaluated intra-rater reliability [3], and one evaluated inter-rater reliability of HHD [30]. Cibulka et al. [3] used a Microfet HHD (Hogan Health Industries, UT, USA), while Shahidi et al. [30] used an FPIX HHD (100 kg load cell, Wagner Instruments, CT, USA) for testing. One reported excellent intra-rater reliability [3], and the other reported acceptable inter-rater agreement across time [30]. The risk of bias was rated doubtful [3] and adequate [30] for intra- and inter-rater reliability, respectively. The intra-rater reliability study was rated sufficient [3] on the COSMIN criteria, while the other study was rated insufficient for inter-rater reliability [30]. Very low overall quality for both intra- and inter-rater reliability of HHD indicates very limited confidence in the reliability estimates within the CNP population.

The same studies investigated measurement error [3, 30], with results summarized in Table 2. For the risk of bias, one study was rated doubtful [3] and the other study was rated adequate [30]. Both studies were rated as indeterminate for the COSMIN criteria for good measurement properties. Moderate overall quality indicates moderate confidence in measurement error estimates of HHD to measure neck strength of people with CNP.

Isometric dynamometer

One study evaluated test-retest reliability of an isometric dynamometer [25] using a NeckMetrix dynamometer (UniQuest Pty Ltd., The University of Queensland, Australia) and concluded that reliability was good over two sessions of maximal voluntary isometric contraction measurement. The risk of bias was rated inadequate, and test-retest reliability was rated as sufficient on the COSMIN criteria for good measurement properties. Overall, very low-quality evidence indicates very little confidence in the reliability estimate of the isometric dynamometer within the CNP population.

Measurement error was evaluated in the same study [25]. The risk of bias was rated as inadequate, with the COSMIN criteria rated as indeterminate. The overall low quality indicates little confidence in the measurement error of isometric dynamometer within the CNP population.

Strain gauge dynamometer

Two studies evaluated intra-rater reliability of SGD. One used the Neck Exercise Unit (Follo, Norway) [11], and the other used a neck strength measurement system consisting of two parts, each with its own strain gauges (Kuntovaline Inc, Helsinki, Finland) [37]. Both studies reported good reliability, with ICCs ranging from 0.74 to 0.96 [37] and correlation coefficients ranging from 0.938 to 0.968 [11]. The risk of bias was rated inadequate [11] and doubtful [37]. Both studies were rated indeterminate on the COSMIN criteria for good measurement properties. Low overall quality indicates limited confidence in the reliability estimates of SGD within the CNP population.

Measurement error was investigated in one study [11]. The risk of bias was rated as doubtful, and the result was rated indeterminate on the COSMIN criteria for good measurement properties. Low overall quality indicates little confidence in the measurement error of SGD within the CNP population.

Modified sphygmomanometer dynamometer

One study evaluated intra-rater reliability of MSD using a Comparative Muscle Tester (Magnatec Co. Ltd., Ontario, Canada). The authors reported a high level of accuracy, performance-related reliability, and consistency [33]. The risk of bias was rated inadequate, with a rating of indeterminate on the COSMIN criteria for good measurement properties. Very low quality for intra-rater reliability of MSD indicates very limited confidence in the reliability estimate within the CNP population. No studies were identified for measurement error with this outcome measure.

Multi-cervical unit

Two studies evaluated test-retest reliability of the MCU, both reporting good to excellent reliability (BTE Technologies, Inc. [26]; Hanoun Medical Inc., Ontario [4]). The risk of bias was rated as doubtful for both studies, and both were rated sufficient on the COSMIN criteria for good measurement properties. Low overall quality for test-retest reliability of the MCU indicates limited confidence in the reliability estimate within the CNP population.

One study investigated measurement error with results summarized in Table 2 [26]. The risk of bias was rated as doubtful, with the COSMIN criteria for good measurement properties rated as indeterminate. The very low overall quality indicates very little confidence in measurement error estimate for the MCU within a CNP population.

Multifunctional measurement unit

One study evaluated intra- and inter-rater reliability of a multifunctional measurement unit using the Back Check 607 [28]. The authors reported excellent intra- and inter-rater reliability. The risk of bias was rated as doubtful, and the results were rated sufficient on the COSMIN criteria for good measurement properties. Very low overall quality for both intra- and inter-rater reliability indicates very little confidence in the reliability estimates for the multifunctional measurement unit within the CNP population. No studies were identified for measurement error with this outcome measure.

Responsiveness

No studies were identified which evaluated responsiveness.

Discussion

This systematic review, which evaluated outcome measures of neck strength and their measurement properties in people with CNP, identified six measures used to evaluate neck strength, with the majority of the research investigating people with CNP of non-traumatic origin. The variety of outcome measures used to assess neck strength demonstrates the lack of agreement on, and of a gold standard for, the most appropriate measure of neck strength. To ensure comprehensiveness, all available measures were included in this review. Nevertheless, our review revealed that a consensus on the optimal outcome measure is still needed so that neck muscle strength measures can be better standardised across future studies.

Reliability was evaluated for all six measures; measurement error was evaluated for the HHD, isokinetic, and isometric dynamometers, SGD, and MCU; and validity was evaluated only for the MCU. No study evaluated responsiveness. The risk of bias for all studies was rated as doubtful or inadequate apart from the study which investigated inter-rater reliability and measurement error of an HHD, which was rated as adequate [30]. For reliability, the overall quality was rated as very low for all outcome measures aside from SGD and MCU, which were rated low. All these studies had small sample sizes and poor overall methodological quality, which contributed to the high risk of bias and low overall quality for reliability. For measurement error, the HHD was rated moderate for overall quality of evidence, whilst isometric dynamometry and SGD were rated as low. The isokinetic dynamometer and MCU were rated very low for overall quality. For validity, the quality of evidence was rated very low due to imprecision, as the total sample size of the study was less than 50.

Several factors in the reliability studies included in this review contributed to the high risk of bias and the low or very low overall quality of evidence for each measure. Besides imprecision, the methodological quality of many studies varied because information regarding the study design was lacking, particularly the description of experimental preparation, the examiners’ or raters’ positions, and their expertise or training with the measurement tool. Two important aspects of internal validity, randomization and blinding of raters, were also poorly documented across studies. Both are fundamental methodological features for avoiding selection bias and guarding against accidental bias [31]. The reported time intervals between measurements were inconsistent amongst studies, varying from seconds to weeks. According to COSMIN, 2 weeks is the recommended time interval for PROM measurements [19, 20]. However, in the context of evaluating neck muscle strength, a period of 2 weeks [25] could be argued to be too long, as it allows time for changes in neck muscle strength to take place. On the other hand, an interval of 1 min [3] is likely to allow recall bias to occur in participants because there is no washout period. Consensus on a standardized time interval is warranted to minimize measurement variation and improve the methodological quality of future studies. Furthermore, variations in muscle testing protocols were observed across studies. These variations may influence the reliability or validity of each neck muscle strength measure, and without consistent measurement procedures it is difficult to establish the most appropriate outcome measure.

Another issue found within the studies is the lack of clarity about the statistical methods used to evaluate the reliability and measurement error of measures. Some studies did not describe the model or formula used for statistical analysis of data. COSMIN recommends the intraclass correlation coefficient as the preferred statistical method for continuous scores in evaluating reliability [19, 20]; however, this was not reported in one study [11].
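To show what reporting the model and formula involves, the sketch below computes one common form of the intraclass correlation coefficient, the two-way random-effects, absolute-agreement, single-measurement ICC, often written ICC(2,1), from a participants-by-raters table. It also derives the standard error of measurement (SEM) from the rater and error variance components and the smallest detectable change (SDC) as 1.96 × √2 × SEM. The choice of ICC model, the function names, and the data are assumptions for illustration only; the numbers are not taken from any included study.

```python
from math import sqrt
from typing import List, Tuple


def two_way_mean_squares(scores: List[List[float]]) -> Tuple[float, float, float]:
    """Return (MS_subjects, MS_raters, MS_error) for a complete n x k table."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    return ss_rows / (n - 1), ss_cols / (k - 1), ss_error / ((n - 1) * (k - 1))


def icc_2_1(scores: List[List[float]]) -> float:
    """Two-way random-effects, absolute-agreement, single-measurement ICC."""
    n, k = len(scores), len(scores[0])
    msr, msc, mse = two_way_mean_squares(scores)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def sem_agreement(scores: List[List[float]]) -> float:
    """SEM from rater and error variance components (agreement form)."""
    n = len(scores)
    _, msc, mse = two_way_mean_squares(scores)
    rater_variance = max((msc - mse) / n, 0.0)  # negative estimates set to zero
    return sqrt(rater_variance + mse)


def smallest_detectable_change(sem: float) -> float:
    """SDC at the individual level: 1.96 * sqrt(2) * SEM."""
    return 1.96 * sqrt(2) * sem


# Illustrative data: 6 participants (rows) scored by 4 raters (columns).
data = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(data)                      # about 0.29
sem = sem_agreement(data)                # about 2.50
sdc = smallest_detectable_change(sem)    # about 6.94
```

Other ICC forms (for example, consistency rather than absolute agreement, or averaged rather than single measurements) give different values from the same data, which is why the model needs to be stated alongside the estimate.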

Methodological considerations

Some limitations of the present review should be acknowledged. Only articles published in English were included. Moreover, as the results were heterogeneous, a meta-analysis was not feasible; a narrative synthesis was conducted instead to summarize the findings. Given the low quality of the included studies, firm conclusions or recommendations could not be made regarding the most appropriate outcome measures for evaluating neck strength and monitoring changes in patients with CNP.

Implications

The findings from this systematic review have the following future implications for research and clinical practice:

  1. A range of outcome measures are used to examine neck muscle strength and as such, there remains a lack of consensus and standardized approach in performing neck strength measurements.

  2. This review unveiled methodological flaws in existing studies evaluating measurement properties of neck strength measures. Future research should carefully consider study design and reporting of results (e.g., better description of examiners, adequate time between measurements, reporting of blinding of examination, outlining statistical model for data analysis, etc.) in order to ensure future results with higher overall quality of evidence.

Conclusion

This systematic review examined the measurement properties of six outcome measures used to evaluate neck muscle strength in people with CNP. Apart from one study evaluating reliability and measurement error, the risk of bias for all studies was rated as doubtful or inadequate. The overall quality of evidence for all measurement properties was rated as low or very low, apart from measurement error of a handheld dynamometer. Due to variability in methodologies and statistical methods, it was difficult to establish the reliability of the various neck strength measures and therefore to recommend an optimal outcome measure for evaluating neck muscle strength in people with CNP. Further high-quality research is required to evaluate the measurement properties of neck muscle strength measures in order to determine the most appropriate measure for future use.