Key Points for Decision Makers

This study found that generic measures of health-related quality of life (HRQL), such as the EQ-5D, showed relatively little differentiation between patients with alopecia areata (AA) by extent of hair loss and treatment response.

Disease-specific measures of the impact of AA had larger differences across known groups compared with generic measures of HRQL, including the EQ-5D-5L and SF-36v2.

If the EQ-5D-5L does not reflect the burden of AA on patients’ HRQL, then cost-effectiveness analyses of treatments in AA may underestimate the value of those treatments.

1 Introduction

The EQ-5D-5L is a generic measure of health-related quality of life (HRQL) which is used in the estimation of quality-adjusted life years (QALYs) [1]. It is a preferred measure of HRQL for several decision makers such as the National Institute for Health and Care Excellence (NICE) in the UK [2, 3]. NICE supports the use of the measure for a number of reasons. Primarily, it provides a high degree of standardisation which reduces uncertainty in decisions. The EQ-5D-5L has been shown to be able to measure the health impact of a wide range of conditions and so, in principle, resource allocation decisions for new health technologies should not favour one disease area over another [4,5,6,7].

The EQ-5D-5L measures five dimensions of health and there has been an ongoing discussion regarding whether the EQ-5D-5L misses important areas of health for certain diseases [8]. The EuroQol Group have supported a research programme to develop and test ‘bolt-on’ dimensions in disease areas including respiratory disease [9] and psoriasis [10, 11], as well as vision, hearing and tiredness [12]. If the five dimensions of the EQ-5D-5L do not capture the impact or burden of a disease, then it is likely that cost-effectiveness analyses of treatments in these diseases will underestimate the true value of an effective treatment.

Alopecia areata (AA) is an autoimmune disease that leads to non-scarring hair loss ranging from small bald patches to complete loss of scalp hair (alopecia totalis) or scalp, face and/or body hair (alopecia universalis) [13]. AA is a prevalent condition, with an estimated 18.4 million cases worldwide in 2019 [14]. Recovery or regrowth of hair is highly unpredictable. Hair regrowth can occur, sometimes with different hair quality and/or colour [15]. Research has described how AA can have a significant HRQL impact on patients, particularly in the area of psychosocial functioning [16, 17]. This can include impacts on confidence, self-consciousness, self-image and social relationships [18,19,20,21,22,23]. Evidence also suggests that the HRQL impact can be greater for people with more extensive hair loss [16, 17]. More extensive scalp hair loss has been shown to be correlated with impairments to social functioning, ability to carry out daily activities and mental health as measured by dermatology-specific instruments (Dermatology Life Quality Index [DLQI] and Skindex-29) [21] and generic HRQL instruments (i.e., SF-36) [24].

Treatments are emerging that act on the physiological cause of AA. Following approval, these treatments are expected to undergo health technology assessment (HTA) review, which often requires HRQL data (commonly from the EQ-5D) for QALY estimation [25,26,27,28,29]. NICE have recently published a negative appraisal of baricitinib for the treatment of severe AA. One reason given by NICE for this decision was that baricitinib did not show meaningful improvements in HRQL compared with placebo in phase 3 trials [30]. However, concerns have been raised by various stakeholders over the validity and sensitivity of the EQ-5D when used to assess AA [31, 32]. In the baricitinib technology appraisal, the British Association of Dermatologists (BAD) highlighted the significant psychosocial impacts of AA (e.g. on self-confidence and social functioning), many of which are not captured by the EQ-5D. The BAD also described difficulties in capturing the true impact of treatments in AA using QALYs [31].

Several studies have demonstrated that the EQ-5D-5L has acceptable psychometric properties in other dermatological conditions [33,34,35]. However, it has also been criticised because important HRQL concepts are missing for these patients [11, 36]. NICE recently commented that “the EQ-5D often fails to capture quality-of-life improvements for people with skin conditions” [36]. Indeed, important HRQL concepts in AA, such as self-confidence and self-consciousness, may not be adequately represented by the EQ-5D dimensions. Our literature searches conducted to date have identified no studies that were designed to test the psychometric properties of the EQ-5D-5L in AA. Given the concerns of NICE and the BAD, we considered it important to examine the measurement properties of the EQ-5D-5L in people with AA.

This study describes analyses which were designed to explore the psychometric properties of the EQ-5D-5L in people with AA, including ceiling and floor effects, responsiveness, factor structure, convergent validity and known-groups validity. Psychometric performance of the EQ-5D-5L is compared with another generic HRQL measure (SF-36v2) and a condition-specific measure (Alopecia Areata Patient Priority Outcomes Questionnaire [AAPPO]) [37].

2 Methods

2.1 Data Source

Adult data from the multinational ALLEGRO-2b/3 trial (NCT03732807) of ritlecitinib were used for this analysis. The trial was conducted in multiple countries including Canada, USA, Mexico, Colombia, Chile, Argentina, Hungary, Czech Republic, UK, Germany, Poland, Spain, Russia, Japan, South Korea, Taiwan, China and Australia. All study participants had ≥ 50% scalp hair loss at baseline and participants were followed up for 48 weeks. Study participants’ hair loss was assessed using the Severity of Alopecia Tool (SALT), a standardised measure assessed by a physician. This study also administered patient-reported outcome (PRO) measures including the AAPPO, EQ-5D-5L, SF-36v2 and Patient Global Impressions of Change (PGI-C).

2.2 Measures

2.2.1 AAPPO

The AAPPO has been shown to be a reliable and valid AA-specific PRO measure which assesses the impact of AA in terms of hair loss, emotional symptoms and activity limitations [37]. It includes four items to assess hair loss (scalp, eyebrow, eyelash and body hair), four items evaluating emotional symptoms (self-consciousness, embarrassment, sadness and frustration) and three items for activity limitations (outdoor activity, physical activity, interactions with others) [37]. Each item is scored from 0 to 4, with higher scores indicating greater impacts. AAPPO subscale scores were computed as the mean of relevant items.

2.2.2 EQ-5D-5L

The EQ-5D-5L is a generic measure of health which assesses mobility, self-care, usual activities, pain/discomfort and anxiety/depression, plus an overall self-rating of health (EQ-VAS). Each dimension is rated by the individual using five response levels [1]. The measure can be summarised as a preference-weighted single index score. EQ-5D-5L index scores are anchored at 0 (equal to being dead) and 1 (full health); negative values represent health states that are considered worse than being dead. EQ-5D-5L data were scheduled for collection at five visits (baseline and weeks 4, 12, 24 and 48).

2.2.3 SF-36v2

The SF-36v2 is also a generic measure of HRQL which assesses health in eight domains and two summary scores (Physical Component Summary score and Mental Component Summary score) [38]. These summary scores provide a simple norm-based method for describing the state of an individual’s HRQL. Summary scores are transformed to have a mean of 50 and standard deviation of 10 based on normative data from the US general population. Scores range from 0 to 100, with higher scores indicating better health.
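
As a hedged illustration of the norm-based scoring described above, a generic T-score transformation can be written as follows. The symbols are ours and are not taken from the SF-36v2 scoring manual: x is a raw summary score, and μ and σ are the mean and standard deviation of that score in the normative sample. The licensed scoring algorithm involves further steps that are not described here.

```latex
\[
T = 50 + 10 \cdot \frac{x - \mu_{\text{norm}}}{\sigma_{\text{norm}}}
\]
```

Under this transformation, a respondent scoring one normative standard deviation below the normative mean would receive T = 40.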

2.2.4 PGI-C

The PGI-C is a single-item PRO administered at weeks 24 and 48 to assess improvement or worsening of participants’ AA compared with baseline. Responses are given on a 7-point Likert scale: 1 (greatly improved), 2 (moderately improved), 3 (slightly improved), 4 (not changed), 5 (slightly worsened), 6 (moderately worsened) and 7 (greatly worsened).

2.2.5 SALT

The SALT score is computed by measuring the percentage of hair loss in each of four areas of the scalp, weighting each percentage by the share of the scalp that the area represents (vertex 40%, right profile 18%, left profile 18% and posterior 24%) and summing the weighted values to give a composite score [39]. The SALT score was the primary outcome measure in the ALLEGRO clinical trial [40].
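
The weighted sum behind the SALT score can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the region keys and the example values are our own assumptions and are not part of the trial's scoring procedures.

```python
# Regional weights (share of total scalp area) as listed in the text.
SALT_WEIGHTS = {
    "vertex": 0.40,
    "right_profile": 0.18,
    "left_profile": 0.18,
    "posterior": 0.24,
}


def salt_score(hair_loss_pct: dict) -> float:
    """Return a composite SALT score (0 = no scalp hair loss, 100 = complete loss).

    hair_loss_pct maps each scalp region to the percentage of hair loss
    (0-100) observed in that region. Hypothetical helper, for illustration.
    """
    if set(hair_loss_pct) != set(SALT_WEIGHTS):
        raise ValueError("expected exactly the four SALT regions")
    for region, pct in hair_loss_pct.items():
        if not 0 <= pct <= 100:
            raise ValueError(f"{region}: percentage must be between 0 and 100")
    # Weight each regional percentage by its share of the scalp and sum.
    return sum(SALT_WEIGHTS[r] * hair_loss_pct[r] for r in SALT_WEIGHTS)


# Hypothetical participant, not trial data.
example = {"vertex": 50, "right_profile": 100, "left_profile": 100, "posterior": 25}
print(round(salt_score(example), 1))  # 0.4*50 + 0.18*100 + 0.18*100 + 0.24*25 = 62.0
```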

2.3 Analyses

All HRQL measures were scored using standard methods. EQ-5D-5L was scored using UK weights as recommended by NICE [41]. The data were summarised for three key timepoints where measures were co-administered (baseline, week 24 and week 48) unless otherwise specified. Descriptive sample data were summarised at baseline only and included demographics and AA characteristics.

2.3.1 Ceiling Effects

HRQL scores from each measure were summarised, together with the counts and proportions of participants scoring the worst (floor effects) or best (ceiling effects) possible score on an instrument (e.g. 11111 on the EQ-5D-5L). The data from other outcome measures were compared between participants who rated themselves in EQ-5D-5L state 11111 and those in any other EQ-5D-5L state. This was designed to explore variability in other outcomes for people who rated themselves in full health. It was hypothesised that substantial variability in HRQL scores on the SF-36v2 for patients in EQ-5D-5L state 11111 would support the hypothesis that the EQ-5D-5L demonstrates a ceiling effect.
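
The ceiling and floor calculation described above can be sketched as follows. The function names and all values are hypothetical and are used only to show the mechanics; the sketch assumes EQ-5D-5L responses are stored as five-digit profile strings and that 55555 is the worst state.

```python
from statistics import stdev


def ceiling_floor(profiles: list) -> dict:
    """Proportion of EQ-5D-5L profiles at the best (11111) and worst (55555) state."""
    if not profiles:
        raise ValueError("no profiles supplied")
    n = len(profiles)
    return {
        "ceiling": sum(p == "11111" for p in profiles) / n,
        "floor": sum(p == "55555" for p in profiles) / n,
    }


def spread_by_full_health(profiles: list, other_scores: list) -> tuple:
    """Sample SD of another measure for profile 11111 versus any other profile."""
    full = [s for p, s in zip(profiles, other_scores) if p == "11111"]
    rest = [s for p, s in zip(profiles, other_scores) if p != "11111"]
    return stdev(full), stdev(rest)


# Hypothetical responses, for illustration only (not trial data).
profiles = ["11111", "11112", "11111", "21123", "11111", "55555", "11121", "11111"]
mcs = [58.0, 41.5, 35.2, 30.1, 52.4, 22.0, 47.3, 44.8]

result = ceiling_floor(profiles)
sd_full, sd_rest = spread_by_full_health(profiles, mcs)
print(result, sd_full, sd_rest)
```

A large spread of scores on the second measure among participants in state 11111 is the pattern that the hypothesis above treats as evidence of a ceiling effect.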

2.3.2 Convergent Validity

The correlations between the EQ-5D-5L, other HRQL measures and SALT scores were used to assess convergent validity in patients with AA. The strength of correlation was interpreted as: weak (r < 0.4), moderate (r > 0.4–0.8) and strong (r > 0.8–1). The SALT score can be considered an objective marker of the severity of AA. It was assumed that people with more extensive hair loss would have worse HRQL. Therefore, we hypothesised that the AAPPO would show a moderate to strong correlation with the extent of hair loss (SALT), but the EQ-5D-5L would only show a weak to moderate correlation with SALT score [21, 24, 37]. This was assessed using data from pooled timepoints to increase the number of observations in the analysis; correlations at baseline, week 24 and week 48 were also assessed to explore any differences.
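
A minimal sketch of the correlation and its interpretation is given below. Two points are our assumptions rather than statements from the analysis plan: the bands are applied to the absolute value of r, and a coefficient of exactly 0.4 is treated as moderate. The data are hypothetical.

```python
from math import sqrt


def pearson_r(x: list, y: list) -> float:
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def strength(r: float) -> str:
    """Label a coefficient using the bands in the text (applied to |r|, our assumption)."""
    size = abs(r)
    if size < 0.4:
        return "weak"
    if size <= 0.8:
        return "moderate"
    return "strong"


# Hypothetical SALT scores and EQ-5D-5L index values, not trial data.
salt = [95.0, 80.0, 60.0, 100.0, 55.0]
eq5d = [0.85, 0.95, 1.00, 0.80, 1.00]
r = pearson_r(salt, eq5d)
print(round(r, 2), strength(r))
```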

2.3.3 Known Groups Validity and Responsiveness

Two different variables (PGI-C and SALT) were used to define known groups and assess differences in HRQL outcomes. PGI-C responses of greatly improved (1) and moderately improved (2) were categorised a priori as responders. Participants were also categorised according to SALT scores at follow-up: those who remained at SALT 50+ were compared with those who had experienced substantial regrowth (SALT score of 0–10). Differences within each known group were assessed using two-way paired t-tests. The degree of difference in the outcome variables was estimated as an effect size using Cohen’s d statistic. The effect sizes were interpreted as small (d = 0.20), medium (d = 0.50) and large (d = 0.80) [42].
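
The known-group definitions and the effect size can be sketched as below. The text does not state which variant of Cohen's d was used, so the pooled-standard-deviation form for two independent groups is shown purely as an assumption. The function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Optional


def pgic_responder(rating: int) -> bool:
    """PGI-C ratings of 1 (greatly improved) or 2 (moderately improved) count as response."""
    if rating not in range(1, 8):
        raise ValueError("PGI-C rating must be an integer from 1 to 7")
    return rating <= 2


def salt_group(salt_at_follow_up: float) -> Optional[str]:
    """Assign the SALT-based known group; scores between 10 and 50 fall in neither group."""
    if 0 <= salt_at_follow_up <= 10:
        return "regrowth"
    if salt_at_follow_up >= 50:
        return "salt_50_plus"
    return None


def cohens_d(group_a: list, group_b: list) -> float:
    """Cohen's d with a pooled standard deviation (assumed variant, see lead-in)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical scores for two known groups, not trial data.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(d)  # means 4 and 3, pooled SD 2, so d = 0.5
```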

To assess responsiveness, changes in participant scores for the EQ-5D-5L, SF-36v2 and AAPPO subscales from baseline to week 24 and week 48 were compared across known groups. Correlations between sample mean changes from baseline to weeks 24 and 48 for each measure were then assessed using Pearson correlation coefficients. The strength of correlation coefficients was interpreted as: weak (r = 0–0.4), moderate (r > 0.4–0.8) and strong (r ≥ 0.8–1). It was hypothesised that the EQ-5D-5L change scores would have a weak-to-moderate correlation with the SF-36v2 and AAPPO change scores and a weak correlation with SALT change scores.
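
The change-score comparison across known groups can be illustrated with a short sketch. The participant identifiers, group labels and values are invented, and the helper names are ours; participants without a follow-up value are simply dropped, which is an assumption about missing-data handling.

```python
from statistics import mean


def change_from_baseline(baseline: dict, follow_up: dict) -> dict:
    """Follow-up minus baseline for every participant with both assessments."""
    return {pid: follow_up[pid] - baseline[pid] for pid in baseline if pid in follow_up}


def mean_change_by_group(change: dict, group: dict) -> dict:
    """Average change score within each known group (e.g. PGI-C responder or not)."""
    grouped = {}
    for pid, delta in change.items():
        if pid in group:
            grouped.setdefault(group[pid], []).append(delta)
    return {label: mean(values) for label, values in grouped.items()}


# Hypothetical EQ-5D-5L index values, not trial data.
baseline = {"p1": 0.80, "p2": 0.90, "p3": 1.00, "p4": 0.70}
week24 = {"p1": 0.90, "p2": 0.90, "p3": 1.00}  # p4 has no week 24 value
group = {"p1": "responder", "p2": "non-responder", "p3": "responder"}

change = change_from_baseline(baseline, week24)
summary = mean_change_by_group(change, group)
print(summary)
```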

2.3.4 Exploratory Factor Analysis

An exploratory factor analysis (EFA) using individual items from the AAPPO, EQ-5D-5L and SF-36v2 baseline data was conducted. Items intending to measure the same HRQL concepts across instruments were anticipated to load onto the same factors. If EQ-5D dimensions do not load on factors that emerge from the EFA (but AAPPO or SF-36v2 do), it may suggest that the EQ-5D-5L is unable to measure these factors. This could indicate a measurement weakness in EQ-5D-5L.

Best practice guidance was followed and recommended statistical tests to assess assumptions were performed (e.g. unidimensionality, multicollinearity and normality) before determining the appropriate procedures for analysis [43]. Missing data were initially removed from the dataset. A correlation matrix was used to identify and remove any individual items that were highly correlated. A Kaiser–Meyer–Olkin (KMO) test was used to assess sampling adequacy, and Bartlett’s test of sphericity was used to assess whether the correlation matrix differed significantly from an identity matrix. These data, scree plots, parallel analysis and Kaiser’s criterion (eigenvalues > 1) were then used to determine whether promax or varimax rotation was most appropriate for the factor analysis.
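
For reference, the two diagnostics named above are commonly written as follows. This is the standard textbook form rather than a description of the software used in this analysis, and the notation is ours: r_ij is the correlation between items i and j, a_ij is their partial correlation controlling for all other items, R is the p × p item correlation matrix and n is the sample size.

```latex
\[
\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}
                    {\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}},
\qquad
\chi^{2} = -\left(n - 1 - \frac{2p + 5}{6}\right) \ln \lvert R \rvert,
\qquad
\mathrm{df} = \frac{p(p - 1)}{2}
\]
```

KMO values close to 1 indicate that partial correlations are small relative to the raw correlations, and a significant Bartlett statistic indicates that R differs from an identity matrix.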

3 Results

Data were available from 612 participants across the three timepoints. Demographic data for the sample at baseline are provided in Table 1. In total, 22% of the sample had alopecia universalis (i.e. complete loss of scalp, face and body hair), and 88.1% had eyebrow or eyelash hair loss.

Table 1 ALLEGRO-2b/3 baseline (week 1) patient sample demographic characteristics

3.1 Ceiling Effects

Table 2 shows that just over half of participants at each timepoint reported themselves in EQ-5D-5L state 11111 (baseline: 55.9%, week 24: 55.3% and week 48: 61.2%). The distribution of SF-36v2 Mental Component Summary (MCS) scores, stratified by whether participants were in full health on the EQ-5D-5L (11111) or any other EQ-5D-5L state, is shown in Fig. 1. This figure illustrates the degree of variability in scores on the SF-36v2 for patients defined as in full health on the EQ-5D-5L. Ceiling effects were smaller for the AAPPO emotional symptoms subscale but similar to the EQ-5D-5L for the AAPPO activity limitations subscale. No ceiling effects were observed for the SF-36v2 summary scores.

Table 2 Descriptive statistics of outcome measures collected during the ALLEGRO-2b/3 trial presented by baseline, mid-trial (week 24) and trial end (week 48) timepoints
Fig. 1 Histograms of the distribution of 36-Item Short Form Health Survey version 2 (SF-36v2) Mental Component Summary scores by participants responding with full health scores (11111) against any other scores on the EQ-5D-5L at week 24 and week 48

3.2 Convergent Validity

Convergent validity analyses revealed that the EQ-5D-5L correlated with both component summary scores of the SF-36v2 and with the AAPPO subscale scores (Table 3). As hypothesised, the strength of these correlations was moderate. There was a very weak correlation with the SALT score. The patterns of correlation with the AA-specific outcomes (AAPPO, SALT score) were similar for both generic measures (EQ-5D-5L, SF-36v2), although the SF-36v2 showed higher correlations with the AAPPO scores. The AAPPO subscales correlated more strongly with SALT scores than either generic measure did, although these correlations remained weak.

Table 3 Convergent validity: correlations between outcome measure scores across all timepoints

3.3 Known Groups Validity and Responsiveness

The EQ-5D-5L scores showed very small differences between known groups defined by change in SALT score or PGI-C response, with very small effect sizes (Table 4). Similar findings were observed for the SF-36v2, with small differences observed between SALT and PGI-C subgroups (Appendix A). By contrast, the AAPPO subscale scores generally detected larger differences between SALT and PGI-C known groups, with some large effect sizes observed (Table 4).

Table 4 Known-groups validity of the EQ-5D-5L and Alopecia Areata Patient Priority Outcomes Questionnaire (AAPPO) subscales

Further responsiveness assessments (Table 5) revealed statistically significant but weak correlations between changes in the EQ-5D-5L and changes in SALT (W24: r = − 0.13, p < 0.05; W48: r = − 0.19, p < 0.05). The relationship between the EQ-5D-5L and SALT change scores is also visualised in a scatter plot (Appendix B), which highlights the limited change in EQ-5D-5L scores for many participants despite significant hair regrowth. Correlations of a similar pattern and magnitude were observed for the SF-36v2 MCS but not the SF-36v2 Physical Component Summary (PCS) (Appendix C). Changes in the AAPPO subscales demonstrated a stronger correlation with changes in SALT, although these correlations remained weak (Appendix C).

Table 5 Responsiveness: correlations between changes in EQ-5D-5L scores and changes in Alopecia Areata Patient Priority Outcomes Questionnaire (AAPPO), Severity of Alopecia Tool (SALT) and SF-36v2 scores from baseline to weeks 24 and 48

3.4 Exploratory Factor Analyses

The main assumptions were met to conduct the EFA. Input variables were continuous with no outliers present in the dataset; n = 11 instances of missing data were removed leading to a total sample size of N = 601. The Kaiser–Meyer–Olkin (KMO) test revealed an overall measure of sampling adequacy (MSA) of 0.92, and no individual items had an MSA < 0.50, indicating adequate sampling. The correlation matrix and Bartlett’s Test of Sphericity (p < 0.001) confirmed reasonably strong linear correlations between the variables and no multicollinearity was observed.

Oblique rotation was initially deployed assuming factors would be correlated. Based on low correlations (≤ 0.30) in the factor correlation matrix, the factor analysis was re-run using orthogonal rotation. The final factor structure was then assessed using a scree plot, parallel analysis and Kaiser’s criterion. Factor loadings of < 0.40 and factors with fewer than three items were considered eligible for exclusion. This led to a nine-factor solution capturing 56% of the variance observed between all 51 variables, in which all factors demonstrated acceptable to good reliability (α = 0.72–0.93).
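
Assuming the reported α refers to Cronbach's alpha (the text does not state this explicitly), the reliability of the items loading on one factor could be computed along the following lines. The function name and the data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items: list) -> float:
    """Cronbach's alpha; items[k] holds every respondent's score on item k."""
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    n = len(items[0])
    if n < 2 or any(len(col) != n for col in items):
        raise ValueError("each item needs the same number (>= 2) of respondents")
    # Total score per respondent across the k items.
    totals = [sum(col[i] for col in items) for i in range(n)]
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical item scores (3 items, 5 respondents), not trial data.
factor_items = [
    [0.0, 1.0, 2.0, 3.0, 4.0],
    [1.0, 1.0, 2.0, 4.0, 4.0],
    [0.0, 2.0, 2.0, 3.0, 3.0],
]
print(round(cronbach_alpha(factor_items), 2))
```

Items that rise and fall together across respondents give a total-score variance much larger than the sum of the item variances, which pushes alpha towards 1.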

As anticipated, SF-36v2 and EQ-5D-5L items grouped into several conceptually related factors relating to themes of generic HRQL. AAPPO items loaded onto factors corresponding to their defined subscales (factor 4, emotional symptoms; factor 7, hair loss; factor 8, activity limitations); these factors did not include any items from the EQ-5D-5L or SF-36v2. The final factor structure is presented in Appendix D.

4 Discussion

Evidence exists describing the burden of AA on HRQL [20, 21, 24, 44] including two large systematic reviews [16, 17]. The HRQL impact includes not only the psychosocial burden but also a broader impact in terms of patients’ willingness to undertake daily activities. This evidence clearly suggests that the burden and impact of AA extends beyond hair loss, leading to a much wider impact on HRQL. The AAPPO data from the ALLEGRO trial support this conclusion.

The evidence from this analysis of outcomes data suggests that the EQ-5D-5L does not accurately measure the impact of AA. The EQ-5D-5L moderately correlated with all PRO measures assessed, indicating conceptual overlap. However, EQ-5D-5L scores very weakly correlated with the extent of hair loss (SALT scores). Therefore, the EQ-5D-5L may not be sensitive to potential HRQL impacts associated with more severe hair loss. A high proportion of participants scored at the ceiling of the EQ-5D-5L. The data in Fig. 1 show the degree of variability in SF-36v2 MCS scores for people who all reported themselves in full health (state 11111). This variability was substantial and suggests that there are important differences between patients which are not reflected in EQ-5D-5L scores. The EQ-5D-5L was unable to clearly differentiate subgroups of patients defined by SALT score or PGI-C response, indicating limited known-groups validity. The factor analysis indicated that there were areas of variance in the outcomes data that were assessed by the AAPPO measure but not by the EQ-5D-5L, suggesting a possible limitation in content validity. AAPPO subscales generally outperformed the EQ-5D-5L in responsiveness, convergent validity and known-groups validity, suggesting that the EQ-5D-5L may fail to detect the full HRQL impact associated with changes to levels of hair loss.

For many decision makers (especially in HTA), the benefit of hair regrowth in people with AA is assessed in terms of the improvement in HRQL. The HRQL gain helps determine the value of the treatment. As noted above, the standard method to measure this treatment value is through the administration of generic measures of HRQL, such as the EQ-5D-5L, which can be used to estimate QALYs. The analyses reported here support the views of the BAD and NICE that the EQ-5D-5L may have important limitations as a measure of HRQL in AA [31, 32]. If the EQ-5D-5L cannot measure the full impact of AA, then HTA bodies may underestimate the value of a treatment. There is a risk that patients will not be afforded access to effective treatments because the EQ-5D-5L is unable to measure treatment benefits. Decision makers, such as NICE, may consider other approaches to measure HRQL in instances when the EQ-5D-5L has been shown to be inappropriate [45], including the use of other generic preference-based measures (PBMs), condition-specific PBMs of HRQL, where available, or the use of vignettes. EQ-5D-5L ‘bolt-on’ modules could also be used to improve the sensitivity of the measure by including relevant aspects of HRQL that are missing. No EQ-5D-5L bolt-on modules have been developed or validated in patients with AA; however, the psoriasis-specific version of the EQ-5D-5L (EQ-PSO) [10], which includes ‘skin irritation’ and ‘self-confidence’ bolt-ons, has been shown to improve the content validity of the EQ-5D-5L in patients with psoriasis [11]. Other PBMs under development, such as the EQ Health and Wellbeing instrument (EQ-HWB) and the Recovering Quality of Life (ReQoL), may be more appropriate for capturing the wider impacts of conditions such as AA [46, 47].

Although the psychometric properties of the EQ-5D-5L in the context of AA have not been fully explored, two previous studies in the literature which report EQ-5D-5L scores in AA are worth considering because they report real-world data. Vañó-Galván et al. (2023) report EQ-5D-5L scores ranging from 0.89 for mild AA to 0.77 for severe AA among European patients [48]. A similar study in the USA reported EQ-5D-5L scores of 0.95 for mild AA and 0.87 for severe AA [44]. In both studies, the results are based on clinicians’ subjective assessment of AA severity, rather than the use of a standardised assessment of AA hair loss. Hence, the low EQ-5D-5L score for severe AA may reflect not only hair loss but also the perceived psychosocial impact. The scores from these studies are similar to the data from the ALLEGRO trial but are perhaps a little worse for the most severe patients.

The focus of this paper was on the EQ-5D-5L because of its importance in HTA, but the analyses also showed that the SF-36v2 had similarly limited sensitivity. Generic measures (EQ-5D-5L and SF-36v2) may fail to capture the psychosocial impacts of AA. Future research could usefully explore the relevance of items on both measures for people with AA and why the burden of the condition is not fully captured. The condition-specific measure, the AAPPO, was much more sensitive to differences between patients defined by changes in SALT score and PGI-C response.

A limitation of this analysis was that the patients in the ALLEGRO-2b/3 trial all had SALT ≥ 50 at baseline, meaning that comparisons could not be made with those who have lower SALT scores at baseline. Secondly, the clinical trial excluded participants with major psychiatric conditions (including suicidal ideation and moderately severe depression), which may have led to an underestimation of the severity of mental health impacts in this sample and introduced some measurement problems (e.g. limited ability to detect improvement and compare scores between known groups). Another limitation is that this study was unable to explore the reliability of the EQ-5D-5L as the measure was not administered repeatedly before treatment was received (e.g. at screening and then baseline). However, the trial data were useful for assessing patterns of response data, especially ceiling effects, which can indicate measurement limitations.

To conclude, the present data suggest that the EQ-5D-5L may not be adequate to measure the burden of AA and treatment-related benefit with hair regrowth. Other approaches for measuring HRQL, as recommended by decision makers such as NICE, should be considered to capture the full HRQL burden of AA.