Background

The Lee Fatigue Scale (LFS) [1] is frequently used to measure fatigue severity in a variety of patient populations. It has mostly been used to measure fatigue severity in patients with cancer [2,3,4], human immunodeficiency virus (HIV) [5], osteoarthritis [6, 7], stroke [7], obstructive sleep apnea [8], patients undergoing dialysis [9], and patients treated in intensive care units [10] as well as in pregnant women [11] and parents of preterm infants [12]. Several studies have shown that the LFS is sensitive in measuring diurnal patterns of fatigue, distinguishing between morning and evening levels of fatigue [3, 5, 13, 14].

When the psychometric properties of the original 13-item version of the LFS were evaluated with a Rasch analysis approach that compared fatigue scores obtained in the morning and evening [15], nine of the items showed acceptable goodness-of-fit for the morning measures and 10 items for the evening measures. These items also demonstrated an acceptable level of uni-dimensionality and were able to differentiate the patients into four levels of fatigue severity.

Because of the nature of fatigue, it is important to reduce participant burden when conducting research on fatigue. Reducing the burden on participants is also relevant in the context of public health surveys, where participants are asked to respond to a broad range of questionnaires. A previous study found that most of the frequently used instruments to assess fatigue consisted of 10–20 items [16], which indicates that the development and validation of short fatigue instruments may be particularly useful. In a study that aimed to develop a short version of the LFS [17], a sub-set of five of the original 13 items was found to be sufficient to measure fatigue severity and satisfy criteria for internal scale validity, uni-dimensionality and separation of patients’ fatigue into three distinct levels, which is sufficient for many clinical and research purposes. A short Norwegian version of the LFS has been used in clinical studies of Norwegian patients, but its psychometric properties have not been tested in the Norwegian general population. Such testing is needed to determine whether normative data from the general population can be used as reference values for LFS scores in clinical studies. Furthermore, the LFS, in contrast to other fatigue scales, has been used for two daily measurements over several days to describe morning and evening fatigue [3, 5, 18]. Since patients have to fill in the LFS so frequently, it is of particular interest to have a valid and reliable short version of the instrument.

Although fatigue is often studied in chronic illness populations, several studies show that fatigue is also experienced in the general population, including people without current diseases. Depending on the definition and cut-off value for fatigue cases, the prevalence of current fatigue in the general population has been reported to vary between 5 and 30% [19,20,21], and the prevalence of chronic fatigue (lasting more than six months) has ranged from 6 to 30% [19, 20, 22]. Given the large variation in prevalence between these studies, the validity of cut-off scores designating a positive case of fatigue is of concern. Due to the nature of fatigue, a short fatigue instrument that requires little energy for respondents to complete and has satisfactory psychometric properties is warranted to survey fatigue in the general population. It is important that fatigue measures be validated for use not only in various patient populations, but in the general population as well. No previous validation studies of the LFS in the general population exist. Thus, the aim of this study was to examine the psychometric properties, focusing on internal structure and precision [23], of a short version of the LFS in the general population in Norway.

Methods

A representative sample of the Norwegian population was surveyed in a cross-sectional study in order to establish normative data for a number of different instruments measuring different symptoms, health behaviors and attitudes. The National Population Register in Norway drew a representative sample of the Norwegian population. All registered citizens in Norway between 18 and 94 years of age were eligible to participate, and the sample was stratified according to age, sex, and geographic region. A total of 5500 citizens were invited to participate, including citizens from all of the country’s 19 counties. A more detailed description of the recruitment process has been published elsewhere [24, 25].

Procedures

The sample was mailed the questionnaires in 2015 with information about the study and a pre-paid return envelope. Each individual was mailed two reminders. Of the 4,971 survey recipients, 1,792 (36%) returned their survey. Between responders and non-responders, there were no significant differences in mean age, gender proportions or proportions living in rural versus urban areas [25]. The proportion of the sample in active work was 66%, compared to 67% in the general population [26]. Among responders and non-responders alike, 17% lived alone. Among responders, 1.3% were without work and 53% had higher education, compared to 4.4% and 41.0% in the general population, respectively [24]. In view of these comparisons, the sample was deemed fairly representative of the general Norwegian population.

Measurements

Fatigue severity was measured with a 5-item version of the LFS [1]. Each of the items has two anchor statements, and participants responded on an 11-point numeric rating scale, with responses ranging from 0 to 10. The anchor statements for the five items were: not at all tired – extremely tired (item #1), not at all fatigued – extremely fatigued (item #4), not at all worn out – extremely worn out (item #5), not at all bushed – extremely bushed (item #11), and not at all exhausted – extremely exhausted (item #12). The item numbers refer to the original LFS version [1]. The mean of the five item scores constitutes each participant’s fatigue severity score. Demographic information on age (in years) and sex (male, female) was also collected.
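The scoring rule described above can be sketched in a few lines. The function name, the dictionary layout and the choice to return no score for incomplete responses are assumptions made for illustration only; they are not taken from the study protocol.

```python
from statistics import mean

# Item numbers follow the original 13-item LFS, as listed in the text.
LFS5_ITEMS = (1, 4, 5, 11, 12)


def lfs_severity(responses):
    """Return the mean of the five item scores (each an integer from 0 to 10).

    `responses` maps item number -> score. Returning None when any item is
    missing is an assumption of this sketch, not a rule stated in the paper.
    """
    scores = []
    for item in LFS5_ITEMS:
        value = responses.get(item)
        if value is None:
            return None
        if not 0 <= value <= 10:
            raise ValueError(f"item {item}: score {value} is outside 0-10")
        scores.append(value)
    return mean(scores)


# Hypothetical respondent, not study data.
example = {1: 4, 4: 3, 5: 2, 11: 1, 12: 0}
print(lfs_severity(example))
```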

Ethics

The Regional Committee for Medical and Health Research Ethics South East was consulted prior to distributing the survey. Because this was an anonymous survey conducted by mail, the committee did not require formal review of the study. Returned surveys signified consent to participate.

Statistical analysis

Descriptive statistics were used to summarize sample demographics and fatigue scores.

The psychometric analysis of the LFS was guided by a Rasch rating scale model [27]. The transformed 11-category raw scores (0 to 10) from the five LFS items were analyzed using the WINSTEPS Rasch computer software program, version 3.91.0.0 [28]. The analyses were performed using a systematic stepwise approach similar to that used in previous studies [5].

In Step 1, an evaluation of the psychometric properties of the fatigue rating scale was conducted to determine whether the average measures for each category on each item advanced monotonically and whether the Outfit Mean Square (MnSq) values were < 2.0 for each of the step calibrations [29].
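As a minimal sketch of the Step 1 check, the function below treats monotonic advancement of the category average measures and the Outfit MnSq threshold as two separate conditions, which is one reading of the criteria. The function name and the input values are hypothetical; in practice the category statistics would be read from the Rasch software output.

```python
def rating_scale_ok(average_measures, outfit_mnsq, max_outfit=2.0):
    """Check the Step 1 criteria for one item's response categories.

    average_measures: mean person measure (logits) observed in each response
                      category, ordered from the lowest category upward.
    outfit_mnsq:      Outfit MnSq value reported for each category.
    """
    pairs = zip(average_measures, average_measures[1:])
    monotonic = all(lower < upper for lower, upper in pairs)
    fit_ok = all(value < max_outfit for value in outfit_mnsq)
    return monotonic and fit_ok


# Hypothetical category statistics for a three-category illustration.
print(rating_scale_ok([-2.0, -1.0, 0.5], [1.1, 0.9, 1.3]))
```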

Step 2 aimed to evaluate the fit of the item responses [27]. Any item that did not show acceptable goodness-of-fit to the model was removed, and the psychometric properties of the remaining items were re-analyzed until all remaining items demonstrated acceptable goodness-of-fit. For this study, acceptable goodness-of-fit was defined as Infit MnSq values between 0.7 and 1.3, which is stricter than the suggested guidelines for surveys using rating scales [30]. We chose to focus on infit statistics in this study as they are considered the most informative measure of goodness-of-fit given that they focus on the degree of fit in the most typical observations in the data [31]. We also focused on MnSq values rather than standardized z-values, as z-values are highly influenced by sample size [32]. In Step 2, we also evaluated the level of uni-dimensionality in the generated LFS measure by a principal component analysis (PCA) of the residuals, with the criterion that the first latent dimension should explain at least 50% of total variance [33]. We also monitored the standardized residual correlations between items using the Winsteps output and considered a correlation coefficient between items of 0.5 or higher (equal to a shared variance of 25% or more) to be a threat to local independence.
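To illustrate what the Infit MnSq criterion measures, the sketch below computes category probabilities under a rating scale model and derives the information-weighted (infit) and unweighted (outfit) mean squares for one item from textbook definitions. It is not the WINSTEPS implementation; the function names, the inclusive treatment of the 0.7 and 1.3 bounds, and all input values are assumptions made for illustration.

```python
import math


def rsm_probs(theta, delta, taus):
    """Category probabilities under the Rasch rating scale model.

    theta: person measure, delta: item difficulty, taus: step thresholds
    tau_1..tau_m (all in logits). Returns probabilities for categories 0..m.
    """
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + theta - delta - tau)
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def item_fit(observed, thetas, delta, taus):
    """Return (infit MnSq, outfit MnSq) for one item across persons."""
    squared_residuals, variances, squared_z = [], [], []
    for x, theta in zip(observed, thetas):
        p = rsm_probs(theta, delta, taus)
        expected = sum(k * pk for k, pk in enumerate(p))
        variance = sum((k - expected) ** 2 * pk for k, pk in enumerate(p))
        squared_residuals.append((x - expected) ** 2)
        variances.append(variance)
        squared_z.append((x - expected) ** 2 / variance)
    infit = sum(squared_residuals) / sum(variances)   # information-weighted
    outfit = sum(squared_z) / len(squared_z)          # unweighted mean
    return infit, outfit


def acceptable_infit(infit, low=0.7, high=1.3):
    """Apply the study's item criterion (bounds treated as inclusive here)."""
    return low <= infit <= high
```

Because the infit statistic weights each squared residual by the model variance, responses from persons whose measures lie close to the item contribute most, which matches the rationale given above for preferring infit over outfit.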

Step 3 evaluated aspects of person response validity. Acceptable person goodness-of-fit was defined as Infit MnSq values less than 1.4 or an associated z-value < 2, accepting that 5% of the sample may by chance fail to demonstrate acceptable goodness-of-fit without threatening evidence of person response validity [34,35,36].
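The person-fit rule can be sketched as follows: a person counts as misfitting only when both the Infit MnSq and the z-value fall outside the stated limits, and the resulting proportion is compared with the 5% allowance. The function name and the example values are hypothetical.

```python
def person_misfit_rate(infit_mnsq, infit_z, mnsq_cut=1.4, z_cut=2.0):
    """Proportion of persons with Infit MnSq >= mnsq_cut and z >= z_cut."""
    flags = [m >= mnsq_cut and z >= z_cut for m, z in zip(infit_mnsq, infit_z)]
    return sum(flags) / len(flags)


# Hypothetical person fit statistics, not study data.
rate = person_misfit_rate([0.9, 1.5, 1.6, 1.2], [0.1, 2.3, 1.1, 2.5])
print(rate, rate <= 0.05)
```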

In Step 4, differential item functioning (DIF) analyses were performed in order to evaluate the stability of the LFS response patterns in relation to age and sex using the Mantel–Haenszel statistics for polytomous scales using log-odds estimators [37, 38] as reported from the WINSTEPS program (p < 0.01 with Bonferroni correction). If DIF was detected for an item, a supplementary analysis of the impact of the item’s DIF was then calculated, based on a standardized z-comparison of the individual Rasch measures produced by the generic vs the sample-specific item hierarchies. Item DIF was considered to have minimal impact if 5% or less of the sample had a change of more than ± 1.96 z-value in their Rasch measures.
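The supplementary DIF-impact analysis could be sketched as below. The text does not give the exact formula for the standardized z-comparison, so this sketch assumes the difference between each person's two measures divided by the combined standard error; that formula, the function name and the example values are assumptions.

```python
import math


def dif_impact(generic, specific, se_generic, se_specific, z_cut=1.96):
    """Share of persons whose Rasch measure shifts by more than +/- z_cut.

    generic / specific: person measures (logits) estimated with the generic
    and the sample-specific item hierarchies. se_*: their standard errors.
    """
    shifted = 0
    rows = zip(generic, specific, se_generic, se_specific)
    for m_gen, m_spec, se_gen, se_spec in rows:
        # Assumed formula: difference over the combined standard error.
        z = (m_gen - m_spec) / math.sqrt(se_gen ** 2 + se_spec ** 2)
        if abs(z) > z_cut:
            shifted += 1
    return shifted / len(generic)


# Hypothetical measures for two persons, not study data.
share = dif_impact([0.0, 1.0], [0.1, 3.0], [0.5, 0.5], [0.5, 0.5])
print(share, share <= 0.05)  # impact is "minimal" when the share is <= 5%
```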

Step 5 assessed several aspects of the fatigue scale’s reliability. The unidimensional scale’s ability to separate participants into distinct groups was estimated using the person separation index. The Rasch-equivalent person reliability coefficient, as well as the Cronbach’s alpha reliability coefficient based on the LFS raw item scores, were also reported for the final unidimensional scale. Finally, a Pearson’s correlation coefficient was used to evaluate the relationship between the LFS mean raw score (calculated as the mean of the item raw scores) and the Rasch-generated measures.
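For readers who wish to reproduce the raw-score reliability statistics, the sketch below computes Cronbach's alpha and Pearson's correlation coefficient from their standard definitions using only the Python standard library. The function names and the example data are hypothetical; the Rasch measures themselves would come from the Rasch software.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha; `rows` holds one list of item scores per respondent."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson_r(x, y):
    """Pearson's correlation between two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


# Hypothetical 3-item responses and Rasch measures, not study data.
item_scores = [[1, 2, 1], [4, 5, 3], [7, 6, 8], [2, 2, 3]]
raw_means = [mean(row) for row in item_scores]
rasch_measures = [-1.8, 0.2, 1.9, -1.1]
print(cronbach_alpha(item_scores), pearson_r(raw_means, rasch_measures))
```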

Results

Fatigue in the general population

Of the 1792 survey respondents, 1767 had complete LFS scores for analysis. The mean age of respondents was 53.2 years (SD 16.6), with a range of 18–94 years. Fewer men (46.9%) than women (53.1%) responded. As shown in Table 1, women had a higher mean LFS score than men (p < 0.001), and adults < 55 years old had a higher mean LFS score than adults 55 years and older (p < 0.001). Descriptive statistics for each of the LFS items are shown in Table 2.

Table 1 Lee Fatigue Scale (LFS) mean scores for men and women (n = 1767)
Table 2 Descriptive statistics for five LFS items (n = 1767)

Rasch analysis of LFS psychometric properties

Table 3 summarizes the findings of the Rasch analysis. In Step 1, the rating scale of the LFS met the set criteria. When analyzing the infit mean square statistics for the five items in Step 2, two items (items #1 tired and #5 worn out) demonstrated unacceptable fit statistics (see Table 3). We excluded these items and re-ran the analysis with the remaining three items. In the second iteration, all three remaining items demonstrated acceptable goodness-of-fit to the model. The uni-dimensionality of the 3-item LFS was also acceptable, with the first dimension explaining 83.6% of the total variance.

Table 3 Overview of the Statistical Approach, Criteria, and Results of the Rasch Analysis of the LFS short form used in the general population (n = 1767)

In Step 3, 6.4% of the sample demonstrated misfit to the Rasch model in the 3-item LFS, which is close to but slightly above the set criterion of 5%. In Step 4, none of the three items demonstrated DIF in relation to sex or age (using a median split for age), so no supplementary analysis of the impact of item DIF was performed.

In Step 5, the separation index of the LFS scale decreased (from 3.45 to 2.49) after deleting the two items demonstrating misfit in Step 2, but still exceeded our set criterion. There was a strong correlation between the LFS raw score and the Rasch measure, which also remained stable (from 0.96 to 0.97) after deleting the two misfitting items. Person reliability for the Rasch measure and Cronbach’s alpha for the raw score were also acceptable but decreased after deleting the two misfitting items. Given the proportions of minimum and maximum scores, there was a minimal ceiling effect, but some evidence of a floor effect, which worsened after deleting the two misfitting items (from 7.3% to 18.0% having minimum scores).
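As a worked illustration of how the separation index relates to reliability and to the number of distinguishable groups, the sketch below applies two conversions commonly used in Rasch measurement: reliability R = G² / (1 + G²) and strata H = (4G + 1) / 3. These formulas are not stated in the text and are included here as assumptions; the resulting values are implied by the formulas, not reported study results.

```python
def reliability_from_separation(g):
    """Reliability implied by a separation index G: G^2 / (1 + G^2)."""
    return g ** 2 / (1 + g ** 2)


def strata(g):
    """Number of statistically distinct levels: (4G + 1) / 3."""
    return (4 * g + 1) / 3


# Separation indices reported for the 5-item and 3-item versions.
for g in (3.45, 2.49):
    print(g, round(reliability_from_separation(g), 2), round(strata(g), 2))
```

Under these conventions, a separation index of 2.49 corresponds to between three and four strata, which is in line with the three fatigue levels referred to elsewhere in the article.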

Discussion

In this population-based study, a 3-item version of the Lee Fatigue Scale demonstrated acceptable psychometric properties for assessing fatigue in the general population. These findings provide further evidence for the psychometric properties of the LFS and support prior studies where we have shown that short versions of the LFS have satisfactory internal scale validity, uni-dimensionality and sensitivity to separate individuals with fatigue into three different groups (low, medium and high levels of fatigue) [17, 39]. Thus, a shorter measure of fatigue severity may be as psychometrically sound as longer measures. This is particularly important for measures of fatigue, where the level of fatigue may affect the respondents’ ability and capacity to participate in fatigue studies. While previous studies have assessed the psychometric properties of the LFS-VAS with scores converted from the 100 mm VAS to numeric scores from 0–10, the present study showed that the average measures for each response category advanced monotonically. The floor effect of the 3-item LFS was relatively high and higher than in previous studies using other versions of the LFS. However, this can be explained by the fact that the current sample represents the general population, where fatigue is expected to be considerably lower than in samples of clinical patients with acute or chronic illness. Thus, the floor effect in this study may not be considered a psychometric weakness, as it is expected that not everyone in the population would experience even a low perceived level of fatigue. While a previous psychometric assessment of the LFS was performed in a sample of women, the current study included men and demonstrated that two of the items from the LFS measure showed DIF by sex, i.e. women endorsed “fatigue” more easily than men, while men endorsed “tired” more easily than women.

The core dimension of fatigue severity used in the 5-item English language LFS is reflected in the similar but unique wording of items (tired, fatigued, worn out, bushed and exhausted). The same core dimension may not be reflected when using Norwegian words with very similar meanings. Language may indeed impact the cultural validity of a short Norwegian version of LFS with the particular combination of items used in our study. Conceptual translations for cross-cultural comparisons are complex and would require additional studies to ensure that items or concepts in different languages address the same construct.

Slightly higher levels of fatigue than expected were found in persons demonstrating misfit to the Rasch model. More data may be needed here to assess whether any systematic pattern among the persons demonstrating misfit can be detected and allow for specific subgroup analysis. Earlier studies using the LFS have shown item calibration differences between diagnostic groups [7], which may be hidden in general adult population samples, such as the sample used in this study.

Some of the items had a relatively high proportion of minimum scores. However, since our study surveyed fatigue in the general population, we expected that a large proportion would not experience fatigue.

A significant challenge for building cumulative knowledge of fatigue in different groups of patients and populations is the absence of consensus among researchers about which fatigue instrument should be used. There are several promising initiatives to develop international recommendations for measuring fatigue and psychometrically sound instruments that are available in many languages, such as the 13-item PROMIS SF v1.0 Fatigue Scale [40]. While such measures may have some advantages, the psychometrically valid 3-item LFS may be more suitable when a shorter scale is needed.

Strengths

A relatively large sample drawn from the general population provides evidence of validity of a short three-item version of the LFS that is also sensitive enough to differentiate the sample into three distinct groups by level of fatigue.

Limitations

Although our findings in a Norwegian version of the LFS are similar to those in the English version, caution in interpreting the findings is warranted. Translating fatigue and similar symptoms from the original English version is difficult and can be problematic when attempting to generalize our findings to other non-English speaking cultures. Even American English could have subtle differences in meaning compared to British English wording for items in the LFS. The Norwegian LFS items evaluated in this study originate from a translation done for a prior study with a diagnosis-specific female sample [41]. In addition, the translation process was not performed according to the current standards described in the COSMIN framework [42]. Furthermore, since the survey did not include the full original Lee Fatigue Scale, we only had the opportunity to validate this short version based on data generated using the pre-selected items. Thus, the generalizability and applicability of the LFS to a wider population can be questioned. With a limited number of fatigue items in this study, there is also a risk that some other LFS items, not included here, could have demonstrated acceptable goodness-of-fit to the Rasch model. This could also affect evidence based on test content, potentially missing additional essential aspects of fatigue. Various validity studies of the LFS have also suggested different item combinations to measure the level of fatigue in a valid manner [17, 39]. This can be viewed as a limitation, especially when comparing outcomes from different studies using different item combinations. A solution to this challenge could be to apply a Rasch analysis model to generate a stable item bank or item hierarchy across large samples with diversity in diagnoses and languages that demonstrate acceptable validity, evidence of stability and internal construct validity. The generic weights of each item calibration could then be used to select subsets of items for specific studies and samples while still generating comparable measures. Such a systematic approach should also be better grounded in the COSMIN guidelines regarding the full validation process, including translation.

Conclusions

We surveyed Norwegian adults across the age spectrum to evaluate the psychometric properties of a Norwegian version of the LFS. Our results provide evidence for the validity of a short three-item version of the tool. Therefore, this version of the instrument can be readily applied to measure levels of fatigue in the general Norwegian population.