Introduction

A competing risk arises when an individual is at risk of more than one mutually exclusive event and the occurrence of one event precludes the occurrence of the primary event of interest [1, 2]. Death is a competing event in fracture risk assessment because individuals who have died can no longer sustain a fracture. Assessing the absolute risk of fracture is crucial in the diagnosis and management of osteoporosis as fragility fractures pose a significant public health problem worldwide [3, 4]. This assessment enables the identification of suitable preventive and therapeutic interventions that can be implemented to mitigate the risk of fracture and its associated consequences. Nevertheless, the estimation of the predicted risk of fracture becomes inaccurate in the presence of a competing risk of death as elderly individuals might die of other causes before experiencing a fracture [1, 2]. The adjustment for competing risk of death is therefore highly relevant for fracture assessment since fracture incidence increases exponentially with age and the global population is aging.

At present, anti-fracture treatment is recommended for non-osteoporotic individuals whose 10-year risk of fracture is at least 20% [5, 6]. However, all existing fracture risk assessment tools have inadequate predictive capability [7]. Among the tools with the best discrimination ability, the Garvan [8] and QFracture Calculators [9] overestimate the risk of fracture in the high-risk groups, whereas the FRAX® [10] underestimates fracture risk [7]. In addition to differences in predictors, these models handle the competing risk of death differently. While the Garvan [8] and QFracture Calculators [9] treat death without a fracture as a right-censored event, the FRAX® [10] is stated to “account for competing death,” though it remains unclear how it does so.

The most common methods for competing risk adjustment are the cause-specific hazard regression and the Fine-Gray sub-distribution hazard regression [11, 12, 13]. The cause-specific hazard regression combines a model for fracture with a model for mortality to estimate the risk of fracture [12, 13]. The Fine-Gray method assumes that individuals who have encountered a competing event are still susceptible to the primary event. This assumption, though unnatural in the context of competing death, is technically necessary to establish the one-to-one relationship with the cumulative incidence function, which allows the model to correctly predict the absolute risk of the primary event [14]. The multistate regression method models the progression from one state to another (e.g., from no fracture to mortality, from no fracture to fracture, or from fracture to mortality). This approach enables the competing risk of death to be naturally taken into account [15], resulting in unbiased estimates for each related outcome individually [16]. In this study, we sought to test the hypothesis that the multistate model predicts fracture risk as accurately as the most common competing risk adjustment methods. The results of this study will offer valuable insights into individualized risk assessment, thereby aiding in the identification of individuals who are at an increased risk of fracture. Using fracture risk assessment as an example, the study can inform the more general problem of predictive research in the presence of competing risks.

Methods

Study design and participants

We used data from the Dubbo Osteoporosis Epidemiology Study, whose design and protocols have been described in detail elsewhere [17]. Briefly, through the electoral roll and a media campaign, all community-dwelling women and men aged 60 years or older as of 30 June 1989, living in Dubbo City, New South Wales, Australia, were invited to participate in the study. There are only one hospital and three radiology services for the entire Dubbo region. This centralized healthcare system, together with a geographically isolated community, allows complete ascertainment of all fractures and deaths among people aged 60 years or older in the whole Dubbo region, making censoring minimal [18]. The study was approved by St. Vincent’s Hospital Human Research Ethics Committee, New South Wales, Australia (HREC reference number: 13/254) and carried out according to the Australian National Health and Medical Research Council Guidelines, consistent with the Declaration of Helsinki. All participants provided written informed consent.

Regular visits were conducted biennially for a detailed and ongoing assessment of bone health. At recruitment and each visit, a nurse coordinator interviewed participants by administering a structured questionnaire to obtain anthropometric data, lifestyle factors, number of falls during the previous 12 months, prior fracture after the age of 50 years, chronic health disorders and medications prescribed. Bone mineral density (BMD) was measured at the lumbar spine and femoral neck by dual-energy x-ray absorptiometry (Lunar DPX-L; GE-Lunar).

Outcome assessments

The X-ray reports from all three radiology services for the entire Dubbo area were reviewed regularly to identify incident fractures occurring from recruitment through the end of follow-up. The circumstances surrounding each fracture were determined by phone call after each fracture. The analysis included only fractures resulting from minimal trauma, defined as trauma no greater than a fall from standing height. High-trauma fractures, those due to underlying diseases (e.g., cancer or Paget disease), and those of the digits, skull, or cervical spine were excluded. All deaths in the region were identified from funeral lists and obituary reviews, with verification from the State Registry of Births, Deaths, and Marriages.

Statistical analysis

As all models included five predefined predictors [8] and the study aimed to quantify the predictive performance in the validation cohort, the study population was randomly split into the development cohort (60%) and the validation cohort (40%) [19].
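A random 60/40 split of this kind can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_cohort`, the seed, and the use of integer IDs 1 to 3035 in place of real participant identifiers are all assumptions made for the example.

```python
import random

def split_cohort(ids, dev_fraction=0.6, seed=2018):
    """Randomly partition participant IDs into development and validation sets."""
    rng = random.Random(seed)      # fixed (arbitrary) seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_dev = round(dev_fraction * len(shuffled))
    return sorted(shuffled[:n_dev]), sorted(shuffled[n_dev:])

# Hypothetical IDs standing in for the 3035 study participants.
development, validation = split_cohort(range(1, 3036))
print(len(development), len(validation))
```

Fixing the seed is a design choice that keeps the development and validation cohorts identical across reruns, so that every model is evaluated on exactly the same participants.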

First, we fitted four regression models that apply different statistical methods to account for the competing death in the development cohort. They included (i) the conventional Cox’s proportional hazard model, (ii) the cause-specific hazard model, (iii) the Fine-Gray sub-distribution hazard model, and (iv) the multistate model. The conventional model estimates the risk of fracture, right censoring the competing death (i.e., the death without a fracture); whereas the cause-specific hazard, Fine-Gray and multistate models apply different methods to account for the competing death (Supplemental Methods).

Briefly, the conventional approach models fracture risk under the assumption that individuals who remain under follow-up have the same fracture risk as those who die without a fracture as if the occurrence of fracture is independent of the occurrence of death without a fracture (Figure S1A). By contrast, the cause-specific hazard approach, as the name implies, models the cause-specific hazards for fracture and those for death without a fracture separately and then combines these two models’ coefficients to obtain a valid estimation of the cumulative hazard for fracture (Figure S1B) [11, 12]. The Fine-Gray method treats individuals who have died without a fracture as if they are still at risk of fracture, representing “immortal” time, but assigns a gradual reduction of weights for those with the competing death in modeling fracture risk (Figure S1C) [14]. Finally, the multistate model treats fracture and death without a fracture as two separate “states” but takes their complex inter-correlation into account (Figure S1D) [15]. Whereas the other approaches compute the cumulative incidence of fracture, the multistate model estimates the transition risk from the “event-free” state to the “fracture” state at a particular time t which is technically the fraction of individuals with a fracture at time t.
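To make the contrast between censoring the competing death and accounting for it concrete, the sketch below computes two nonparametric quantities on a toy data set: the complement of the Kaplan–Meier estimate, which treats death without a fracture as censoring, and the Aalen–Johansen cumulative incidence of fracture. It illustrates the principle only and is not one of the regression models fitted in this study; the function name `naive_and_cif`, the event coding, and the ten toy observations are assumptions.

```python
def naive_and_cif(times, events, horizon):
    """Return (1 - Kaplan-Meier, Aalen-Johansen cumulative incidence) for fracture.

    events: 0 = censored, 1 = fracture, 2 = death without a fracture.
    """
    km_surv = 1.0        # fracture-free "survival" with deaths treated as censored
    overall_surv = 1.0   # probability of being alive and fracture-free
    cif = 0.0
    for t in sorted(set(times)):
        if t > horizon:
            break
        at_risk = sum(1 for u in times if u >= t)
        d1 = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d2 = sum(1 for u, e in zip(times, events) if u == t and e == 2)
        cif += overall_surv * d1 / at_risk     # uses survival just before t
        km_surv *= 1 - d1 / at_risk
        overall_surv *= 1 - (d1 + d2) / at_risk
    return 1 - km_surv, cif

# Toy data: ten hypothetical participants followed for up to 10 years.
toy_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_events = [1, 2, 1, 2, 1, 2, 1, 0, 0, 0]
naive, cif = naive_and_cif(toy_times, toy_events, horizon=10)
print(round(naive, 3), round(cif, 3))
```

In this toy example four of the ten participants fracture, and the cumulative incidence recovers that proportion, whereas the Kaplan–Meier complement is larger because participants who die are treated as if they could still fracture later.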

Follow-up time was calculated from the recruitment date to the date of fracture for individuals who fractured, to the date of death for individuals who died without a fracture, and to the date of last visit or 30 June 2018, whichever came first, for those who remained alive and fracture-free. All four models used the same fracture predictors, namely sex, age, femoral neck BMD, the presence of falls during the last 12 months and the presence of a fracture after the age of 50 years but before study entry [8], to allow cross-comparison of their predictive performance. These predictor variables had no missing data. The proportional hazards assumption was checked graphically using the Schoenfeld residuals [20].
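The follow-up rule described above can be sketched as a small helper. This is an illustrative reading of the rule, not the study's code: the function name `follow_up`, the status labels, the conversion of days to years with 365.25, and the example dates are assumptions.

```python
from datetime import date

STUDY_END = date(2018, 6, 30)   # administrative end of follow-up given in the text

def follow_up(recruited, fracture=None, death=None, last_visit=None):
    """Return (years of follow-up, status) for one participant.

    status is "fracture", "death" (death without a fracture) or "censored".
    """
    if fracture is not None:                 # first fracture ends follow-up
        end, status = fracture, "fracture"
    elif death is not None:                  # competing event
        end, status = death, "death"
    else:                                    # alive and fracture-free
        candidates = [d for d in (last_visit, STUDY_END) if d is not None]
        end, status = min(candidates), "censored"
    return (end - recruited).days / 365.25, status

# Hypothetical participant recruited in 1990 who fractured in 1995.
print(follow_up(date(1990, 1, 1), fracture=date(1995, 1, 1)))
```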

Secondly, we quantified the predictive accuracy of the four regression models in the validation cohort using both discrimination and calibration analyses, which have been widely employed to validate existing fracture risk assessment tools at clinically relevant time points [7]. Specifically, we examined the predicted absolute risks of fracture at 5 and 10 years of follow-up, with the primary focus on the 10-year risk that is widely used in practice to identify high-risk individuals [5, 6, 21].

The discrimination performance was primarily quantified using Harrell’s concordance C index [22], with a value closer to 1 indicating better discrimination. Harrell’s C index was calculated separately for each of the four models of interest. We used a flexible calibration curve with the addition of confidence limits for predicted group categorization [23] as the primary measure of moderate calibration, which has been shown to be realistic in epidemiologic research and is considered a pragmatic guarantee that decision-making based on the model is not clinically harmful [24]. The calibration curve is constructed for centiles of predicted fracture risk, with closer agreement between the predicted fracture risks and observed fracture rates along the line of perfect prediction indicating better calibration. The predicted fracture risks were estimated from the prediction models for each participant in the validation cohort at a single time point, whereas the observed fracture rates were computed as the number of participants who sustained a fracture up to that time point divided by the total number of participants in each centile of predicted risk [24]. The calibration curve with its corresponding 95% confidence interval (CI) is then drawn with the average predicted fracture risk in each centile on the x-axis and the observed fracture rate in the same centile on the y-axis [24]. For quantitative comparison of the calibration performance across the models, we reported the “calibration-in-the-large” index, which quantifies the overall difference between the average observed event rate and the average predicted risk [25], and the estimated calibration index, which summarizes a flexible calibration curve into a single value [26]. Ideally, the calibration-in-the-large index is zero. The prediction of fracture risk was considered accurate if the average predicted values were not significantly different from the average observed fracture events (i.e., the 95% CI of the calibration-in-the-large index includes the reference value of zero) [25]. Similarly, the estimated calibration index, calculated as the average squared difference between the predicted risk and the observed event rate, is zero if the flexible calibration curve is perfect. The estimated calibration index has thus been recommended as a valid measure for easily comparing calibration performance across different prediction models [26]. Other secondary measures of the models’ discrimination and calibration performance were also reported (Table S1).
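The ideas behind these measures can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not the implementation used in the study: `harrell_c` treats every non-fracture outcome as censored, `calibration_summary` takes calibration-in-the-large as the plain difference between the observed event proportion and the mean predicted risk (the published values may be on a different scale), and the calibration index is computed from risk groups rather than from a flexible smoothed curve. All function names and toy inputs are hypothetical.

```python
def harrell_c(times, events, risks):
    """Harrell's C: share of usable pairs in which the participant who
    fractured earlier also had the higher predicted risk (risk ties count 0.5)."""
    concordant = usable = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # usable pair: i fractured, and did so before j's observed time
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / usable

def calibration_summary(predicted, observed, n_groups=5):
    """Return (observed minus predicted mean risk, binned squared-error index)."""
    n = len(predicted)
    citl = sum(observed) / n - sum(predicted) / n
    order = sorted(range(n), key=lambda k: predicted[k])
    sq_err = 0.0
    for g in range(n_groups):                      # assumes n >= n_groups
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        mean_pred = sum(predicted[k] for k in idx) / len(idx)
        obs_rate = sum(observed[k] for k in idx) / len(idx)
        sq_err += len(idx) * (mean_pred - obs_rate) ** 2
    return citl, sq_err / n

# Toy inputs: four hypothetical participants.
c_index = harrell_c([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.3, 0.5, 0.2])
citl, index = calibration_summary([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 0], n_groups=2)
print(c_index, citl, index)
```

With the observed-minus-predicted convention used here, a negative calibration-in-the-large value indicates that the model overestimates risk on average.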

We conducted a sensitivity analysis that mimicked the “standard” data collection in a conventional longitudinal study. The sensitivity analysis, therefore, censored the follow-up time at the last visit date and included only outcome events (i.e., fracture or death) occurring at or before the “hypothetical” last visit that a participant would have attended had he or she neither died nor been lost to follow-up (Figure S2).

The analyses were performed using the R statistical environment on a Windows platform (R-4.0.2) [27]. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.

Results

Baseline characteristics

The study involved 3035 participants aged 60 years or older at recruitment who were randomly split into the development and validation cohorts (Fig. 1). During a median of 11.3 years of follow-up (IQR: 7.2, 16.2), 628 (34.5%) and 630 (34.6%) participants in the development cohort fractured or died without a fracture, yielding the incidence rate of 3.35 fractures/100 person-years (95% CI: 3.10, 3.63) and 2.87 deaths/100 person-years (2.65, 3.10), respectively. Within the first 5 years of follow-up, 223 participants (12.3%) fractured and 146 (12.2%) died without a fracture.

Fig. 1 Flowchart of recruitment and follow-up

As expected, the participants who had sustained a fracture or died without a fracture were older and had more chronic health disorders than those who remained event-free until the end of the study (Table 1). Fracture patients also had poorer bone health, more falls during the last 12 months and a greater number of fractures prior to the study entry. The incidence of fracture and of mortality estimated by the Kaplan–Meier method, which ignores the competing risk of death, was higher than that estimated by a cumulative incidence function accounting for the competing death (Figure S3).

Table 1 Baseline characteristics of the participants stratified by fracture and mortality status

In the development cohort, the coefficients associated with predictors of fractures were identical between the conventional Cox’s model, cause-specific hazard, and multistate models (Table 2). The proportional hazard assumption was met for all four fracture risk models.

Table 2 Association between baseline risk factors and fracture: comparison of the regression coefficients between the conventional Cox’s proportional hazard model with no competing risk adjustment and the competing risk models, including the cause-specific model, the Fine-Gray model, and the multistate model

Comparison of the predictive performance between models

In the validation cohort, 418 participants (34.4%) had died and 418 participants (34.4%) had fractured within a median follow-up time of 11.1 years (IQR: 6.9, 16.6). These figures were highly comparable with the development group (Table 1).

While all prediction models had similar discrimination performance, their calibration performance for fracture prediction was affected by the prediction time (Figs. 2, 3; Table 3). Specifically, we found similar measures of discrimination performance across all prediction models, e.g., the C-statistic ~0.73 and 0.69, the Somers’ Dxy ~0.47 and 0.38, and the discrimination index ~0.07 and 0.07 at 5- and 10-year risk assessments, respectively. By contrast, all competing risk models predicted fracture risk accurately up to 5 years, whereas the conventional Cox’s model tended to overestimate the 5-year risk of fracture with the calibration-in-the-large index being −0.24 (95% CI: −0.43, −0.06) (Fig. 2; Table 3). The differences in the predicted probability of fracture between the models became pronounced when death without a fracture became prevalent (Fig. 3; Table 3). Specifically, at 10-year risk assessment, the conventional Cox’s model overestimated fracture risk significantly (calibration-in-the-large index: −0.54; 95% CI: −0.70, −0.39). By contrast, the multistate model (calibration-in-the-large index: −0.05; 95% CI: −0.20, 0.10) demonstrated more accurate prediction for fracture risk than the cause-specific hazard (−0.23; −0.30, −0.08) and Fine-Gray model (−0.31; −0.46, −0.16). Similar patterns were observed from other measures of the predictive performance, such as the estimated calibration index (0.236 in the multistate model versus 1.151, 0.241, and 0.296 in the conventional, cause-specific, and Fine-Gray models, respectively) (Table 3).

Fig. 2 Prediction accuracy of the conventional Cox’s proportional hazard model with no competing risk adjustment, and the competing risk models, including the cause-specific model, the Fine-Gray model, and the multistate model: 5-year fracture risk

Fig. 3 Prediction accuracy of the conventional Cox’s proportional hazard model with no competing risk adjustment, and the competing risk models, including the cause-specific model, the Fine-Gray model, and the multistate model: 10-year fracture risk

Table 3 Performance measures of the conventional Cox’s proportional hazard model with no competing risk adjustment, and the competing risk models, including the cause-specific model, the Fine-Gray model, and the multistate model

The sensitivity analysis was conducted to mimic the “standard” data collection in which the outcome data of the lost-to-follow-up participants are not obtainable. As expected, the sensitivity analysis had a shorter follow-up time and fewer outcome events than the primary one. During a median follow-up of 9.7 years (IQR: 5.9, 14.5), 505 (27.7%) and 271 (14.9%) participants in the development cohort fractured or died without a fracture, respectively. Nonetheless, the outcomes of the sensitivity analysis corroborated those of the main analysis, indicating that the multistate model was capable of predicting fracture risk as accurately as the cause-specific hazard model, and more accurately than both the conventional Cox’s and Fine-Gray models (Figure S4).

Discussion

The competing risk of death is an important statistical consideration in prediction research that involves elderly people who are also at high risk of death because failure to account for the competing death could result in an inaccurate estimation of the primary event risk, probably leading to treating the wrong patients. In this study, by using real-world data with a full assessment of fracture and mortality, we demonstrated that without accounting for the competing risk of death, there is a risk of overestimating the long-term probability of fracture. More importantly, we found that the multistate regression outperformed all other methods in adjusting for competing risks.
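A back-of-the-envelope calculation shows why the overestimation grows with the prediction horizon. The sketch below assumes constant hazards equal to the crude incidence rates reported for the development cohort (3.35 fractures and 2.87 deaths without a fracture per 100 person-years). That is a strong simplification made only for illustration, and the function names are hypothetical; it is not how the models in this study were fitted.

```python
import math

def naive_risk(h_fracture, years):
    """Fracture risk if death is ignored: 1 - exp(-h * t)."""
    return 1 - math.exp(-h_fracture * years)

def cumulative_incidence(h_fracture, h_death, years):
    """Fracture risk when death without a fracture removes people from risk."""
    total = h_fracture + h_death
    return h_fracture / total * (1 - math.exp(-total * years))

# Crude rates per person-year, taken from the Results; constancy is an assumption.
H_FRACTURE, H_DEATH = 0.0335, 0.0287
for t in (5, 10):
    print(t, round(naive_risk(H_FRACTURE, t), 3),
          round(cumulative_incidence(H_FRACTURE, H_DEATH, t), 3))
```

Under these assumptions the two estimates differ by about one percentage point at 5 years and by roughly three to four percentage points at 10 years, which is qualitatively in line with the pattern observed across the fitted models.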

Our results are consistent with a previous study which observed that an adjustment for competing risk of death, regardless of the statistical methodology, has minimal impact on the short-term (e.g., 2 years) prediction accuracy [28]. However, for long-term prediction (e.g., 10 years), failure to adjust for competing death resulted in an overestimation of the predicted risk of the event of interest [1, 11, 12, 13, 29]. We also found that the multistate model yielded better calibration than the cause-specific hazard and Fine-Gray competing risk models in the presence of a competing event of death.

If the basic observation is considered to be multiple small segments of follow-up for each individual, rather than a single time-to-event or censoring point, there are alternatives to Cox’s proportional hazard model. These alternatives include, among others, the Poisson regression model, which focuses on the event rate rather than the time to event, and the accelerated failure time model. In fact, the development of FRAX was based on the Poisson regression model [10]. However, a notable advantage of the Cox model is its ability to easily produce estimates of survival probabilities in clinical studies with a well-defined common entry time for all individuals, thus utilizing a single timescale. The adjustment for competing risk of death in the Cox model is nonetheless methodologically challenging. For the cause-specific hazard model, the model must be fitted not only to the primary event but also to all competing events to obtain a valid prediction [12, 13]. Although the cause-specific coefficients are derived from the separate cause-specific models, the mathematical formula used to combine the different cause-specific hazard models for predicting the primary event is often seen as a “black box,” making it difficult to communicate prediction rules [13]. Furthermore, the assumption of proportionality is often not met for all cause-specific models, and therefore, it may be necessary to recognize non-proportionality or incorporate a time-interaction term to account for the estimated effect size varying over time [12]. By contrast, the Fine-Gray model, though recommended for predictive research in the presence of a competing death, relies on a counterintuitive assumption that individuals who have died are still at risk of the primary event [11, 12, 14]. Furthermore, under the Fine-Gray model it is not possible to directly link the effect of the risk factors for fracture to the underlying fracture risk in real-world scenarios [30]. As argued by Fine and Gray [14], the individuals who have experienced the competing risk event might be viewed as a “placeholder” for a fraction of the population who cannot experience the primary event and, as such, can constrain the definition of the sub-distribution hazard function. To account for the competing risk, the sub-distribution hazard for the primary event (i.e., fracture) among individuals who have experienced the competing event is calculated with a gradual reduction in weight over time [14].
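For readers who prefer notation, the sub-distribution hazard and its link to the cumulative incidence function can be written as below. The symbols are introduced here for illustration and do not appear in the original text: T is the time to the first event, ε = 1 denotes fracture, F_1 is the cumulative incidence of fracture, Ĝ is the Kaplan–Meier estimate of the censoring distribution, and w_i(t) is the weight given at time t to participant i who died without a fracture at time T_i < t.

```latex
\tilde{\lambda}_1(t)
  = \lim_{\Delta t \to 0}
    \frac{P\{\, t \le T < t + \Delta t,\ \varepsilon = 1 \mid
              T \ge t \ \cup\ (T < t,\ \varepsilon \ne 1) \,\}}{\Delta t},
\qquad
F_1(t) = 1 - \exp\!\left( -\int_0^t \tilde{\lambda}_1(s)\, ds \right),
\qquad
w_i(t) = \frac{\hat{G}(t)}{\hat{G}(T_i)} .
```

The risk set in the conditioning event keeps those who have already died without a fracture, which is the counterintuitive assumption noted above, and the weight w_i(t) decreases as t moves further beyond T_i whenever censoring occurs in the interval. With no censoring the weight stays at 1.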

In view of the above conceptual and methodological challenges, the multistate model appears to be superior to the cause-specific hazard and Fine-Gray models in fracture risk assessment for the elderly who are also at high risk of death. It is widely recognized that patients with a history of fractures are at greater risk of mortality compared to those without a fracture. Furthermore, those with a fracture have a higher likelihood of sustaining a subsequent fracture, which significantly increases their risk of mortality [3]. The multistate model enables the estimation of the probability of transitioning from one state to another, which is highly relevant in osteoporosis research [31, 32]. It is also easy to formulate a multistate model to estimate predicted risk based on a specific risk profile, making risk prediction rules more easily communicated than with the cause-specific hazard model. The assumption of the multistate model is more intuitive than the Fine-Gray model as the population at risk in the multistate model does not include individuals who have died. Most importantly, the multistate model is capable of estimating the risk of other correlated events within a single framework, providing adequate consideration for the complex intercorrelations among the events of interest, including competing risks [16]. For instance, the multistate model can predict not only the risk of fracture, or death without a fracture but also the consequences of fracture such as a subsequent fracture and post-fracture mortality [31, 32]. Therefore, we consider that the multistate model is a method of choice for the assessment of fracture risk in the elderly who are also at high risk of mortality.
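The notion of estimating several related outcomes within one framework can be illustrated with a toy illness-death model. The sketch assumes constant transition hazards, which the study does not: the fracture and death-without-fracture hazards are set to the crude rates reported in the Results, the post-fracture mortality hazard of 0.05 per person-year is purely hypothetical, and the function name `state_probabilities` is invented for the example.

```python
import math

def state_probabilities(h_frac, h_death, h_death_after_frac, years):
    """State occupation probabilities for an illness-death model with constant
    transition hazards (requires h_frac + h_death != h_death_after_frac)."""
    out = h_frac + h_death                    # total hazard of leaving "event-free"
    event_free = math.exp(-out * years)
    ever_fractured = h_frac / out * (1 - event_free)
    dead_without_fracture = h_death / out * (1 - event_free)
    alive_with_fracture = (h_frac / (out - h_death_after_frac)
                           * (math.exp(-h_death_after_frac * years) - event_free))
    return {
        "event_free": event_free,
        "alive_with_fracture": alive_with_fracture,
        "dead_after_fracture": ever_fractured - alive_with_fracture,
        "dead_without_fracture": dead_without_fracture,
    }

# 0.0335 and 0.0287 are crude rates per person-year from the Results;
# 0.05 is a hypothetical post-fracture mortality hazard.
probs = state_probabilities(0.0335, 0.0287, 0.05, years=10)
for state, p in probs.items():
    print(f"{state}: {p:.3f}")
```

The four probabilities sum to one at any time point, and the probability of ever having fractured is the sum of the "alive_with_fracture" and "dead_after_fracture" states, so fracture risk, death without a fracture and post-fracture mortality are read from the same model.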

Using fracture as an example, our study sheds light on the more general issue of predicting an adverse health event in the presence of a competing risk of death in predictive research on aging. The multistate model should be considered one of the preferred statistical approaches, and all competing risk models examined in this study should be considered and compared for predictive research in the presence of a competing risk. Specifically, the findings can be used to extend the current recommendation [33]: the most rigorous scientific approach to analyzing competing risk data should require that not only the cause-specific hazards from the cause-specific model and the cumulative incidence function from the Fine-Gray model but also the transition risks from the multistate model be analyzed and reported side by side. The results from different competing risk models are expected to complement each other and provide a more thorough understanding of the event dynamics.

Our findings should be considered within the context of their strengths and limitations. This is the first methodological study to compare the predictive performance of the multistate model against the current competing risk models. The analysis was conducted using real-world data of more than 3000 elderly people whose health status has been monitored for a median of 11 years, providing ample time for the majority of participants to experience either the primary event of fracture or the competing event of death. Another strength of this study is that it was conducted in a geographically isolated setting with a centralized healthcare system, enabling a comprehensive assessment of outcome events, capturing data for participants who did not present for subsequent visits, and making censoring minimal. Additionally, the sensitivity analysis mimicking conventional cohort studies yielded findings consistent with the primary analysis, further supporting the robustness of our results.

A potential limitation of the study is its limited generalizability since the study population was primarily Caucasian and from an industrialized country with a low mortality rate. However, as the comparison of predictive performance was made between different models in the same study context, i.e., the same predictors and the same study population, it is unlikely that the overall trends observed in our findings would differ significantly in other settings with different ethnicities and/or higher mortality rates.

In conclusion, our findings suggest that for long-term assessment, the adjustment for competing risk of death is necessary to produce an unbiased and accurate predicted fracture risk, and the multistate model should be considered one of the preferred statistical methods for adjusting competing risk in predictive research in the elderly.