Abstract
Background
Clinical prediction models (CPMs), such as the SCOAP-CERTAIN tool, can enhance decision-making for lumbar spinal fusion surgery by providing quantitative outcome estimates, aiding surgeons in assessing the potential benefits and risks for each individual patient. External validation is crucial for CPMs to assess generalizability beyond the development dataset; it ensures performance in diverse populations and the reliability and real-world applicability of the results. We therefore externally validated the tool's ability to predict improvement in the Oswestry Disability Index (ODI) and in back and leg pain (BP, LP).
Methods
Prospective and retrospective data from a multicenter registry were obtained. As outcome measure, the minimum clinically important change 12 months after lumbar fusion for degenerative disease was chosen, defined as a ≥ 15-point reduction in ODI and a ≥ 2-point reduction on the numeric rating scales (NRS) for BP and LP. We externally validated the tool by calculating discrimination and calibration metrics, including calibration intercept and slope, Brier score, expected/observed ratio, the Hosmer–Lemeshow (HL) test, AUC, sensitivity, and specificity.
Results
We included 1115 patients with a mean age of 60.8 ± 12.5 years. For 12-month ODI, the area under the curve (AUC) was 0.70, and the calibration intercept and slope were 1.01 and 0.84, respectively. For NRS-BP, the AUC was 0.72, with a calibration intercept of 0.97 and a slope of 0.87. For NRS-LP, the AUC was 0.70, with a calibration intercept of 0.04 and a slope of 0.72. Sensitivity ranged from 0.63 to 0.96, while specificity ranged from 0.15 to 0.68. HL testing indicated a lack of fit for all three models.
Conclusions
Utilizing data from a multinational registry, we externally validated the SCOAP-CERTAIN prediction tool. The models demonstrated only fair discrimination and calibration of predicted probabilities, necessitating caution in applying them in clinical practice. We suggest that future CPMs focus on predicting longer-term prognosis for this patient population, emphasizing the importance of robust calibration and thorough reporting.
Introduction
The most prevalent symptoms leading to an indication for lumbar spinal fusion are back and leg pain (BP, LP), along with compromised functionality [1]. Nevertheless, data on the effectiveness of surgical intervention in alleviating these symptoms for the individual patient, considering all of that patient's characteristics, remain unclear [2, 3]. Especially in patients with degenerative disease of the lumbar spine, and even more so in patients with discogenic chronic low back pain, some patients profit massively from fusion surgery, while others experience no difference at all or even worsen. On average, there is evidence that spinal fusion in this population is no better than conservative management [4, 5]. To address this issue and enhance the ability to distinguish which patients may benefit more from lumbar spinal fusion, Khor et al. [6] introduced the SCOAP-CERTAIN tool in 2018. This tool aims to determine the probabilities of improvement in function, back pain, and leg pain for lumbar fusion candidates one year after surgery [6]. The models demonstrated good accuracy in both the development and internal validation cohorts, making them potentially suitable for integration into everyday clinical practice. Still, the importance of rigorously validating clinical prediction models (CPMs) on multicenter data from different populations (external validation) cannot be stressed enough: only through proper external validation can the reliability and clinical applicability of CPMs be ensured [7, 8]. Up to now, the SCOAP-CERTAIN tool has only been validated in a single Dutch center with 100 patients, revealing good discrimination but rather poor calibration [9]. As predicted probabilities hold more significance for clinicians and patients than binary classifications when making decisions about surgery, it might be premature to apply the current prediction tool in clinical practice.
Hence, we aimed to conduct a comprehensive external validation study involving 1115 patients from multiple centers to reevaluate the predictive ability of the Khor et al. [6] model regarding improvement in function and pain following lumbar spinal fusion for degenerative disease.
Materials and methods
Overview
A dataset comprising 1115 patients who underwent elective lumbar spinal fusion for degenerative disease from a multinational study (FUSE-ML) [10] was utilized to externally validate the machine learning-based model published by Khor et al. [6]. This model predicts improvement in functional outcome (Oswestry Disability Index, ODI), back, and leg pain. We compared the values predicted by their model with the true outcomes at 12 months after lumbar fusion in our cohort, providing a rigorous multicenter external validation of this model. Approval for the utilization of patient data in research was granted by individual local institutional review boards (IRBs) of FUSE-ML centers. Patients either gave informed consent, or the requirement for informed consent was waived as per the local IRB’s stipulations.
Patient population
Data were extracted from a prospective registry that included patients undergoing elective thoracolumbar pedicle screw placement for up to 6 levels, addressing degenerative pathologies such as spinal stenosis, spondylolisthesis, disc herniation, failed back surgery syndrome (FBSS), radiculopathy, or pseudarthrosis. Patients were excluded if the primary surgical indication was one of the following: infection, spinal tumor, fracture (traumatic or osteoporotic), or deformity surgery for scoliosis. Additionally, patients with moderate or severe scoliosis (coronal Cobb angle > 30° or Schwab classification sagittal modifier + or ++), those with missing outcome data at 12 months or a lack of informed consent, and those younger than 18 years were excluded. Our manuscript was developed in line with the guidelines outlined in the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement [11].
Data collection
Data preparation adhered to the guidelines established by Khor et al. [6]. Primary clinical and radiological information was obtained during the initial outpatient appointment. Subsequently, patients underwent a comprehensive clinical examination and magnetic resonance imaging (MRI). Collected parameters included the Oswestry Disability Index (ODI) and numeric rating scales for back (NRS-BP) and leg (NRS-LP) pain, as well as gender, age, smoking status, ethnicity, American Society of Anesthesiologists (ASA) grade, opioid consumption, presence of asthma, and prior spine surgery. In one center, functional outcome was assessed using the Core Outcome Measures Index (COMI), which was converted into the ODI according to a validated mapping function [12]. The clinical outcome parameters ODI, NRS-BP, and NRS-LP were collected again at the 12-month follow-up.
Outcome measures
ODI, NRS-BP, and NRS-LP were collected as standardized questionnaires with values ranging from 0 to 100 [13] and 0 to 10 [14], respectively, with higher values representing greater functional disability or pain severity. As in Khor et al. [6], we defined clinical improvement as achievement of the minimum clinically important change (MCIC), i.e., a ≥ 15-point reduction for ODI and a ≥ 2-point reduction for NRS-BP and -LP [15, 16].
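The MCIC definitions above reduce to a simple threshold rule on the change from baseline. The following minimal Python sketch is purely illustrative (the function and variable names are ours, not part of the SCOAP-CERTAIN tool):

```python
def achieved_mcic(baseline: float, followup: float, threshold: float) -> bool:
    """MCIC is reached when the score drops (improves) by at least `threshold` points."""
    return (baseline - followup) >= threshold

# Thresholds from the text: >= 15 points for ODI, >= 2 points for NRS-BP/-LP.
print(achieved_mcic(baseline=48, followup=30, threshold=15))  # True: ODI improved by 18
print(achieved_mcic(baseline=6, followup=5, threshold=2))     # False: NRS improved by only 1
```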
Statistical analysis
Missingness is reported in Supplementary Table 1. Missing predictor values, presumed to be missing at random, were imputed using a k-nearest-neighbor approach [17]. Patients with a baseline ODI below 15 or NRS below 2 were removed from the respective analyses, as specified by Khor et al. [6], because these patients cannot achieve the MCIC in the respective outcome. The three CPMs of the SCOAP-CERTAIN tool were then reconstructed using the reported model coefficients and intercepts. We calculated the area under the receiver operating characteristic curve (AUC) by comparing predicted probabilities with the observed MCIC outcome at 12 months. Calibration was evaluated by visual inspection of calibration curves and quantitatively through the calibration intercept and slope (optimal values 0 and 1, respectively). Calibration assesses the extent to which a model's predicted probabilities, spanning from 0 to 100%, align with the observed incidence of the binary endpoint, which represents the true posterior [18]. Additionally, we examined the expected/observed event ratio (E/O ratio), which describes the overall calibration of a prediction model [7], the Brier score [19], and the estimated calibration index (ECI) [20]. Likewise, the Hosmer–Lemeshow (HL) test was employed to assess goodness-of-fit, gauging whether observed event rates align with expected event rates within subgroups of the population [21]. The binary classification threshold was set at 0.5, as this cutoff is the one most likely used by Khor et al. [6] and also appears suitable for our dataset based on the "closest-to-(0,1)" criterion.
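For illustration, the core calibration metrics named above can be computed as in the pure-Python sketch below. This is not the R code used in the study; the Newton-Raphson fitting approach and all names are our own. The calibration intercept is estimated with the slope fixed at 1 (calibration-in-the-large), and the slope by refitting a logistic model on the logit of the predicted probabilities, both standard conventions:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibration_metrics(p, y, iters=25):
    """Calibration intercept (slope fixed at 1), calibration slope, E/O ratio,
    and Brier score for predicted probabilities p (strictly between 0 and 1,
    and not all identical) and binary outcomes y. Illustrative sketch only."""
    x = [logit(pi) for pi in p]
    n = len(y)

    # Calibration-in-the-large: fit y ~ a + offset(x) by 1-D Newton-Raphson.
    a = 0.0
    for _ in range(iters):
        mu = [sigmoid(a + xi) for xi in x]
        grad = sum(yi - mi for yi, mi in zip(y, mu))
        hess = sum(mi * (1.0 - mi) for mi in mu)
        a += grad / hess
    intercept = a

    # Calibration slope: fit y ~ a2 + b * x by 2-D Newton-Raphson.
    a2, b = 0.0, 0.0
    for _ in range(iters):
        mu = [sigmoid(a2 + b * xi) for xi in x]
        w = [mi * (1.0 - mi) for mi in mu]
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01  # nonzero as long as x varies
        a2 += (h11 * g0 - h01 * g1) / det
        b += (-h01 * g0 + h00 * g1) / det
    slope = b

    eo = sum(p) / sum(y)  # expected events / observed events
    brier = sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / n
    return intercept, slope, eo, brier
```

On a perfectly calibrated toy sample (five patients with predicted risk 0.2, one of whom has the event, and five with risk 0.8, four of whom do), this returns an intercept near 0, a slope near 1, an E/O ratio of 1, and a Brier score of 0.16.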
Following this, we compared the binary classifications to the actual observed MCIC outcome in confusion matrices and calculated various performance metrics, including accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the F1 Score. All continuous data are reported as mean ± standard deviation (SD). Whenever relevant, we offer bootstrapped 95% confidence intervals (CIs) using 1000 resamples with replacement. All analyses were performed using R Statistical Software (v 4.3.0; R Core Team 2023) [22].
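The confusion-matrix metrics and the percentile bootstrap described above can be sketched as follows. This is an illustrative Python rendering under our own naming, not the study's R code; undefined ratios (e.g., sensitivity in a resample without events) are returned as NaN rather than raising an error:

```python
import random

def _div(num, den):
    """Safe division: undefined ratios become NaN instead of crashing."""
    return num / den if den else float("nan")

def classification_metrics(p, y, threshold=0.5):
    """Confusion-matrix metrics after dichotomizing predicted probabilities."""
    pred = [1 if pi >= threshold else 0 for pi in p]
    tp = sum(1 for pr, yi in zip(pred, y) if pr == 1 and yi == 1)
    tn = sum(1 for pr, yi in zip(pred, y) if pr == 0 and yi == 0)
    fp = sum(1 for pr, yi in zip(pred, y) if pr == 1 and yi == 0)
    fn = sum(1 for pr, yi in zip(pred, y) if pr == 0 and yi == 1)
    return {
        "accuracy": _div(tp + tn, len(y)),
        "sensitivity": _div(tp, tp + fn),
        "specificity": _div(tn, tn + fp),
        "ppv": _div(tp, tp + fp),
        "npv": _div(tn, tn + fn),
        "f1": _div(2 * tp, 2 * tp + fp + fn),
    }

def bootstrap_ci(p, y, metric, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample patients with replacement,
    recompute the metric, and take the empirical alpha/2 quantiles."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(len(y)) for _ in y]
        stats.append(classification_metrics([p[i] for i in idx],
                                            [y[i] for i in idx])[metric])
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```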
Results
Overview
A total of 1115 patients were included in this study, with a mean (SD) age of 60.8 (12.5; range 19–89) years; 455 (40.8%) were male. Patient characteristics and surgical parameters are presented in Table 1. Lumbar spinal stenosis accounted for 55.4% of surgical indications, followed closely by spondylolisthesis at 53.7% (patients could have more than one indication). A high ASA grade was present in 29.5% of patients.
Patient‑reported outcome
At the 12-month postoperative mark, ODI scores showed a notable improvement, with a mean change of −21.8 ± 16.7 from baseline. NRS-BP and NRS-LP improved by −3.1 ± 2.4 and −2.5 ± 2.5, respectively. The MCIC was achieved by 68% of patients for ODI, and by 77% and 71% of patients for NRS-BP and NRS-LP, respectively. Table 1 summarizes the outcome measures for our external validation cohort.
Performance evaluation
Calibration
Table 2 lists the calibration metrics of the external validation in detail. In predicting the MCIC in ODI at 12 months, we observed a calibration intercept of 1.01 (95% CI 0.87–1.16) and a slope of 0.84 (95% CI 0.68–1.01), along with an HL p-value < 0.001 (Fig. 1A). The low E/O ratio of 0.75 (95% CI 0.71–0.79) suggests that the model tended to underestimate the likelihood of a favorable outcome. In our multicenter cohort, the Brier score of 0.22 (95% CI 0.21–0.23) indicated moderate accuracy of the probability forecasts.
Similarly, when predicting MCIC in NRS-BP, we identified a calibration intercept of 0.97 (95% CI 0.80–1.15), a slope of 0.87 (95% CI 0.70–1.08), and an E/O ratio of 0.97 (95% CI 0.93–1.00), with a corresponding HL p-value < 0.001 (Fig. 1B). For the prediction of MCIC in NRS-LP, we found a calibration intercept of 0.04 (95% CI −0.14 to 0.22) and a slope of 0.72 (95% CI 0.55–0.90); the HL p-value was < 0.001. Notably, the model appeared to overestimate the likelihood of a favorable outcome when applied to our cohort, as seen in the calibration plot (Fig. 1C) and further supported by the high E/O ratio of 1.16 (95% CI 1.13–1.20). Brier scores for back and leg pain indicated greater accuracy than for ODI, at 0.16 (95% CI 0.15–0.18) and 0.15 (95% CI 0.13–0.16), respectively.
Discrimination
Table 3 provides a comprehensive overview of discrimination measures, while Fig. 2A–C illustrate the AUC values for each individual center for the three models during external validation. When predicting MCIC for ODI, we achieved an AUC of 0.70 (95% CI 0.67–0.74), with a sensitivity of 0.63 (95% CI 0.59–0.66) and specificity of 0.68 (95% CI 0.62–0.73). Similarly, in predicting MCIC for NRS-BP, we obtained AUC values of 0.72 (95% CI 0.68–0.76), sensitivity at 0.84 (95% CI 0.82–0.87), and specificity at 0.45 (95% CI 0.38–0.52) during external validation. Finally, for predicting NRS-LP, the AUC reached 0.70 (95% CI 0.66–0.74), with a very high sensitivity of 0.96 (95% CI 0.94–0.97) and low specificity of 0.15 (95% CI 0.10–0.20).
Discussion
To address the significant variability in postoperative outcome after lumbar fusion surgery caused by the wide range of patient characteristics [5], CPMs have been developed to assist in the decision-making process [23]. Khor's models demonstrated good calibration and performance in their internal validation cohort [6], with comparable values in a small single-center external validation cohort [9]. Here, we performed a rigorous multicenter external validation of Khor's models (coined the SCOAP-CERTAIN tool) for predicting achievement of the MCIC in three different clinical outcomes at 12 months after lumbar fusion for degenerative disease. With data from the FUSE-ML study, we assessed the generalization of these CPMs and found that, while the models generalize moderately well in terms of discrimination (binary prediction), their calibration (continuous risk assessment) lacks robustness, even though the cohorts appear comparable.
It is notoriously difficult to predict treatment response for patients undergoing lumbar spinal fusion for degenerative disease. While some diagnoses, such as isthmic spondylolisthesis, represent a relatively clear indication for fusion, for others, such as low-grade degenerative spondylolisthesis with stenosis, the benefit of adding fusion is less clear [24, 25]. The most extreme example certainly is chronic low back pain with concomitant discopathy [26]. While some individual patients with this pathology do profit from fusion, an unselected population does not: randomized studies consistently indicate that, on the whole, fusion surgery does not yield significantly superior outcomes compared to conservative treatment for chronic low back pain [27]. Although surgery may not exhibit a clear advantage over conservative approaches in unselected patients with chronic low back pain, specific subsets of patients can genuinely benefit [28]. The critical factor for success in degenerative spine surgery lies in meticulous patient selection.
In the past, different methods were established to help select the best treatment option for the individual patient. From discography to pantaloon casting to radiological modifiers such as Modic-type endplate changes, many potential predictors of surgical success have been evaluated, but often with very limited predictive ability [26, 28]. Initially, mostly radiological or physician-based outcomes were assessed; over time, patient-reported outcome measures (PROMs) such as the ODI [29] were implemented and validated in an attempt to quantify and weigh symptoms and thereby justify the risks and benefits of a potential surgery [30]. This opened up the possibility of truly personalized medicine: the aim of medical decision-making today is to consider every aspect of a patient's physical and mental characteristics in order to choose the treatment that best fulfills a wide range of demands, such as symptom relief for the patient, healing or preventing progression of disease, and balancing healthcare costs by avoiding unnecessary diagnostics, treatments, and complications [31, 32]. Another delicate aspect complicating medical decision-making is the wide range of symptoms that can be present in patients with degenerative lumbar spine disease, e.g., facet-mediated, discogenic, or myofascial pain [33], among others. Ideally, we could pinpoint specific symptoms or patient characteristics for which lumbar fusion is known to provide relief. With more information on the patient, e.g., on comorbidities, to weigh the general risks of surgery against the expected benefit, this could lead to improved risk–benefit counseling in clinics [34].
Thus, the aim of CPMs in the surgical field is to identify which patients will benefit from a certain intervention and which will not. Khor et al. [6] published an internally validated CPM tool (SCOAP-CERTAIN) that aims to assist surgical decision-making by predicting which patients scheduled for lumbar spinal fusion for degenerative disease are most likely to show significant 12-month improvement in functional outcome and pain severity. Rigorous multicenter, multicultural external validation is a crucial step before the clinical implementation of CPMs [7, 8, 35]. To assess the generalization of a CPM, calibration and discrimination need to be quantified [36]. Discrimination refers to a model's capacity to correctly categorize patients in a binary fashion, namely into those experiencing the MCIC and those without clinically relevant improvement. The model's capability to generate accurate predicted probabilities (between 0 and 1) that closely align with the true posterior (observed frequency) is, in turn, termed calibration. The SCOAP-CERTAIN tool had previously been evaluated in a small single-center external validation study of Dutch patients, demonstrating adequate discrimination but only fair calibration [9]. In a previous study of the FUSE-ML study group, a second, simpler CPM for the same outcomes was developed, with the goal of achieving similar predictive power with fewer input variables [10]. This goal was broadly achieved, and within that study, a small external validation of the SCOAP-CERTAIN tool (in three centers with a total of 298 patients, to compare the performance of both CPMs) was carried out, again showing relatively robust discrimination but only fair calibration of both models [10].
Although CPMs in degenerative spine surgery could in theory be highly beneficial in the clinical context, rigorous external validation is necessary first to make sure that models are not "let loose too early" [8, 34]. In particular, models should be tested not only in one or two small cohorts but in a wide range of patient populations from multiple countries and continents. If performance then proves robust, it can reasonably be assumed that the CPM will achieve the expected predictive performance in real-world patients, and the model can safely be rolled out. In the present study, we performed such an extensive external validation. With AUCs between 0.70 and 0.72 for ODI, NRS-BP, and NRS-LP, we demonstrated good discrimination, comparable to the values reported in Khor et al.'s initial internal validation study (0.66–0.79) [6]. Yet calibration, evaluated through diverse metrics, again demonstrated only moderate performance, as in the previous small external validation studies. At internal validation, Khor et al. had documented calibration intercepts ranging from −0.02 to 0.16, along with slopes spanning 0.80–1.05, whereas we found a wider range of intercepts from 0.04 to 1.01 and less well-calibrated slopes of 0.72–0.87, even though the outcome distribution was similar to that of the development cohort (calibration intercepts are known to depend strongly on differences in outcome distribution) [37]. In summary, there was substantial heterogeneity in the observed calibration slopes, along with a higher ECI, a measure of overall calibration defined as the average squared difference between the predicted probabilities and their grouped estimated observed probabilities [18], and clearly worse goodness-of-fit by the method of Hosmer and Lemeshow [21].
The HL method divides the sample into (usually 10) groups of predicted probabilities and compares observed to expected event counts using a chi-square distribution, with a p-value > 0.2 usually seen as an indication of fair calibration and goodness-of-fit [18, 21]. Of course, as is the goal of external validation, our cohort represents a much more heterogeneous population than the development cohort, now including European and Asian individuals, which explains some of the lack of generalization in terms of calibration. In the realm of CPMs, calibration arguably carries a more significant role than discrimination alone [37]. This is because clinicians and patients are typically more concerned with the predicted probability of a specific endpoint than with a binary classification; individual patients, after all, are not binary, but carry a spectrum of expected risks and benefits [7]. Hence, insufficient calibration poses a significant obstacle to the clinical and external applicability of prediction models. Another potential explanation for the poor generalization in terms of calibration lies in differing definitions of input variables: although our data collection adhered strictly to the definitions provided by Khor et al. [6], institutional protocols and inter-rater assessments still vary. This is one of the general limitations of CPMs based on tabulated medical data: because data must first undergo multiple stages of summarization and simplification by human healthcare providers, the overall predictive power can quickly reach "ceiling effects" due to input heterogeneity. This is another reason why external validation is so crucial: to test whether CPMs work just as well when applied in a real-world environment (effectiveness vs. efficacy).
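The decile-based procedure just described can be sketched in a few lines of Python. This is illustrative code under our own naming, not the implementation used in the study; the closed-form p-value used here is exact only because the degrees of freedom (groups − 2 = 8) are even:

```python
import math

def hosmer_lemeshow(p, y, groups=10):
    """Hosmer-Lemeshow goodness-of-fit test: chi-square statistic over
    equal-sized groups of ranked predicted probabilities (0 < p < 1),
    with a closed-form chi-square p-value valid for even df = groups - 2."""
    pairs = sorted(zip(p, y))  # rank patients by predicted probability
    n = len(pairs)
    chi2 = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        obs = sum(yi for _, yi in chunk)    # observed events in group
        exp_ = sum(pi for pi, _ in chunk)   # expected events in group
        m = len(chunk)
        # Events and non-events combined: (O - E)^2 / (E * (1 - E/m)).
        chi2 += (obs - exp_) ** 2 / (exp_ * (1.0 - exp_ / m))
    df = groups - 2
    # Chi-square survival function, exact for even df:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    p_value = math.exp(-chi2 / 2.0) * sum(
        (chi2 / 2.0) ** i / math.factorial(i) for i in range(df // 2))
    return chi2, p_value
```

On a synthetic, perfectly calibrated cohort the statistic is essentially zero and the p-value close to 1, whereas inverting the outcome rates drives the p-value toward zero, mirroring the lack of fit reported in our results.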
In the future, direct inclusion of source data (such as MRI) without human coding, or automated data collection through natural language processing, might somewhat alleviate this bottleneck [38].
Still, even if not perfectly calibrated in a rigorous external validation study, the models published by Khor et al. [6] are admirable and show good generalization overall, especially in terms of discrimination, with no signs of overfitting. Overfitting manifests as a relevant difference between training and testing performance in terms of discrimination [35]; for a well-fitted model, out-of-sample performance is typically comparable to or slightly worse than training performance, and the discrimination observed in our external validation fits this pattern well. We conclude that the SCOAP-CERTAIN model can safely be applied in clinical practice, keeping in mind that predicted probabilities (calibration) should only be used as rough estimates and that binary predictions, while generalizing well (discrimination), are still no more accurate than an AUC of around 0.70 implies.
In the end, in the realm of degenerative spine surgery, well-validated CPMs such as the SCOAP-CERTAIN [6] or FUSE-ML [10] models should be used only cautiously, as rough estimates offering an objective "second opinion" in the risk–benefit counseling of patients, and never as absolute red or green lights for surgical indication. We suggest that future models also be capable of predicting longer-term prognosis, as longer-term outcomes will improve the robustness of outcome data in lumbar fusion patients. This could be achieved by incorporating more extended follow-up data and reducing short-term variability, leading to a more comprehensive understanding of patient trajectories, which is essential for effective clinical decision-making and improved calibration.
Additionally, it is crucial that future studies, as in the present external validation, report key metrics such as sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for the models assessed. Reporting these metrics enables better assessment and validation of the predictions, thereby enhancing the reliability and applicability of CPMs in practice.
Limitations
Regarding the primary surgical indication, our cohort mostly showed lumbar spinal stenosis, spondylolisthesis, and discogenic low back pain, whereas in Khor's cohort radiculopathy was the leading diagnosis, followed by stenosis and spondylolisthesis [6]. Of course, surgical indications and especially the chosen techniques vary between centers, which is exactly why multicenter external validation is important. Compared to the development cohort, we also included lateral techniques, which broadens the range of included patients. We used a mixed cohort (FUSE-ML) of partially prospectively and partially retrospectively collected data. These two collection strategies are known to have a relevant influence on the data (especially on complications, which fortunately are not a topic here) as well as on missingness, and could therefore affect the final analysis [39]. On the other hand, the fact that the models still generalized relatively well on these heterogeneous data is precisely the point of external validation and further demonstrates the robustness of the Khor et al. [6] models. Due to the lack of long-term (> 2 years) follow-up data, even with good calibration and discrimination performance, we are only able to predict short- and mid-term outcomes; more evaluation of CPMs on long-term data is necessary. The validated models also do not predict surgical risks such as perioperative complications or long-term adjacent segment degeneration, information that would be particularly useful in risk–benefit discussions. Finally, since neither the FUSE-ML nor the SCOAP-CERTAIN models provide a prognosis for the natural history or conservative treatment of these degenerative conditions, they offer only half of the answer when deciding between surgical and conservative treatment strategies.
Conclusion
Utilizing data from a multinational registry, we externally validated the SCOAP-CERTAIN prediction tool. While the models demonstrated good discrimination, the calibration of predicted probabilities was only fair, necessitating caution in applying them in clinical practice. We propose that future CPMs consider predicting longer-term prognosis for this patient population, and we emphasize the importance of rigorous external validation, robust calibration, and sound reporting.
References
Kepler CK et al (2014) National trends in the use of fusion techniques to treat degenerative spondylolisthesis. Spine 39(19):1584–1589. https://doi.org/10.1097/BRS.0000000000000486
Ivar Brox J et al (2003) Randomized clinical trial of lumbar instrumented fusion and cognitive intervention and exercises in patients with chronic low back pain and disc degeneration. Spine 28(17):1913–1921. https://doi.org/10.1097/01.BRS.0000083234.62751.7A
Fairbank J, Frost H, Wilson-MacDonald J, Yu L-M, Barker K, Collins R (2005) Randomised controlled trial to compare surgical stabilisation of the lumbar spine with an intensive rehabilitation programme for patients with chronic low back pain: the MRC spine stabilisation trial. BMJ 330(7502):1233. https://doi.org/10.1136/bmj.38441.620417.8F
Birkmeyer NJO et al (2002) Design of the spine patient outcomes research trial (SPORT). Spine 27(12):1361–1372. https://doi.org/10.1097/00007632-200206150-00020
Weinstein JN et al (2009) Surgical compared with nonoperative treatment for lumbar degenerative spondylolisthesis: four-year results in the spine patient outcomes research trial (SPORT) randomized and observational cohorts. J Bone Jt Surg-Am Vol 91(6):1295–1304. https://doi.org/10.2106/JBJS.H.00913
Khor S et al (2018) Development and validation of a prediction model for pain and functional outcomes after lumbar spine surgery. JAMA Surg 153(7):634. https://doi.org/10.1001/jamasurg.2018.0072
Riley RD et al (2016) External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ. https://doi.org/10.1136/bmj.i3140
Staartjes VE, Kernbach JM (2020) Significance of external validation in clinical machine learning: let loose too early? Spine J Off J North Am Spine Soc 20(7):1159–1160. https://doi.org/10.1016/j.spinee.2020.02.016
Quddusi A et al (2020) External validation of a prediction model for pain and functional outcome after elective lumbar spinal fusion. Eur Spine J 29(2):374–383. https://doi.org/10.1007/s00586-019-06189-6
Staartjes VE et al (2022) FUSE-ML: development and external validation of a clinical prediction model for mid-term outcomes after lumbar spinal fusion for degenerative disease. Eur Spine J Off Publ Eur Spine Soc Eur Spinal Deform Soc Eur Sect Cerv Spine Res Soc 31(10):2629–2638. https://doi.org/10.1007/s00586-022-07135-9
Collins GS, Ogundimu EO, Altman DG (2016) Sample size considerations for the external validation of a multivariable prognostic model: a resampling study: sample size considerations for validating a prognostic model. Stat Med 35(2):214–226. https://doi.org/10.1002/sim.6787
Mannion AF et al (2022) Development of a mapping function (“crosswalk”) for the conversion of scores between the oswestry disability index (ODI) and the core outcome measures index (COMI). Eur Spine J 31(12):3337–3346. https://doi.org/10.1007/s00586-022-07434-1
Fairbank JCT, Pynsent PB (2000) The oswestry disability index. Spine 25(22):2940–2953. https://doi.org/10.1097/00007632-200011150-00017
Childs JD, Piva SR, Fritz JM (2005) Responsiveness of the numeric pain rating scale in patients with low back pain. Spine 30(11):1331–1334. https://doi.org/10.1097/01.brs.0000164099.92112.29
Fekete TF, Haschtmann D, Kleinstück FS, Porchet F, Jeszenszky D, Mannion AF (2016) What level of pain are patients happy to live with after surgery for lumbar degenerative disorders? Spine J 16(4):S12–S18. https://doi.org/10.1016/j.spinee.2016.01.180
Ostelo RWJG et al (2008) Interpreting change scores for pain and functional status in low back pain: towards international consensus regarding minimal important change. Spine 33(1):90–94. https://doi.org/10.1097/BRS.0b013e31815e3a10
Templ M, Kowarik A, Alfons A, Prantner B (2019) VIM: visualization and imputation of missing values. https://CRAN.R-project.org/package=VIM Accessed 5 Jan 2020
Staartjes VE, Regli L, Serra C (2022) Machine learning in clinical neuroscience: foundations and applications. In: Acta neurochirurgica supplement, vol 134. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-030-85292-4
Brier GW (1950) Verification of forecasts expressed in terms of probability. Mon Weather Rev 78(1):1–3. https://doi.org/10.1175/1520-0493(1950)078%3c0001:VOFEIT%3e2.0.CO;2
Van Hoorde K, Van Huffel S, Timmerman D, Bourne T, Van Calster B (2015) A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J Biomed Inform 54:283–293. https://doi.org/10.1016/j.jbi.2014.12.016
Hosmer DW, Lemeshow S, Sturdivant RX (2013) Applied logistic regression. In: Wiley series in probability and statistics. Wiley. https://doi.org/10.1002/9781118548387
R Core Team (2023) R: A language and environment for statistical computing. R foundation for statistical computing, Vienna, Austria
Senders JT et al (2018) Machine learning and neurosurgical outcome prediction: a systematic review. World Neurosurg 109:476–486. https://doi.org/10.1016/j.wneu.2017.09.149
Ghogawala Z, Barker FG, Benzel EC (2016) Fusion surgery for lumbar spinal stenosis. N Engl J Med 375(6):600–601
Försth P et al (2016) A randomized, controlled trial of fusion surgery for lumbar spinal stenosis. N Engl J Med 374(15):1413–1423. https://doi.org/10.1056/NEJMoa1513721
Staartjes VE, Vergroesen P-PA, Zeilstra DJ, Schröder ML (2018) Identifying subsets of patients with single-level degenerative disc disease for lumbar fusion: the value of prognostic tests in surgical decision making. Spine J 18(4):558–566. https://doi.org/10.1016/j.spinee.2017.08.242
Mannion AF, Brox J-I, Fairbank JC (2016) Consensus at last! Long-term results of all randomized controlled trials show that fusion is no better than non-operative care in improving pain and disability in chronic low back pain. Spine J 16(5):588–590. https://doi.org/10.1016/j.spinee.2015.12.001
Willems P (2013) Decision making in surgical treatment of chronic low back pain: the performance of prognostic tests to select patients for lumbar spinal fusion. Acta Orthop 84(sup349):1–37. https://doi.org/10.3109/17453674.2012.753565
Van Hooff ML, Mannion AF, Staub LP, Ostelo RWJG, Fairbank JCT (2016) Determination of the oswestry disability index score equivalent to a “satisfactory symptom state” in patients undergoing surgery for degenerative disorders of the lumbar spine—a spine tango registry-based study. Spine J 16(10):1221–1230. https://doi.org/10.1016/j.spinee.2016.06.010
Falavigna A et al (2017) Current status of worldwide use of patient-reported outcome measures (PROMs) in spine care. World Neurosurg 108:328–335. https://doi.org/10.1016/j.wneu.2017.09.002
Kim JS et al (2018) Examining the ability of artificial neural networks machine learning models to accurately predict complications following posterior lumbar spine fusion. Spine 43(12):853–860. https://doi.org/10.1097/BRS.0000000000002442
Ehlers AP et al (2017) Improved risk prediction following surgery using machine learning algorithms. EGEMs Gener Evid Methods Improve Patient Outcomes 5(2):3. https://doi.org/10.13063/2327-9214.1278
Mattei TA, Rehman AA, Teles AR, Aldag JC, Dinh DH, McCall TD (2017) The ‘lumbar fusion outcome score’ (LUFOS): a new practical and surgically oriented grading system for preoperative prediction of surgical outcomes after lumbar spinal fusion in patients with degenerative disc disease and refractory chronic axial low back pain. Neurosurg Rev 40(1):67–81. https://doi.org/10.1007/s10143-016-0751-6
Steinmetz MP, Mroz T (2018) Value of adding predictive clinical decision tools to spine surgery. JAMA Surg. https://doi.org/10.1001/jamasurg.2018.0078
Kernbach JM, Staartjes VE (2022) Foundations of machine learning-based clinical prediction modeling: part II—generalization and overfitting. machine learning in clinical neuroscience. In: Staartjes VE, Regli L, Serra C (eds) Acta neurochirurgica supplement, vol 134. Springer International Publishing, Cham, pp 15–21. https://doi.org/10.1007/978-3-030-85292-4_3
Collins GS, Reitsma JB, Altman DG, Moons KGM (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 350(jan07 4):g7594–g7594. https://doi.org/10.1136/bmj.g7594
Staartjes VE, Kernbach JM (2020) Importance of calibration assessment in machine learning-based predictive analytics. J Neurosurg Spine 32(6):985–987
Staartjes VE, Stienen MN (2019) Data mining in spine surgery: leveraging electronic health records for machine learning and clinical research. Neurospine 16(4):654–656. https://doi.org/10.14245/ns.1938434.217
Nagurney JT (2005) The accuracy and completeness of data collected by prospective and retrospective methods. Acad Emerg Med 12(9):884–895. https://doi.org/10.1197/j.aem.2005.04.021
Acknowledgements
We thank the patients whose anonymized data were used for this research.
Funding
Open access funding provided by University of Zurich. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflict of interest
The authors declare that the article and its content were composed in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Grob, A., Rohr, J., Stumpo, V. et al. Multicenter external validation of prediction models for clinical outcomes after spinal fusion for lumbar degenerative disease. Eur Spine J 33, 3534–3544 (2024). https://doi.org/10.1007/s00586-024-08395-3