Abstract
Background
Clinical prediction models (CPMs), such as the SCOAP-CERTAIN tool, can enhance decision-making for lumbar spinal fusion surgery by providing quantitative outcome estimates, aiding surgeons in assessing the potential benefits and risks for each individual patient. External validation is crucial for CPMs to assess generalizability beyond the development dataset; it ensures performance in diverse populations and the reliability and real-world applicability of the results. We therefore externally validated the tool's ability to predict improvement in the Oswestry Disability Index (ODI) and in back and leg pain (BP, LP).
Methods
Prospective and retrospective data from a multicenter registry were obtained. As outcome measure, the minimum clinically important change 12 months after lumbar fusion for degenerative disease was chosen, defined as a ≥ 15-point reduction in ODI and a ≥ 2-point reduction on the numeric rating scales (NRS) for BP and LP. We externally validated the tool by calculating discrimination and calibration metrics, including calibration intercept and slope, Brier score, expected/observed ratio, the Hosmer–Lemeshow (HL) test, AUC, sensitivity, and specificity.
Results
We included 1115 patients with a mean age of 60.8 ± 12.5 years. For 12-month ODI, the area under the curve (AUC) was 0.70, and the calibration intercept and slope were 1.01 and 0.84, respectively. For NRS-BP, the AUC was 0.72, with a calibration intercept of 0.97 and a slope of 0.87. For NRS-LP, the AUC was 0.70, with a calibration intercept of 0.04 and a slope of 0.72. Sensitivity ranged from 0.63 to 0.96, while specificity ranged from 0.15 to 0.68. HL testing indicated a lack of fit for all three models.
Conclusions
Utilizing data from a multinational registry, we externally validated the SCOAP-CERTAIN prediction tool. The models demonstrated only fair discrimination and calibration of predicted probabilities, necessitating caution in applying them in clinical practice. We suggest that future CPMs focus on predicting longer-term prognosis for this patient population, emphasizing the importance of robust calibration and thorough reporting.
Introduction
The most prevalent symptoms leading to an indication for lumbar spinal fusion are back and leg pain (BP, LP), along with compromised functionality [1]. Nevertheless, data on the effectiveness of surgical intervention in alleviating these symptoms for the individual patient, considering all of that patient's characteristics, remain unclear [2, 3]. Especially in patients with degenerative disease of the lumbar spine, and even more so in patients with discogenic chronic low back pain, some patients profit massively from fusion surgery, while others experience no difference at all or even worsen. On average, there is evidence that spinal fusion in this population is no better than conservative management [4, 5]. To address this issue and enhance the ability to distinguish which patients may benefit more from lumbar spinal fusion, Khor et al. [6] introduced the SCOAP-CERTAIN tool in 2018. This tool aims to determine the probabilities of improvement in function, back pain, and leg pain for lumbar fusion candidates one year after surgery [6]. The models demonstrated good accuracy in both the development and internal validation cohorts, making them potentially suitable for integration into everyday clinical practice. Still, the importance of rigorously validating clinical prediction models (CPMs) on multicenter data from different populations (external validation) cannot be stressed enough: only through proper external validation can the reliability and clinical applicability of CPMs be ensured [7, 8]. Up to now, the SCOAP-CERTAIN tool has only been validated in a single Dutch center with 100 patients, revealing good discrimination but rather poor calibration [9]. As predicted probabilities hold more significance for clinicians and patients than binary classifications when making decisions about surgery, it might be premature to apply the current prediction tool in clinical practice.
Hence, we aimed to conduct a comprehensive external validation study involving 1115 patients from multiple centers to reevaluate the predictive ability of the Khor et al. [6] model regarding improvement in function and pain following lumbar spinal fusion for degenerative disease.
Materials and methods
Overview
A dataset comprising 1115 patients who underwent elective lumbar spinal fusion for degenerative disease from a multinational study (FUSE-ML) [10] was utilized to externally validate the machine learning-based model published by Khor et al. [6]. This model predicts improvement in functional outcome (Oswestry Disability Index, ODI), back, and leg pain. We compared the values predicted by their model with the true outcomes at 12 months after lumbar fusion in our cohort, providing a rigorous multicenter external validation of this model. Approval for the utilization of patient data in research was granted by individual local institutional review boards (IRBs) of FUSE-ML centers. Patients either gave informed consent, or the requirement for informed consent was waived as per the local IRB’s stipulations.
Patient population
Data were extracted from a prospective registry that included patients undergoing elective thoracolumbar pedicle screw placement for up to 6 levels, addressing degenerative pathologies such as spinal stenosis, spondylolisthesis, disc herniation, failed back surgery syndrome (FBSS), radiculopathy, or pseudarthrosis. Patients were excluded if the primary surgical indication was one of the following: infection, spinal tumor, fracture (traumatic or osteoporotic), or deformity surgery for scoliosis. Additionally, patients with moderate or severe scoliosis (coronal Cobb angle > 30° or Schwab classification sagittal modifier + or ++), those with missing outcome data at 12 months or a lack of informed consent, and those younger than 18 years were excluded. Our manuscript was developed in line with the guidelines outlined in the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement [11].
Data collection
Data preparation adhered to the guidelines established by Khor et al. [6]. Primary clinical and radiological information was obtained during the initial outpatient appointment. Subsequently, patients underwent a comprehensive clinical examination and magnetic resonance imaging (MRI). Collected parameters included the Oswestry Disability Index (ODI) and numeric rating scales for back (NRS-BP) and leg (NRS-LP) pain, as well as gender, age, smoking status, ethnicity, American Society of Anesthesiologists (ASA) grade, opioid consumption, presence of asthma, and prior spine surgery. In one center, functional outcome was assessed using the Core Outcome Measures Index (COMI), which was converted into the ODI according to a validated mapping function [12]. The clinical outcome parameters ODI, NRS-BP, and NRS-LP were collected again at the 12-month follow-up.
Outcome measures
ODI, NRS-BP, and NRS-LP were collected as standardized questionnaires with values ranging from 0 to 100 [13] and 0 to 10 [14], respectively, with higher values representing greater functional disability or pain severity. As in Khor et al. [6], we defined clinical improvement as achievement of the minimum clinically important change (MCIC), i.e., a ≥ 15-point reduction for ODI and a ≥ 2-point reduction for NRS-BP and -LP [15, 16].
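The MCIC definitions above reduce to a simple threshold rule on the change from baseline. The following minimal Python sketch is purely illustrative (the function and variable names are ours, not part of the SCOAP-CERTAIN tool):

```python
def achieved_mcic(baseline: float, followup: float, threshold: float) -> bool:
    """MCIC is reached when the score drops (improves) by at least `threshold` points."""
    return (baseline - followup) >= threshold

# Thresholds from the text: >= 15 points for ODI, >= 2 points for NRS-BP/-LP.
print(achieved_mcic(baseline=48, followup=30, threshold=15))  # True: ODI improved by 18
print(achieved_mcic(baseline=6, followup=5, threshold=2))     # False: NRS improved by only 1
```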
Statistical analysis
Missingness is reported in Supplementary Table 1. Missing predictor values, presumed to be missing at random, were imputed using a k-nearest-neighbor approach [17]. Patients with a baseline ODI below 15 or NRS below 2 were removed from the respective analyses, as specified by Khor et al. [6], because these patients cannot achieve the MCIC in the respective outcome. The three CPMs of the SCOAP-CERTAIN tool were then reconstructed using the reported model coefficients and intercepts. We calculated the area under the receiver operating characteristic curve (AUC) by comparing predicted probabilities with the observed MCIC outcome at 12 months. Calibration was evaluated by visual inspection of calibration curves and quantitatively through the calibration intercept and slope (optimal values 0 and 1, respectively). Calibration assesses the extent to which a model's predicted probabilities, spanning from 0 to 100%, align with the observed incidence of the binary endpoint, which represents the true posterior [18]. Additionally, we examined the expected/observed event ratio (E/O ratio), which describes the overall calibration of a prediction model [7], the Brier score [19], and the estimated calibration index (ECI) [20]. Likewise, the Hosmer–Lemeshow (HL) test was employed to assess goodness-of-fit, gauging whether observed event rates align with expected event rates within subgroups of the population [21]. The binary classification threshold was set at 0.5, as this cutoff is the one most likely used by Khor et al. [6] and also appears suitable for our dataset based on the "closest-to-(0,1)" criterion.
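For illustration, the core calibration metrics named above can be computed as in the pure-Python sketch below. This is not the R code used in the study; the Newton-Raphson fitting approach and all names are our own. The calibration intercept is estimated with the slope fixed at 1 (calibration-in-the-large), and the slope by refitting a logistic model on the logit of the predicted probabilities, both standard conventions:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibration_metrics(p, y, iters=25):
    """Calibration intercept (slope fixed at 1), calibration slope, E/O ratio,
    and Brier score for predicted probabilities p (strictly between 0 and 1,
    and not all identical) and binary outcomes y. Illustrative sketch only."""
    x = [logit(pi) for pi in p]
    n = len(y)

    # Calibration-in-the-large: fit y ~ a + offset(x) by 1-D Newton-Raphson.
    a = 0.0
    for _ in range(iters):
        mu = [sigmoid(a + xi) for xi in x]
        grad = sum(yi - mi for yi, mi in zip(y, mu))
        hess = sum(mi * (1.0 - mi) for mi in mu)
        a += grad / hess
    intercept = a

    # Calibration slope: fit y ~ a2 + b * x by 2-D Newton-Raphson.
    a2, b = 0.0, 0.0
    for _ in range(iters):
        mu = [sigmoid(a2 + b * xi) for xi in x]
        w = [mi * (1.0 - mi) for mi in mu]
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01  # nonzero as long as x varies
        a2 += (h11 * g0 - h01 * g1) / det
        b += (-h01 * g0 + h00 * g1) / det
    slope = b

    eo = sum(p) / sum(y)  # expected events / observed events
    brier = sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / n
    return intercept, slope, eo, brier
```

On a perfectly calibrated toy sample (five patients with predicted risk 0.2, one of whom has the event, and five with risk 0.8, four of whom do), this returns an intercept near 0, a slope near 1, an E/O ratio of 1, and a Brier score of 0.16.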
Following this, we compared the binary classifications to the actual observed MCIC outcome in confusion matrices and calculated various performance metrics, including accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the F1 Score. All continuous data are reported as mean ± standard deviation (SD). Whenever relevant, we offer bootstrapped 95% confidence intervals (CIs) using 1000 resamples with replacement. All analyses were performed using R Statistical Software (v 4.3.0; R Core Team 2023) [22].
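The confusion-matrix metrics and the percentile bootstrap described above can be sketched as follows. This is an illustrative Python rendering under our own naming, not the study's R code; undefined ratios (e.g., sensitivity in a resample without events) are returned as NaN rather than raising an error:

```python
import random

def _div(num, den):
    """Safe division: undefined ratios become NaN instead of crashing."""
    return num / den if den else float("nan")

def classification_metrics(p, y, threshold=0.5):
    """Confusion-matrix metrics after dichotomizing predicted probabilities."""
    pred = [1 if pi >= threshold else 0 for pi in p]
    tp = sum(1 for pr, yi in zip(pred, y) if pr == 1 and yi == 1)
    tn = sum(1 for pr, yi in zip(pred, y) if pr == 0 and yi == 0)
    fp = sum(1 for pr, yi in zip(pred, y) if pr == 1 and yi == 0)
    fn = sum(1 for pr, yi in zip(pred, y) if pr == 0 and yi == 1)
    return {
        "accuracy": _div(tp + tn, len(y)),
        "sensitivity": _div(tp, tp + fn),
        "specificity": _div(tn, tn + fp),
        "ppv": _div(tp, tp + fp),
        "npv": _div(tn, tn + fn),
        "f1": _div(2 * tp, 2 * tp + fp + fn),
    }

def bootstrap_ci(p, y, metric, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample patients with replacement,
    recompute the metric, and take the empirical alpha/2 quantiles."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(len(y)) for _ in y]
        stats.append(classification_metrics([p[i] for i in idx],
                                            [y[i] for i in idx])[metric])
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```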
Results
Overview
A total of 1115 patients were included in this study, with a mean (SD) age of 60.8 (12.5; range 19–89) years; 455 (40.8%) were male. Patient characteristics and surgical parameters are presented in Table 1. Lumbar spinal stenosis accounted for 55.4% of surgical indications, followed closely by spondylolisthesis at 53.7% (patients could have more than one indication). A high ASA grade was present in 29.5% of patients.
Patient‑reported outcome
At the 12-month postoperative mark, ODI scores showed a notable improvement, with a mean change of −21.8 ± 16.7 from baseline. NRS-BP and NRS-LP improved by −3.1 ± 2.4 and −2.5 ± 2.5, respectively. The MCIC was achieved by 68% of patients for ODI, and by 77% and 71% of patients for NRS-BP and NRS-LP, respectively. Table 1 summarizes the outcome measures for our external validation cohort.
Performance evaluation
Calibration
Table 2 lists the calibration metrics of the external validation in detail. In predicting the MCIC in ODI at 12 months, we observed a calibration intercept of 1.01 (95% CI 0.87–1.16) and a slope of 0.84 (95% CI 0.68–1.01), along with an HL p-value < 0.001 (Fig. 1A). The low E/O ratio of 0.75 (95% CI 0.71–0.79) suggests that the model tended to underestimate the likelihood of a favorable outcome. In our multicenter cohort, the Brier score of 0.22 (95% CI 0.21–0.23) indicated moderate accuracy of the probability forecasts.
Similarly, when predicting MCIC in NRS-BP, we identified a calibration intercept of 0.97 (95% CI 0.80–1.15), a slope of 0.87 (95% CI 0.70–1.08), and an E/O ratio of 0.97 (95% CI 0.93–1.00), with a corresponding HL p-value < 0.001 (Fig. 1B). For the prediction of MCIC in NRS-LP, we found a calibration intercept of 0.04 (95% CI −0.14 to 0.22) and a slope of 0.72 (95% CI 0.55–0.90); the HL p-value was < 0.001. Notably, the model appeared to overestimate the likelihood of a favorable outcome when applied to our cohort, as seen in the calibration plot (Fig. 1C) and further supported by the high E/O ratio of 1.16 (95% CI 1.13–1.20). Brier scores for back and leg pain indicated greater accuracy than for ODI, at 0.16 (95% CI 0.15–0.18) and 0.15 (95% CI 0.13–0.16), respectively.
Discrimination
Table 3 provides a comprehensive overview of discrimination measures, while Fig. 2A–C illustrate the AUC values for each individual center for the three models during external validation. When predicting MCIC for ODI, we achieved an AUC of 0.70 (95% CI 0.67–0.74), with a sensitivity of 0.63 (95% CI 0.59–0.66) and specificity of 0.68 (95% CI 0.62–0.73). Similarly, in predicting MCIC for NRS-BP, we obtained AUC values of 0.72 (95% CI 0.68–0.76), sensitivity at 0.84 (95% CI 0.82–0.87), and specificity at 0.45 (95% CI 0.38–0.52) during external validation. Finally, for predicting NRS-LP, the AUC reached 0.70 (95% CI 0.66–0.74), with a very high sensitivity of 0.96 (95% CI 0.94–0.97) and low specificity of 0.15 (95% CI 0.10–0.20).
Discussion
To address the significant variability in postoperative outcome after lumbar fusion surgery caused by the wide range of patient characteristics [5], CPMs have been developed to assist in the decision-making process [23]. Khor's models demonstrated good calibration and performance in their internal validation cohort [6], with comparable values in a small single-center external validation cohort [9]. Here, we performed a rigorous multicenter external validation of Khor's models (coined the SCOAP-CERTAIN tool) for predicting achievement of the MCIC in three different clinical outcomes at 12 months after lumbar fusion for degenerative disease. With data from the FUSE-ML study, we assessed the generalization of these CPMs and found that, while the models generalize moderately well in terms of discrimination (binary prediction), their calibration (continuous risk assessment) lacks robustness, even though the cohorts appear comparable.
It is notoriously difficult to predict treatment response for patients undergoing lumbar spinal fusion for degenerative disease. While some diagnoses, such as isthmic spondylolisthesis, represent a relatively clear indication for fusion, for others, such as low-grade degenerative spondylolisthesis with stenosis, the benefit of adding fusion is less clear [24, 25]. The most extreme example certainly is chronic low back pain with concomitant discopathy [26]. While some individual patients with this pathology do profit from fusion, an unselected population does not: randomized studies consistently indicate that, on the whole, fusion surgery does not yield significantly superior outcomes compared to conservative treatment for chronic low back pain [27]. Although surgery may not exhibit a clear advantage over conservative approaches in unselected patients with chronic low back pain, specific subsets of patients can genuinely benefit [28]. The critical factor for success in degenerative spine surgery lies in meticulous patient selection.
In the past, different methods were established to help select the best treatment option for the individual patient. From discography to pantaloon casting to radiological modifiers such as Modic-type endplate changes, many potential predictors of surgical success have been evaluated, but often with very limited predictive ability [26, 28]. Initially, mostly radiological or physician-based outcomes were assessed; over time, patient-reported outcome measures (PROMs) such as the ODI [29] were implemented and validated in an attempt to quantify and weigh symptoms and thereby justify the risks and benefits of a potential surgery [30]. This opened up the possibility of truly personalized medicine: the aim of medical decision-making today is to consider every aspect of a patient's physical and mental characteristics in order to choose the treatment that best fulfills a wide range of demands, such as symptom relief for the patient, healing or preventing progression of disease, and balancing healthcare costs by avoiding unnecessary diagnostics, treatments, and complications [31, 32]. Another delicate aspect complicating medical decision-making is the wide range of symptoms that can be present in patients with degenerative lumbar spine disease, e.g., facet-mediated, discogenic, or myofascial pain [33], among others. Ideally, we could pinpoint specific symptoms or patient characteristics for which lumbar fusion is known to provide relief. With more information on the patient, e.g., on comorbidities, to weigh the general risks of surgery against the expected benefit, this could lead to improved risk–benefit counseling in clinics [34].
Thus, the aim of CPMs in the surgical field is to identify which patients will benefit from a certain intervention and which will not. Khor et al. [6] published an internally validated CPM tool (SCOAP-CERTAIN) that aims to assist surgical decision-making by predicting which patients scheduled for lumbar spinal fusion for degenerative disease are most likely to show significant 12-month improvement in functional outcome and pain severity. Rigorous multicenter, multicultural external validation is a crucial step before the clinical implementation of CPMs [7, 8, 35]. To assess the generalization of a CPM, calibration and discrimination need to be quantified [36]. Discrimination refers to a model's capacity to correctly categorize patients in a binary fashion, namely into those experiencing the MCIC and those without clinically relevant improvement. The model's capability to generate accurate predicted probabilities (between 0 and 1) that closely align with the true posterior (observed frequency) is, in turn, termed calibration. The SCOAP-CERTAIN tool had previously been evaluated in a small single-center external validation study of Dutch patients, demonstrating adequate discrimination but only fair calibration [9]. In a previous study of the FUSE-ML study group, a second, simpler CPM for the same outcomes was developed, with the goal of achieving similar predictive power with fewer input variables [10]. This goal was broadly achieved, and within that study, a small external validation of the SCOAP-CERTAIN tool (in three centers with a total of 298 patients, to compare the performance of both CPMs) was carried out, again showing relatively robust discrimination but only fair calibration of both models [10].
Although CPMs in degenerative spine surgery could in theory be highly beneficial in the clinical context, rigorous external validation is necessary first to make sure that models are not "let loose too early" [8, 34]. In particular, models should be tested not only in one or two small cohorts but in a wide range of patient populations from multiple countries and continents. If performance then proves robust, it can reasonably be assumed that the CPM will achieve the expected predictive performance in real-world patients, and the model can safely be rolled out. In the present study, we performed such an extensive external validation. With AUCs between 0.70 and 0.72 for ODI, NRS-BP, and NRS-LP, we demonstrated good discrimination, comparable to the values reported in Khor et al.'s initial internal validation study (0.66–0.79) [6]. Yet calibration, evaluated through diverse metrics, again demonstrated only moderate performance, as in the previous small external validation studies. At internal validation, Khor et al. had documented calibration intercepts ranging from −0.02 to 0.16, along with slopes spanning 0.80–1.05, whereas we found a wider range of intercepts from 0.04 to 1.01 and less well-calibrated slopes of 0.72–0.87, even though the outcome distribution was similar to that of the development cohort (calibration intercepts are known to depend strongly on differences in outcome distribution) [37]. In summary, there was substantial heterogeneity in the observed calibration slopes, along with a higher ECI, a measure of overall calibration defined as the average squared difference between the predicted probabilities and their grouped estimated observed probabilities [18], and clearly worse goodness-of-fit by the method of Hosmer and Lemeshow [21].
The HL method divides the sample into (usually 10) groups of predicted probabilities and compares observed to expected event counts using a chi-square distribution, with a p-value > 0.2 usually seen as an indication of fair calibration and goodness-of-fit [18, 21]. Of course, as is the goal of external validation, our cohort represents a much more heterogeneous population than the development cohort, now including European and Asian individuals, which explains some of the lack of generalization in terms of calibration. In the realm of CPMs, calibration arguably carries a more significant role than discrimination alone [37]. This is because clinicians and patients are typically more concerned with the predicted probability of a specific endpoint than with a binary classification; individual patients, after all, are not binary, but carry a spectrum of expected risks and benefits [7]. Hence, insufficient calibration poses a significant obstacle to the clinical and external applicability of prediction models. Another potential explanation for the poor generalization in terms of calibration lies in differing definitions of input variables: although our data collection adhered strictly to the definitions provided by Khor et al. [6], institutional protocols and inter-rater assessments still vary. This is one of the general limitations of CPMs based on tabulated medical data: because data must first undergo multiple stages of summarization and simplification by human healthcare providers, the overall predictive power can quickly reach "ceiling effects" due to input heterogeneity. This is another reason why external validation is so crucial: to test whether CPMs work just as well when applied in a real-world environment (effectiveness vs. efficacy).
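The decile-based procedure just described can be sketched in a few lines of Python. This is illustrative code under our own naming, not the implementation used in the study; the closed-form p-value used here is exact only because the degrees of freedom (groups − 2 = 8) are even:

```python
import math

def hosmer_lemeshow(p, y, groups=10):
    """Hosmer-Lemeshow goodness-of-fit test: chi-square statistic over
    equal-sized groups of ranked predicted probabilities (0 < p < 1),
    with a closed-form chi-square p-value valid for even df = groups - 2."""
    pairs = sorted(zip(p, y))  # rank patients by predicted probability
    n = len(pairs)
    chi2 = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        obs = sum(yi for _, yi in chunk)    # observed events in group
        exp_ = sum(pi for pi, _ in chunk)   # expected events in group
        m = len(chunk)
        # Events and non-events combined: (O - E)^2 / (E * (1 - E/m)).
        chi2 += (obs - exp_) ** 2 / (exp_ * (1.0 - exp_ / m))
    df = groups - 2
    # Chi-square survival function, exact for even df:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    p_value = math.exp(-chi2 / 2.0) * sum(
        (chi2 / 2.0) ** i / math.factorial(i) for i in range(df // 2))
    return chi2, p_value
```

On a synthetic, perfectly calibrated cohort the statistic is essentially zero and the p-value close to 1, whereas inverting the outcome rates drives the p-value toward zero, mirroring the lack of fit reported in our results.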
In the future, direct inclusion of source data (such as MRI) without human coding, or automated data collection through natural language processing, might somewhat alleviate this bottleneck [38].
Still, even if not perfectly calibrated in a rigorous external validation study, the models published by Khor et al. [6] are admirable and show good generalization overall, especially in terms of discrimination, with no signs of overfitting. Overfitting manifests as a relevant difference between training and testing performance in terms of discrimination [35]; for a well-fitted model, out-of-sample performance is typically comparable to or slightly worse than training performance, and the discrimination observed in our external validation fits this pattern well. We conclude that the SCOAP-CERTAIN model can safely be applied in clinical practice, keeping in mind that predicted probabilities (calibration) should only be used as rough estimates and that binary predictions, while generalizing well (discrimination), are still no more accurate than an AUC of around 0.70 implies.
In the end, in the realm of degenerative spine surgery, well-validated CPMs such as the SCOAP-CERTAIN [6] or FUSE-ML [10] models should be used only cautiously, as rough estimates offering an objective "second opinion" in the risk–benefit counseling of patients, and never as absolute red or green lights for surgical indication. We suggest that future models also be capable of predicting longer-term prognosis, as longer-term outcomes will improve the robustness of outcome data in lumbar fusion patients. This could be achieved by incorporating more extended follow-up data and reducing short-term variability, leading to a more comprehensive understanding of patient trajectories, which is essential for effective clinical decision-making and improved calibration.
Additionally, it is crucial that future studies, as in the present external validation, report key metrics such as sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for the models assessed. Reporting these metrics enables better assessment and validation of the predictions, thereby enhancing the reliability and applicability of CPMs in practice.
Limitations
Regarding the primary surgical indication, our cohort mostly showed lumbar spinal stenosis, spondylolisthesis, and discogenic low back pain, whereas in Khor's cohort radiculopathy was the leading diagnosis, followed by stenosis and spondylolisthesis [6]. Of course, surgical indications and especially the chosen techniques vary between centers, which is exactly why multicenter external validation is important. Compared to the development cohort, we also included lateral techniques, which broadens the range of included patients. We used a mixed cohort (FUSE-ML) of partially prospectively and partially retrospectively collected data. These two collection strategies are known to have a relevant influence on the data (especially on complications, which fortunately are not a topic here) as well as on missingness, and could therefore affect the final analysis [39]. On the other hand, the fact that the models still generalized relatively well on these heterogeneous data is precisely the point of external validation and further demonstrates the robustness of the Khor et al. [6] models. Due to the lack of long-term (> 2 years) follow-up data, even with good calibration and discrimination performance, we are only able to predict short- and mid-term outcomes; more evaluation of CPMs on long-term data is necessary. The validated models also do not predict surgical risks such as perioperative complications or long-term adjacent segment degeneration, information that would be particularly useful in risk–benefit discussions. Finally, since neither the FUSE-ML nor the SCOAP-CERTAIN models provide a prognosis for the natural history or conservative treatment of these degenerative conditions, they offer only half of the answer when deciding between surgical and conservative treatment strategies.
Conclusion
Utilizing data from a multinational registry, we externally validated the SCOAP-CERTAIN prediction tool. While the models demonstrated good discrimination, the calibration of predicted probabilities was only fair, necessitating caution in applying them in clinical practice. We propose that future CPMs consider predicting longer-term prognosis for this patient population, and we emphasize the importance of rigorous external validation, robust calibration, and sound reporting.
References
Kepler CK et al (2014) National trends in the use of fusion techniques to treat degenerative spondylolisthesis. Spine 39(19):1584–1589. https://doi.org/10.1097/BRS.0000000000000486
Ivar Brox J et al (2003) Randomized clinical trial of lumbar instrumented fusion and cognitive intervention and exercises in patients with chronic low back pain and disc degeneration. Spine 28(17):1913–1921. https://doi.org/10.1097/01.BRS.0000083234.62751.7A
Fairbank J, Frost H, Wilson-MacDonald J, Yu L-M, Barker K, Collins R (2005) Randomised controlled trial to compare surgical stabilisation of the lumbar spine with an intensive rehabilitation programme for patients with chronic low back pain: the MRC spine stabilisation trial. BMJ 330(7502):1233. https://doi.org/10.1136/bmj.38441.620417.8F
Birkmeyer NJO et al (2002) Design of the spine patient outcomes research trial (SPORT). Spine 27(12):1361–1372. https://doi.org/10.1097/00007632-200206150-00020
Weinstein JN et al (2009) Surgical compared with nonoperative treatment for lumbar degenerative spondylolisthesis: four-year results in the spine patient outcomes research trial (SPORT) randomized and observational cohorts. J Bone Jt Surg-Am Vol 91(6):1295–1304. https://doi.org/10.2106/JBJS.H.00913
Khor S et al (2018) Development and validation of a prediction model for pain and functional outcomes after lumbar spine surgery. JAMA Surg 153(7):634. https://doi.org/10.1001/jamasurg.2018.0072
Riley RD et al (2016) External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ. https://doi.org/10.1136/bmj.i3140
Staartjes VE, Kernbach JM (2020) Significance of external validation in clinical machine learning: let loose too early? Spine J Off J North Am Spine Soc 20(7):1159–1160. https://doi.org/10.1016/j.spinee.2020.02.016
Quddusi A et al (2020) External validation of a prediction model for pain and functional outcome after elective lumbar spinal fusion. Eur Spine J 29(2):374–383. https://doi.org/10.1007/s00586-019-06189-6
Staartjes VE et al (2022) FUSE-ML: development and external validation of a clinical prediction model for mid-term outcomes after lumbar spinal fusion for degenerative disease. Eur Spine J Off Publ Eur Spine Soc Eur Spinal Deform Soc Eur Sect Cerv Spine Res Soc 31(10):2629–2638. https://doi.org/10.1007/s00586-022-07135-9
Collins GS, Ogundimu EO, Altman DG (2016) Sample size considerations for the external validation of a multivariable prognostic model: a resampling study: sample size considerations for validating a prognostic model. Stat Med 35(2):214–226. https://doi.org/10.1002/sim.6787
Mannion AF et al (2022) Development of a mapping function (“crosswalk”) for the conversion of scores between the oswestry disability index (ODI) and the core outcome measures index (COMI). Eur Spine J 31(12):3337–3346. https://doi.org/10.1007/s00586-022-07434-1
Fairbank JCT, Pynsent PB (2000) The oswestry disability index. Spine 25(22):2940–2953. https://doi.org/10.1097/00007632-200011150-00017
Childs JD, Piva SR, Fritz JM (2005) Responsiveness of the numeric pain rating scale in patients with low back pain. Spine 30(11):1331–1334. https://doi.org/10.1097/01.brs.0000164099.92112.29
Fekete TF, Haschtmann D, Kleinstück FS, Porchet F, Jeszenszky D, Mannion AF (2016) What level of pain are patients happy to live with after surgery for lumbar degenerative disorders? Spine J 16(4):S12–S18. https://doi.org/10.1016/j.spinee.2016.01.180
Ostelo RWJG et al (2008) Interpreting change scores for pain and functional status in low back pain: towards international consensus regarding minimal important change. Spine 33(1):90–94. https://doi.org/10.1097/BRS.0b013e31815e3a10
Templ M, Kowarik A, Alfons A, Prantner B (2019) VIM: visualization and imputation of missing values. https://CRAN.R-project.org/package=VIM Accessed 5 Jan 2020
Staartjes VE, Regli L, Serra C (2022) Machine learning in clinical neuroscience: foundations and applications. In: Acta neurochirurgica supplement, vol 134. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-030-85292-4
Brier GW (1950) Verification of forecasts expressed in terms of probability. Mon Weather Rev 78(1):1–3. https://doi.org/10.1175/1520-0493(1950)078%3c0001:VOFEIT%3e2.0.CO;2
Van Hoorde K, Van Huffel S, Timmerman D, Bourne T, Van Calster B (2015) A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J Biomed Inform 54:283–293. https://doi.org/10.1016/j.jbi.2014.12.016
Hosmer DW, Lemeshow S, Sturdivant RX (2013) Applied logistic regression. In: Wiley series in probability and statistics. Wiley. https://doi.org/10.1002/9781118548387
R Core Team (2023) R: A language and environment for statistical computing. R foundation for statistical computing, Vienna, Austria
Senders JT et al (2018) Machine learning and neurosurgical outcome prediction: a systematic review. World Neurosurg 109:476–486. https://doi.org/10.1016/j.wneu.2017.09.149
Ghogawala Z, Barker FG, Benzel EC (2016) Fusion surgery for lumbar spinal stenosis. N Engl J Med 375(6):600–601
Försth P et al (2016) A randomized, controlled trial of fusion surgery for lumbar spinal stenosis. N Engl J Med 374(15):1413–1423. https://doi.org/10.1056/NEJMoa1513721
Staartjes VE, Vergroesen P-PA, Zeilstra DJ, Schröder ML (2018) Identifying subsets of patients with single-level degenerative disc disease for lumbar fusion: the value of prognostic tests in surgical decision making. Spine J 18(4):558–566. https://doi.org/10.1016/j.spinee.2017.08.242
Mannion AF, Brox J-I, Fairbank JC (2016) Consensus at last! Long-term results of all randomized controlled trials show that fusion is no better than non-operative care in improving pain and disability in chronic low back pain. Spine J 16(5):588–590. https://doi.org/10.1016/j.spinee.2015.12.001
Willems P (2013) Decision making in surgical treatment of chronic low back pain: the performance of prognostic tests to select patients for lumbar spinal fusion. Acta Orthop 84(sup349):1–37. https://doi.org/10.3109/17453674.2012.753565
Van Hooff ML, Mannion AF, Staub LP, Ostelo RWJG, Fairbank JCT (2016) Determination of the oswestry disability index score equivalent to a “satisfactory symptom state” in patients undergoing surgery for degenerative disorders of the lumbar spine—a spine tango registry-based study. Spine J 16(10):1221–1230. https://doi.org/10.1016/j.spinee.2016.06.010
Falavigna A et al (2017) Current status of worldwide use of patient-reported outcome measures (PROMs) in spine care. World Neurosurg 108:328–335. https://doi.org/10.1016/j.wneu.2017.09.002
Kim JS et al (2018) Examining the ability of artificial neural networks machine learning models to accurately predict complications following posterior lumbar spine fusion. Spine 43(12):853–860. https://doi.org/10.1097/BRS.0000000000002442
Ehlers AP et al (2017) Improved risk prediction following surgery using machine learning algorithms. EGEMs Gener Evid Methods Improve Patient Outcomes 5(2):3. https://doi.org/10.13063/2327-9214.1278
Mattei TA, Rehman AA, Teles AR, Aldag JC, Dinh DH, McCall TD (2017) The ‘lumbar fusion outcome score’ (LUFOS): a new practical and surgically oriented grading system for preoperative prediction of surgical outcomes after lumbar spinal fusion in patients with degenerative disc disease and refractory chronic axial low back pain. Neurosurg Rev 40(1):67–81. https://doi.org/10.1007/s10143-016-0751-6
Steinmetz MP, Mroz T (2018) Value of adding predictive clinical decision tools to spine surgery. JAMA Surg. https://doi.org/10.1001/jamasurg.2018.0078
Kernbach JM, Staartjes VE (2022) Foundations of machine learning-based clinical prediction modeling: part II—generalization and overfitting. machine learning in clinical neuroscience. In: Staartjes VE, Regli L, Serra C (eds) Acta neurochirurgica supplement, vol 134. Springer International Publishing, Cham, pp 15–21. https://doi.org/10.1007/978-3-030-85292-4_3
Collins GS, Reitsma JB, Altman DG, Moons KGM (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 350(jan07 4):g7594–g7594. https://doi.org/10.1136/bmj.g7594
Staartjes VE, Kernbach JM (2020) Importance of calibration assessment in machine learning-based predictive analytics. J Neurosurg Spine 32(6):985–987
Staartjes VE, Stienen MN (2019) Data mining in spine surgery: leveraging electronic health records for machine learning and clinical research. Neurospine 16(4):654–656. https://doi.org/10.14245/ns.1938434.217
Nagurney JT (2005) The accuracy and completeness of data collected by prospective and retrospective methods. Acad Emerg Med 12(9):884–895. https://doi.org/10.1197/j.aem.2005.04.021
Acknowledgements
We thank the patients whose anonymized data were used for this research.
Funding
Open access funding provided by University of Zurich. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflict of interest
The authors declare that the article and its content were composed in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Grob, A., Rohr, J., Stumpo, V. et al. Multicenter external validation of prediction models for clinical outcomes after spinal fusion for lumbar degenerative disease. Eur Spine J 33, 3534–3544 (2024). https://doi.org/10.1007/s00586-024-08395-3