Background

Esophageal cancer (EC) is the seventh most common cancer cause of death in male population worldwide [1]. China accounts for more than half of the world’s new cases and EC-related deaths, with more than 90% diagnosed EC being esophageal squamous cell carcinoma (ESCC) [2]. Surgery, chemotherapy and radiotherapy are the cornerstone treatments of EC [3, 4]. However, outcomes are still poor with a 5-year survival rate of 10–15% [5]. The emerging targeted drugs used to treat EC are only targeting HER2 or vascular endothelial growth factor [68], and the therapeutic effect of improved traditional treatments with added targeted drugs is still unsatisfactory with a 5-year survival rate of 30–40% for ESCC [9]. Therefore, there is a high clinical need for novel and more effective treatment options for EC patients.

In recent years, the study of KEYNOTE-028 and KEYNOTE-180 first confirmed the efficacy and safety of pablizumab in the treatment of advanced EC [10, 11]. Whereafter, in a larger sample size, KEYNOTE-181 established the position of pablizumab in the treatment of advanced EC [12]. At present, a number of studies have been performed to explore the efficacy and safety of immunotherapy combined with chemotherapy as first-line and post-line treatment of advanced ESCC [1316]. The comprehensive positive score (CPS), tumor proportion score (TPS) are immunohistochemical markers for evaluating the expression of programmed death receptor ligand 1 (PD-L1) in tumors. However, the precision of these biomarkers was unsatisfying. Therefore, more reliable biomarkers for predicting the efficacy of immunotherapy for EC is in urgent need.

With the rapid development of artificial intelligence (AI) in the field of medical imaging, radiographic characteristics of tumors referred to as ‘radiomics’ have shown success in immunotherapeutic response prediction for different tumor types [1719]. To the best of our knowledge, there is no evidence yet in EC. In this study, we aimed to evaluate the potential predictive value of CT-derived radiomics in advanced ESCC patients receiving immune-checkpoint inhibitor plus chemotherapy (ICI + CT).

Methods

Study design

A total of 64 patients with advance inoperable ESCC receiving 200 mg every 3 weeks of Sintilimab plus Docetaxel (60 mg/m2) and Carboplatin (AUC = 5) at two centers between January 2019 and June 2020 were included in this study approved by the two institutional review boards. All patients in two centers had undergone MDT study before starting treatment. Informed consent was waived. In this study patients were confirmed by biopsy and immunohistochemistry of the original tumor tissue. All the enrolled patients were first-visit and prior to treatment. Patients who had never received cancer related treatments including radiotherapy, chemotherapy, comprehensive treatment and surgery and those who lacked CT imaging data and necessary clinical information before the initial treatment (immunotherapy plus chemotherapy) were excluded from this study. Exclusion criteria also included patients with non-squamous cell carcinoma including adenocarcinoma and signet ring cell carcinoma and those who discontinued treatment due to adverse events. Flow chart of patient enrollment is shown in Fig. 1. For patients’ clinical characteristics, information of age, gender, Body Mass Index (BMI), clinical TNM stage, hemoglobin, blood albumin, leucocyte, C-reactive protein and underlying diseases was acquired from electronic medical record system. BMI was calculated based on height and weight. Clinical TNM stage was confirmed by pre-treatment gastroscopy, CT examination, etc.

Fig. 1
figure 1

Flowchart of Patient enrollment

Response kinetics and scan protocol

Contrast enhanced computed tomography (CE-CT) scans were acquired before (baseline) and around six weeks (two cycles) after start of treatment (follow-up). Treatment response was evaluated by assessing the relative change in diameter between baseline and follow-up, using RECIST 1.1 criteria [20]. Patients were divided into responders [complete/partial response disease] and non-responders [stable and progressive disease] according to RECIST. For progressive disease, pseudoprogression was confirmed by follow-up observation.

All preoperative enhanced CT images were obtained with multidetector CT scanners during inspiration. Detailed information of the CT scanners including manufacturer, country of origin, tube voltage, slice thickness and spacing was shown in Supplementary Table 1. Iopromide (300 mg I/m1, Schering Pharmaceutical Ltd) was used as the contrast agent for enhanced scanning protocol, and 80–100 ml was injected at 3–4 ml/s flow rate.

Lesion segmentation and Radiomics features extraction

All enhanced CT images were manually segmented with an open-source software ITK-SNAP (http://www.itksnap.org/pmwiki/pmwiki.php) for feature extraction. 2D ROI was selected as the slice with maximum axial diameter of the tumor, and 3D ROI was segmented slice by slice on the whole volume of the lesions.

To correct variability from spatial information in three axes (x, y, z) and different CT protocols, all enrolled CT images were resampled to a same isotropic voxel spacing. Considering the distribution of our data, we resampled the 2D ROIs to 1 × 1 mm2, and the 3D ROIs to 1 × 1 × 1 mm3 to balance between the loss of in-plane information and the interpolation of out-of-plane information. Afterwards, the CT radiomics features, from 2D and 3D ROIs respectively, were extracted with an open-source python platform Pyradiomics (version 2.1.2, https://pyradiomics.readthedocs.io/en/latest/#). Features used in this study included 14 shape-based features, 18 first order statistics features and 68 texture features containing the gray-level co-occurrence matrix (GLCM, 22 features), gray level run length matrix (GLRLM, 16 features), gray level size zone matrix (GLSZM, 16 features) and gray level dependence matrix (GLDM, 14 features). Besides the original images, eight filters were also generated for feature extraction, including wavelet transform filter. All the categories of features other than shape originated from the original and filtered images were calculated. Therefore, in this study, a total of (18 + 68 + 14) + (18 + 68) *8 = 788 features were statistically analyzed.

To control the potential bias caused by various imaging acquisition protocols on the prediction efficacy of the model, ComBat correction method (https://github.com/Jfortin1/ComBatHarmonization) was applied to 2D and 3D ROIs, resulting in four different groups of features for comparison: (1) 2D uncorrected radiomics features; (2) 2D corrected radiomics features; (3) 3D uncorrected radiomics features; (4) 3D corrected radiomics features.

Feature selection

Feature selection was performed separately for each group of features. Three steps were applied to reduce dimensionality: (1) features with variance larger than 0.8 were included for further analysis; (2) univariate feature selection was done by ANOVA (continuous variable) or chi-square test (discrete variable) to explore the associations between features and treatment response. The features with p value>0.05 would be excluded from further analysis; (3) the most significant features were selected by the least absolute shrinkage and selection operator (LASSO) method. Since the total patient number was limited, the nonzero feature coefficients ranking the first five were selected for each group to avoid overfitting.

Prediction models and workflow

After feature selection, traditional machine learning algorithms, including support vector machine (SVM), k nearest neighbors, random forest, decision tree (DT), logistic regression (LR), were applied to build prediction radiomics models for each feature group. The performance of the models was compared by using 5-fold cross-validation in the validation cohort, with the best model being selected. All the patients were randomly split into 80% for training and the remaining 20% for validation, with 100 iterations. All feature selection and radiomics algorithm selection were based on the data in the training dataset to ensure independence from validation dataset.

Radiomic nomogram was built based on the multivariable logistic analysis of the selected radiomics features in the training group. Calibration curves accompanied by the Hosmer–Lemeshow test were plotted to evaluate the effectiveness of the radiomics nomogram. Decision curve analysis was conducted to determine the clinical usefulness of the radiomics nomogram by quantifying the net benefits at different threshold probabilities in the validation dataset. Flow chart of radiomics nomogram building was illustrated in Fig. 2.

Fig. 2
figure 2

Flowchart of feature selection and radiomics nomogram building. (A) Lesion segmentation and 2D ROI and 3D ROI segmentation; (B) A total of 788 selected features for 2D and 3D ROI respectively; (C) ComBat correction was applied to minimize the potential bias on the results due to different scan protocols of the 5 different CT scanners; (D) Dimension reduction for features selection; (E) Select the optimal algorithm for radiomics model building. The best one was selected by using 5-fold cross-validation in the validation cohort. All the patients were randomly split into 80% for training and the remaining 20% for validation, with 100 iterations; (F) Radiomic nomogram was built on the optimal algorithm. Calibration curves and decision curves were used to evaluate the effectiveness of the radiomics nomogram

Statistical analysis

Statistical analyses were performed by using SPSS 22.0 (IBM, USA). Variables were described as frequency (n%). The chi-square test was used to compare patients’ basic information between groups (responders versus non-responders) and P < 0.05 was considered statistically significant. All machine learning analyses were performed by using the Python package scikit-learn (0.19.0), and statistical plots were generated by R software (3.6.1, http://www.R-project.org). Area under the Receiver-Operating Characteristic Curves (AUCs) were calculated to evaluate the performance of the algorithms for each model, and the Youden Index was used to generate the optimal threshold to convert probabilities into binarized labels. Statistical metrics, including accuracy, sensitivity, specificity, NPV (Negative Predictive Value), PPV (Positive Predictive Value) and AUC were also calculated to evaluate the performance of the ultimate selected algorithm in the training cohort and the validation cohort for the different radiomics models. Wilcoxon rank test with Bonferroni correction was applied for multiple comparisons, and p < 0.0125 was considered statistically significant.

Results

Basic Clinicopathological characteristics

A total of 64 patients were included in our study, including 32 (50%) responders and 32 (50%) non-responders. Patients’ clinicopathological characteristics were given in Table 1. No significant difference was observed in underlying diseases between non-responders and responders both in the training and validation cohorts, with P value>0.05 respectively.

Table 1 Clinicopathological characteristics of advanced ESCC patients treated with ICI + CT

Features and optimal Radiomics algorithm selection

For the four different radiomics models including 2D uncorrected, 2D corrected, 3D uncorrected and 3D corrected models, feature selections were performed respectively, and the selected features and their descriptions were shown in Table 2. Algorithms of SVM, KN, RF, and LR were applied to build radiomics models for 2D and 3D ROIs by using selected features from the training cohort, and their performances were compared. The results showed that relatively higher AUC (0.804, 95% CI: 0.800–0.822) could be obtained by using SVM algorithm for the training dataset (Supplementary Table 2). Finally, SVM with the best performance was selected for further evaluation of the performance of radiomics models.

Table 2 Selected features of the four different models

Radiomics models performance based on SVM algorithm

To evaluate the performance of our models in classifying patients according to their treatment response, we used the SVM algorithm. Good performance of the four different radiomics models using SVM algorithm was observed for the probability of responders (Table 3). The results showed that the 2D corrected radiomics model yielded the optimal performance with an AUC of 0.818 [95% CI, 0.797–0.829], an accuracy of 80.4% (95% CI, 79.3–81.5%), a sensitivity of 72.7% (95% CI, 70.6–74.2%), a specificity of 88.6% (95% CI, 85.5–90.0%), a NPV of 79.5% (95% CI, 78.4–80.3%), a PPV of 91.7% (95% CI, 89.6–92.5%) in the training cohort, and an AUC of 0.787 [95% CI, 0.752–0.806], an accuracy of 79.6% (95% CI, 77.0–80.6%), a sensitivity of 71.4% (95% CI, 67.3–76.7%), a specificity of 87.2% (95% CI, 84.1–90.1%), a NPV of 75.3% (95% CI, 72.1–78.6%), a PPV of 84.8% (95% CI, 81.3–87.5%) in the validation cohort.

Table 3 Performance evaluation of the radiomic models using SVM algorithm in the training and validation cohort

The performance of the four different radiomics models was compared by AUCs as shown in Fig.3. The 2D models outperformed the 3D models (2D uncorrected vs. 3D uncorrected, p < 0.0001; 2D corrected vs. 3D corrected, p < 0.0001) in the training cohort, which was confirmed in the validation cohort (2D uncorrected vs. 3D uncorrected, p < 0.0001; 2D corrected vs. 3D corrected, p < 0.0001). When the ComBat correction was used, the performance of 2D models was better (p = 0.0059) in the training cohort, and significantly better (p < 0.0001) in the validation cohort. There was no improvement for 3D models when integrated with the ComBat correction (training cohort, p = 0.17; validation cohort, p = 0.018).

Fig. 3
figure 3

Comparison of AUCs between the four different radiomics models based on SVM algorithm in the training and validation cohort

Development, performance and validation of individualized Radiomics nomogram

Quantitative nomograms for predicting the probability of responders were constructed separately for the four groups of features, of which the one based on the 2D corrected model is shown in Fig.4 (A).

Fig. 4
figure 4

Development and performance of the radiomics nomogram. (A) Nomogram based on the 2D corrected radiomics features. (B) Calibration curves of the nomograms built on 3D uncorrected, 3D corrected, 2D uncorrected 2D corrected radiomics features in the training cohort. (C) Calibration curves of the nomograms built on 3D uncorrected, 3D corrected, 2D uncorrected 2D corrected radiomics features in the validation cohort. The calibration curves suggesting the perfect match between the actual (Y-axis) and nomogram-predicted (X-axis) responders. (D) Decision curves showed relatively good performance for the models in terms of clinical application and indicated that all the models added more benefit than either the treat-all or treat-none scheme within the threshold between 30 and 60%. Moreover, the 2D corrected model achieved the highest benefit if the threshold probability of a patient was between 50 and 70%

The calibration curves of the four different radiomics models [Fig. 4 (B) & (C)] estimating the probability of responders demonstrated good agreement between prediction and observation in the training cohort and validation cohort. For the 2D corrected radiomics model, the Hosmer–Lemeshow test yielded a nonsignificant P value of 0.160 in the training cohort and 0.478 in the validation cohort, suggesting the perfect match between the actual (Y-axis) and nomogram-predicted (X-axis) responders. The 2D corrected model also achieved good discrimination performance with AUC of 0.843 (95% CI, 0.736–0.950) within the training cohort and 0.914 (95% CI, 0.775–1.000) in the validation cohort (Table 4). For 2D uncorrected model, 3D corrected model and 3D uncorrected model, AUCs were 0.794 (95% CI, 0.666–0.921), 0.658 (95% CI, 0.502–0.813) and 0.662 (95% CI, 0.509–0.816) within the training cohort and 0.898 (95% CI, 0.721–1.000), 0.670 (95% CI, 0.511–0.849) and 0.677 (95% CI, 0.499–0.850) in the validation cohort, respectively.

Table 4 Performance evaluation of the nomogram

Clinical use

The decision curve was used to compare the benefit of the four different radiomics nomogram, treat-all and treat-none scheme, as shown in Fig. 4 (D). The results showed relatively good performance for the models in terms of clinical application and indicated that all the models added more benefit than either the treat-all or treat-none scheme within the threshold between 30 and 60%. Moreover, the 2D corrected model achieved the highest benefit if the threshold probability of a patient was between 50 and 70%.

Discussion

This study aimed to evaluate the prediction efficacy of pre-therapeutic CT imaging based radiomics models in treatment response of patients with advanced ESCC receiving anti-PD-1 antibodies plus chemotherapy. In the new era of artificial intelligence (AI), radiographic characteristics automatically calculated by computer is more objective and makes more accurate quantitative analysis possible [2124]. Our study is the first attempt to predict treatment efficacy of ICI + CT in advanced ESCC prior to treatment using CT radiomics model. The quantitative approach has the potential to identify the responders before treatment.

Due to the long-time debate on whether to use one-slice 2D annotation or whole-volume 3D annotation especially for advanced cancer [2527], in our study, the comparison between 2D and 3D radiomic features was also performed. We found that 2D radiomic features significantly outperformed 3D features, which was similar to the reported [2627]. In a multicenter study of advanced gastric cancer [27], the performances of 2D and 3D CT radiomic features were compared in discriminating lymph node metastasis, lymphovascular invasion as well as pT stages’ classification. They found that 2D model outperformed 3D model with higher AUCs regarding the above three tasks despite different resampling spacings. Similar findings were also reported by another study [26] in which the prognostic prediction performances were compared between 2D and 3D CT radiomics features in patients with non-small cell lung cancer (NSCLC). They found that 2D Cox model had a higher C-index compared with 3D Cox model. The results of our study showed that 2D models performed significantly better than 3D ones, which might be attributed to more noise of 3D ROIs originated from multi-slice manual annotations and inconsistent resolutions of the transverse plane and z-plane [2627]. Therefore, in this scenario, 2D models are recommended in ESCC radiomics researches for the better performance and time-saving annotations.

The ComBat function compensation method is a data-driven method correcting for differences in features caused by the various imaging protocols [2829]. ComBat correction was applied in our study to control potential bias on the results caused by different CT scanning schemes such as tube voltage, reconstruction kernel, slice thickness, and in-plane resolution. This method showed efficiency in 2D models by standardizing the CT images obtained from different CT scanners, and achieved the highest AUCs in both training and validation cohort. In addition, higher net benefits could be obtained with ComBat correction in decision curve analysis, thus patients could benefit from treatment optimization and avoid unnecessary risks.

In this study, the proposed radiomics model provided potential clinical utility from the following perspectives. For patients with advanced unresectable esophageal cancer, the established radiomics model could screen out the potential responders to ICI + CT prior to treatment which would improve effeacy. On the other hand, due to the high cost of immunotherapy, preselecting the potential responders prior to treatment could reduce the economic burden to patients and maximize their benefits, which was particularly important in developing countries. In addition, as a non-invasive biomarker, CT imaging could overcome the problem of tumor heterogeneity. Some other indicators, such as PD-L1 expression, obtained by fine needle aspiration biopsy could not represent its real status in the whole tumor tissue, so the detection results might be biased due to tumor heterogeneity. Finally, in our study, we recommended 2D radiomics features because one-slice 2D annotation was a much more time-saving data processing with significantly higher prediction efficacy than that of whole-volume 3D annotation.

This study has several limitations. First, our findings deserve further extra external validation with larger sample size and inclusion of other medical centers. A large-scale study enrolling more patients is deserved and may definitely help validate and improve its applicability as an effective prediction tool for assisting treatment decision making. Second, due to the limited patient number of other histologic types of EC in our center, adenocarcinoma and signet ring cell carcinoma, therefore, were not included in the present study. This limits the application of the built model to some extent. Third, due to the limited spatial resolution of CT, there may be bias in the determination of the boundary between the lesion and the normal esophageal tissue when conducting ROIs segmentation.

Conclusions

In conclusion, the proposed CT-based radiomics model performs well and thereby is expected to serve as an alternative tool to select the potential best responders to ICI + CT prior to treatment for patients with ESCC, thus can assist treatment decision making in the clinical setting.