Introduction

The use of automated medical image analysis by AI-based algorithms has generated great enthusiasm: world-class radiological evaluations may become frequently available for low-income countries, rural areas, or physicians in training [1]. Moreover, the automated evaluation of images may help radiologists in managing the increasing workload demands [2]. Algorithms for medical image analysis are developed either by using hand-crafted image features (extracted automatically or by human readers) that are analyzed by machine learning algorithms or by using deep learning techniques that do not require prior feature extraction [1]. Such algorithms have already shown great diagnostic performance comparable to human expert readers in some areas [3]. However, a recent survey among the members of the American College of Radiology and the Radiological Society of North America showed that very few physicians use such imaging algorithms in their practice (about 30%, mainly for research purposes) and that among those, 93% reported inconsistent results of these algorithms in practice. About 95% said they would not put their faith into a diagnosis solely made by an algorithm (although some of them have FDA clearance) [4]. The discrepancy between the excellent performance reported by newly developed imaging algorithms and their non-use in clinical practice as well as the reluctance expressed by human imaging experts seems striking. An explanation for this may be that algorithms which are trained on image data alone may perform on par with human image readers when looking only at those images — but this does not represent the clinical reality in which imaging information (of multiple imaging modalities) is often considered alongside contextualizing clinical and demographic information.

Taking breast cancer diagnosis as an example, several imaging modalities (usually ultrasound and mammography, sometimes MRI) are used to evaluate indeterminate breast masses in combination with clinical and demographic information like patient age, suspicious palpability, disease history, and family medical history [5, 6]. Especially breast ultrasound has been under intense evaluation over the past years as it showed potential to identify cancers that are initially missed in mammography but ultrasound also leads to more false-positive findings [7]. The absent integration of contextualizing clinical and demographic information and of different imaging modalities into AI-based, diagnostic algorithms (especially in breast cancer diagnosis) may restrict the current performance of those diagnostic models. Although this knowledge gap has important implications for clinical practice, it has not been addressed systematically yet.

In this study, we compared the diagnostic performance of routine breast cancer diagnosis to breast ultrasound interpretations by humans or AI-based algorithms which were trained either on unimodal information (ultrasound features) or on multi-modal information (clinical and demographic information in addition to ultrasound features) to classify breast masses. We hypothesized that both humans and AI-based algorithms can improve their performance when considering multi-modal instead of unimodal information. For our analysis, we used data of an international multicenter trial that evaluated the use of a new ultrasound technique compared to traditional B-mode breast ultrasound [8].

Material and methods

Patient recruitment and selection

Patients were recruited as part of an international multicenter trial (NCT02638935). The trial was conducted at 12 study sites across 7 countries (Austria, France, Germany, Japan, Portugal, The Netherlands, the USA) from February 2016 to March 2019. Women aged 18 years or older who presented with an indeterminate breast mass ≥ 0.5 and ≤ 5 cm in largest diameter size in 2D B-mode ultrasound were enrolled. Only one mass per patient was included. As by requirement of the parental trial, all patients underwent histopathological confirmation.

Design and definitions

In the clinical routine, a breast mass was classified as (potentially) benign or malignant after evaluating different imaging modalities (mammography, 2D B-mode ultrasound, and/or MRI, as applicable in clinical routine) alongside additional demographic and clinical information about the patients’ age, disease history, and family medical history. Three physicians specialized in ultrasound diagnosis from separate study sites performed a second read of all ultrasound images, without access to any clinical information on patients. The three ultrasound experts, who had 10 to 30 years of experience in breast cancer diagnosis, consisted of one radiology professor, one professor specialized in breast diagnosis (head of breast diagnosis), and one senior physician specialized in breast diagnosis (head of breast diagnosis).

The risk of malignancy was evaluated according to the American College of Radiology (ACR) BI-RADS criteria and a BI-RADS score was assigned for all patients in the clinical routine and by the ultrasound experts. BI-RADS assigns risk categories to breast masses: BI-RADS III is assigned for patients with a risk of malignancy > 0% but ≤ 2%, BI-RADS IV for > 2% but ≤ 95%, and BI-RADS V for > 95% risk of malignancy. To further refine this broad risk assessment, a continuous likelihood score of malignancy was assigned for all patients in addition to the BI-RADS score. Of the single variables that were considered to evaluate the risk of malignancy in the clinical routine, the single BI-RADS descriptors of the ultrasound evaluation, patient age, and palpability of the lesion were specifically documented for this trial.

For comparison, we developed and validated two machine learning (ML) algorithms trained on unimodal information (ultrasound features generated by the ultrasound experts, see Table 1) to classify breast masses. The same ML algorithms were subsequently trained on multi-modal information (clinical and demographic information in addition to ultrasound features). The full list of variables is shown in Table 1.

Table 1 Distribution of baseline and outcomes variables in the whole cohort and in the development and validation datasets

Following ACR BI-RADS guidelines, we assumed breast masses to be malignant when the risk of malignancy was equal to or above 2% according to BI-RADS 4 or 5. All patients underwent histopathologic confirmation against which the diagnostic predictions were compared.

Algorithm development

Choice of algorithms and reporting on them were informed by guidelines on how to use ML in medicine [9], how to report findings of diagnostic tests [10], and multivariate prediction models [11] as well as previously published research by our group [12,13,14,15] We developed and validated two algorithms to predict malignancy of a breast mass:

  1. 1.

    Logistic regression (LR) with elastic net penalty: We chose this algorithm because of its ability to attenuate the influence of certain predictors on the model, leading to greater generalizability to new datasets [16, 17]. This algorithm is limited to identifying linear relations between the predictor variables and the outcome.

  2. 2.

    Extreme gradient boosting (XGBoost) tree: Gradient boosting refers to a machine learning technique in which the final prediction model consists of an ensemble of several, stepwise built models [18]. Gradient boosting is commonly applied to decision trees which results in an ensemble model combining the prediction of several trees. We chose this algorithm because of its ability to identify more complex, non-linear patterns in data while still being interpretable.

Algorithms were trained and tuned on the development set using tenfold cross-validation; a hypergrid-search was used for hyperparameter tuning (see Supplementary Appendix for optimal hyperparameters, the results of the internal testing, and data preparation steps). The final model was then externally validated using an independent dataset. As this was a large international multicenter trial, we selected one trial site as an independent validation dataset on which the final model was (externally) validated. Guidelines for multivariable risk prediction models recommend validation of such a model in a dataset of at least 100 events [11]. Only one trial site had at least 100 events and was thus used as validation set (study site 1 of the parental trial) [8]. The other 11 trial sites were used as a development set.

We provide a more detailed description of all algorithms and the algorithm development as well as a detailed evaluation of our study according to the abovementioned guidelines [9,10,11]. in the online Supplementary Appendix.

Statistical analysis

Descriptive statistics including absolute and relative frequencies as well as chi-square tests for categorical data and mean and standard deviation were used alongside t-tests for continuous data to compare the distribution of baseline and outcome variables in the development and validation sets.

To assess the diagnostic performance in classifying breast masses of the clinical routine, the ultrasound experts, the unimodal and multi-modal ultrasound ML algorithms, area under the receiver-operating characteristics curve (AUC), and accompanying 95% confidence intervals were calculated for every model using 2000 bootstrap replicates that were drawn from the validation dataset and stratified for the outcome variable (malignant/benign). We conducted subgroups analyses to compare the AUC of the unimodal and multi-modal ultrasound ML models in the external validation set across different age groups (< 50 years, ≥ 50 years) and across different histopathologic subgroups (malignant vs. benign).

Additionally, we compared sensitivity, specificity, and negative- and positive-predictive values to the gold standard of histopathologic evaluation and against each other; we computed 95% Clopper-Pearson confidence intervals.

Calibration of the ML models was evaluated using calibration plots (observed vs. predicted probabilities [19]) and Spiegelhalter’s Z statistic [20].

No multiplicity adjustments against type-I-error inflation were performed. Thus, these analyses are of descriptive nature. All p values must be interpreted descriptively and have no confirmatory value. Analysis was conducted using R software, Version 3.6.1 (the “caret” package of R was used for the model development).

Ethical considerations

The trial was approved by all respective ethical committees and all participants gave their written informed consent. The research reported in this article complies with the Declaration of Helsinki.

Results

Patient recruitment

A total of 1294 women were enrolled. Six were excluded from the analysis because no data on the pathologic evaluation was available. The remaining 1288 underwent full clinical breast evaluations including clinical examinations and multi-modal imaging (B-mode breast ultrasound, mammography, MRI as applicable in clinical routine) followed by histopathologic evaluation of the mass.

Baseline demographic and clinical characteristics

Table 1 shows the distribution of baseline and outcome variables for the whole cohort and in the development and external validation datasets that were used for the algorithm development and validation. In the whole cohort, the mean age was 46 years (standard deviation 16.0) and 368 of 1288 breast masses (28.6%) were malignant as confirmed by histopathology. When comparing the development (n = 915) and the external validation (n = 373) datasets, the validation set had a significantly higher proportion of histopathologically malignant masses (33.8% vs. 26.4%, p = 0.010) and a higher proportion of masses with clinically suspicious palpability (63.5% vs. 45.9%, p < 0.001), as well as some variations in the tissue composition, mass margins, echo pattern, and posterior features in the B-mode ultrasound images.

Diagnostic performance evaluation

Diagnostic performance metrics of the clinical routine, of the three ultrasound experts, of the unimodal ultrasound ML algorithms, and of the multi-modal ultrasound ML algorithms in the validation set are shown in Table 2. AUROC in the clinical routine was 0.95 (95% CI 0.93 to 0.97), for the multi-modal ultrasound LR with elastic net penalty algorithm 0.90 (95% CI 0.87 to 0.93), for the multi-modal ultrasound XGBoost tree algorithm 0.89 (95% CI 0.85 to 0.92), for the unimodal ultrasound LR with elastic net penalty algorithm 0.83 (95% CI 0.78 to 0.87), for the unimodal ultrasound XGBoost tree algorithm 0.82 (95% CI 0.77 to 0.86), and 0.82 (95% CI 0.77 to 0.87), 0.82 (95% CI 0.77 to 0.87), and 0.84 (95% CI 0.79 to 0.89) for the ultrasound experts 1–3, respectively.

Table 2 Diagnostic performance of routine clinical breast diagnosis, the three ultrasound experts, the unimodal ultrasound machine learning algorithms, and the multi-modal ultrasound machine learning algorithms in the validation set

Figure 1 summarizes the comparison in diagnostic performance between the different approaches: the performance of the unimodal ultrasound ML algorithms did not differ significantly from the performance of the ultrasound experts (p 0.361 to 0.935); the multi-modal ultrasound ML algorithms performed significantly better compared to all ultrasound experts and the unimodal algorithms (all p < 0.001); the clinical routine breast diagnosis performed significantly better compared to all other approaches (all p < 0.01). Corresponding ROC curves are illustrated in Fig. 2.

Fig. 1
figure 1

Performance comparison between the clinical routine, the ultrasound experts, the unimodal machine learning algorithms, and the multi-modal machine learning algorithms

Fig. 2
figure 2

Receiver operating characteristic curves of the clinical routine, the ultrasound experts, the unimodal machine learning algorithms, and the multi-modal machine learning algorithms

Calibration plots of the ML models are illustrated in Supplemental Fig. 1 and indicate well-calibrated models, which was confirmed by Spiegelhalter’s Z (good calibration in 3 out of 4 models, see Supplementary Appendix). The unimodal LR with elastic net penalty showed an impaired calibration for mid-range probabilities of malignancy.

Insights into model predictions and traditional multivariable logistic regression

Predictive coefficients of the unimodal and multi-modal LR with elastic net penalty are listed in Table 3. For the multi-modal algorithm, patient age was the most important predictor of malignancy (regularized ß = 7.60, 95% CI 7.53 to 7.73), followed by spiculated margins (regularized ß = 1.10, 95% CI 0.21 to 1.99), a not parallel orientation of the mass (regularized ß = 0.88, 95% CI 0.52 to 1.24), clinically suspicious palpability (regularized ß = 0.84, 95% CI 0.56 to 1.28), and an irregular shape (regularized ß = 0.54, 95% CI 0.0 to 1.08).

Table 3 Predictive coefficients of the uni- and multi-modal logistic regression with elastic net penalty algorithms

Figure 3 illustrates insights into the variable importance for the predictions made by the XGBoost tree via Shapley Additive Explanations (SHAP) values.

Fig. 3
figure 3

Ultrasound Images. a This patient’s ultrasound images were evaluated to show a benign breast mass by the three ultrasound experts but to show a malignant breast mass by full clinical breast evaluation. This patient was 41 years old with a positive family history for breast cancer and a clinically suspicious palpable tumor. Histopathology showed a luminal B, NST, G3 carcinoma. b This patient’s ultrasound images were evaluated to show a benign breast mass by the three physician experts and by full clinical breast evaluation. This patient was 25 years old without any clinically suspicious signs. Histopathology showed a fibroadenoma

For comparison, odds ratios of a traditional multivariable logistic regression are listed in Table 4.

Table 4 Traditional multivariable logistic regression

Subgroup analyses

We evaluated the diagnostic performance of the multi-modal ultrasound ML models in the external validation set across different patient subgroups (see Supplemental Table 1). The algorithms performed equally well across different age groups (< 50 years and ≥ 50 years, p > 0.05). The algorithms showed higher performance among patients with benign compared to malignant histopathology (p < 0.05). Detailed AUC values are listed in Supplemental Table 1.

Further analyses

Table 5 shows the diagnostic performance of the clinical routine and of the three ultrasound experts in the whole cohort of 1288 patients. Their performance in the whole cohort and in the validation set did not differ significantly.

Table 5 Diagnostic performance of the clinical routine and of the three ultrasound experts in the whole cohort (n = 1288)

Exemplary images

Ultrasound images of two exemplary patients are shown in Fig. 4 and illustrate the importance of contextualizing demographic and clinical patient information.

Fig. 4
figure 4

Shapley Additive Explanations (SHAP) Value Summary Plot of the Extreme Gradient Boosting (XGBoost) Tree Model. a XGboost – unimodal Algorithm. SHAP values on the left side of the x-axis indicate that the variable was important for predicting malignancy; values on the right side indicate that the variable was important for predicting a benign breast mass. Purple indicates a high variable value (e.g. margin – non-circumscribed: yes); yellow indicates a low variable value (e.g age: lower patient age). The values on the y-axis represent the overall global variable importance. b XGboost – multi-modal Algorithm

Discussion

In this study, we compared the diagnostic performance of routine breast cancer diagnosis to breast ultrasound interpretations by humans or AI-based algorithms which were trained either on unimodal information (ultrasound features) or on multi-modal information (clinical and demographic information in addition to ultrasound features) to classify breast masses. Our classification algorithms showed equivalent or better performance compared to human readers in the classification of breast masses on ultrasound images. We show that beyond-human performance on imaging classification tasks does not necessarily yield state-of-the-art diagnostic decisions when compared to physicians who can evaluate multiple imaging sources alongside other relevant demographic and clinical information. We demonstrate that AI-based algorithms, like humans, can improve diagnostic accuracy of breast cancer classification by considering image data in combination with data on individuals’ demographics and clinical status. Contextualizing clinical and demographic information is a key element in the diagnostic pathway — even when imaging interpretation is optimized or enhanced by AI-based algorithms, there is an inherent limitation in accuracy of using only one imaging modality for breast cancer diagnosis. Further work is warranted to develop and evaluate individualized diagnostic models which combine imaging with comprehensive clinical and demographic data to better represent the diagnostic pathway of routine clinical breast diagnosis.

In interpreting our findings, some points should be further discussed. First, even the most AI-based imaging algorithms might be limited when evaluating only images of one imaging modality without contextualizing clinical and demographic information. This becomes evident when looking at the two exemplary patients whose ultrasound images are illustrated in Fig. 4. Moreover, a recent systematic review on AI-based image analysis identified 9 studies in the field of breast imaging [3]. Of these 9 studies, all reported that the developed algorithm showed a diagnostic performance comparable to that of human experts but all compared the performance against human image readers and not against full diagnostic evaluations in the clinical routine, all algorithms were trained on unimodal imaging information (7 ultrasound, 2 mammography), and only 3 were externally validated [21,22,23,24,25,26,27,28,29]. Analyzing contextualizing patient information for complex risk assessments by AI-based algorithms has yielded promising results in other fields [13,14,15] Thus, the absent integration of contextualizing clinical and demographic information and of different imaging modalities into AI-based, diagnostic algorithms (especially in breast cancer diagnosis) may not only restrict the current performance of those models —the common claim that some of these models have already achieved a diagnostic performance similar to human experts could also give clinicians a false sense of security when using image algorithms that have, however, not yet been (prospectively) compared against clinical routine performance. Moreover, AI-based algorithms in the field of breast imaging are often compared to the categorical BI-RADS assessment. As AI-based algorithms produce a continuous risk of malignancy as output, this may inherently lead to a higher performance when comparing AI-based algorithms with BI-RADS categories. In our study, a continuous likelihood score of malignancy was assigned for all patients in addition to the BI-RADS score. While this approach is not validated and may still lead to bias towards higher performance for AI-based algorithms it may enable a fairer comparison between AI-based algorithms and the BI-RADS categories assigned by humans.

Second, our multi-modal ultrasound ML algorithms were trained on image features as well as clinical and demographic information, but the amount of documented, explainable information was limited to ultrasound features, patient age, and clinically suspicious palpability. Further work is warranted to develop (more reliable) diagnostic models which combine imaging with comprehensive clinical and demographic data to fully represent the clinical reality (see commonly considered variables according to the National Comprehensive Cancer Network guideline for breast cancer screening [5]). Moreover, current research evaluates the feasibility of automated breast ultrasound and its combination with digital breast tomosynthesis which may further advance (automated) multi-modal breast image analysis in the future [30, 31].

Third, relying on traditional group-level associations may contribute to the ongoing discussion about high false-positive rates in breast diagnosis [32], which was also observed in our study (46% specificity for the clinical routine assessment in the whole cohort, Table 5). Individualized predictions by complex risk models may help improve diagnostic accuracy to avoid physical and psychological distress for patients and reduce treatment burden for providers and healthcare systems.

Fourth, algorithms for medical image analysis or classification are developed either by using hand-crafted image features that are analyzed by ML algorithms or by using deep learning techniques that do not require prior feature extraction [1]. In our study, we used the first approach. Although deep learning techniques showed great potential for automated medical image analysis in the past decade, they commonly do not outperform humans in image detection or classification tasks [3]. In fact, for some classification tasks, analyzing hand-crafted image features showed to be superior to deep learning approaches in small- to medium-sized datasets [33]. As the aim of our present analysis was to demonstrate the inherent limitations of developing AI-based algorithms on unimodal instead of multi-modal information and comparing their performance against image readers instead of clinical routine decisions, we do not expect the choice of feature-based machine learning instead of deep learning algorithms to limit our findings.

Fifth, our ultrasound experts performed a second read of all ultrasound images instead of performing the examination themselves. Although some evidence suggests that the interpretation of dynamic videos versus static images does not impair diagnostic performance, this may have caused some bias in our study and may have influenced the performance of the ultrasound experts [34].

Conclusions

We show that beyond-human performance on imaging classification tasks does not necessarily yield state-of-the-art diagnostic decisions when compared to physicians who can evaluate multiple imaging sources alongside other relevant demographic and clinical information. AI-based algorithms that are not developed on multi-modal routine information (imaging, demographic, and clinical information) and that are subsequently not compared to the performance of this clinical routine may not represent state-of-the-art diagnostic performance. Confidence in AI-based algorithms that rely on solely one imaging modality may result in a misleading sense of security among clinicians. Further work is warranted to develop and evaluate individualized diagnostic models which combine imaging with comprehensive clinical and demographic data to better represent the diagnostic pathway of routine clinical breast diagnosis.