The importance of multi-modal imaging and clinical information for humans and AI-based algorithms to classify breast masses (INSPiRED 003): an international, multicenter analysis

Pfob, André; Sidey-Gibbons, Chris; Barr, Richard G.; Duda, Volker; Alwafai, Zaher; Balleyguier, Corinne; Clevert, Dirk-André; Fastner, Sarah; Gomez, Christina; Goncalo, Manuela; Gruber, Ines; Hahn, Markus; Hennigs, André; Kapetas, Panagiotis; Lu, Sheng-Chieh; Nees, Juliane; Ohlinger, Ralf; Riedel, Fabian; Rutten, Matthieu; Schaefgen, Benedikt; Schuessler, Maximilian; Stieber, Anne; Togawa, Riku; Tozaki, Mitsuhiro; Wojcinski, Sebastian; Xu, Cai; Rauch, Geraldine; Heil, Joerg; Golatta, Michael

doi:10.1007/s00330-021-08519-z

European Radiology
Breast
Open access
Published: 17 February 2022
Volume 32, pages 4101–4115 (2022)

André Pfob^1,2,
Chris Sidey-Gibbons^2,3,
Richard G. Barr⁴,
Volker Duda⁵,
Zaher Alwafai⁶,
Corinne Balleyguier⁷,
Dirk-André Clevert⁸,
Sarah Fastner¹,
Christina Gomez¹,
Manuela Goncalo⁹,
Ines Gruber¹⁰,
Markus Hahn¹⁰,
André Hennigs¹,
Panagiotis Kapetas¹¹,
Sheng-Chieh Lu^2,3,
Juliane Nees¹,
Ralf Ohlinger⁶,
Fabian Riedel¹,
Matthieu Rutten^12,13,
Benedikt Schaefgen¹,
Maximilian Schuessler¹⁴,
Anne Stieber¹,
Riku Togawa¹,
Mitsuhiro Tozaki¹⁵,
Sebastian Wojcinski¹⁶,
Cai Xu^2,3,
Geraldine Rauch¹⁷,
Joerg Heil¹ &
…
Michael Golatta ORCID: orcid.org/0000-0002-2605-0060¹

Abstract

Objectives

AI-based algorithms for medical image analysis showed comparable performance to human image readers. However, in practice, diagnoses are made using multiple imaging modalities alongside other data sources. We determined the importance of this multi-modal information and compared the diagnostic performance of routine breast cancer diagnosis to breast ultrasound interpretations by humans or AI-based algorithms.

Methods

Patients were recruited as part of a multicenter trial (NCT02638935). The trial enrolled 1288 women undergoing routine breast cancer diagnosis (multi-modal imaging, demographic, and clinical information). Three physicians specialized in ultrasound diagnosis performed a second read of all ultrasound images. We used data from 11 of 12 study sites to develop two machine learning (ML) algorithms using unimodal information (ultrasound features generated by the ultrasound experts) to classify breast masses which were validated on the remaining study site. The same ML algorithms were subsequently developed and validated on multi-modal information (clinical and demographic information plus ultrasound features). We assessed performance using area under the curve (AUC).

Results

Of 1288 breast masses, 368 (28.6%) were histopathologically malignant. In the external validation set (n = 373), the performance of the two unimodal ultrasound ML algorithms (AUC 0.83 and 0.82) was commensurate with performance of the human ultrasound experts (AUC 0.82 to 0.84; p for all comparisons > 0.05). The multi-modal ultrasound ML algorithms performed significantly better (AUC 0.90 and 0.89) but were statistically inferior to routine breast cancer diagnosis (AUC 0.95, p for all comparisons ≤ 0.05).

Conclusions

The performance of humans and AI-based algorithms improves with multi-modal information.

Key Points

• The performance of humans and AI-based algorithms improves with multi-modal information.

• Multi-modal AI-based algorithms do not necessarily outperform expert humans.

• Unimodal AI-based algorithms do not represent optimal performance to classify breast masses.

Introduction

The use of automated medical image analysis by AI-based algorithms has generated great enthusiasm: world-class radiological evaluations may become frequently available for low-income countries, rural areas, or physicians in training [1]. Moreover, the automated evaluation of images may help radiologists in managing the increasing workload demands [2]. Algorithms for medical image analysis are developed either by using hand-crafted image features (extracted automatically or by human readers) that are analyzed by machine learning algorithms or by using deep learning techniques that do not require prior feature extraction [1]. Such algorithms have already shown great diagnostic performance comparable to human expert readers in some areas [3]. However, a recent survey among the members of the American College of Radiology and the Radiological Society of North America showed that very few physicians use such imaging algorithms in their practice (about 30%, mainly for research purposes) and that among those, 93% reported inconsistent results of these algorithms in practice. About 95% said they would not put their faith into a diagnosis solely made by an algorithm (although some of them have FDA clearance) [4]. The discrepancy between the excellent performance reported by newly developed imaging algorithms and their non-use in clinical practice as well as the reluctance expressed by human imaging experts seems striking. An explanation for this may be that algorithms which are trained on image data alone may perform on par with human image readers when looking only at those images — but this does not represent the clinical reality in which imaging information (of multiple imaging modalities) is often considered alongside contextualizing clinical and demographic information.

Taking breast cancer diagnosis as an example, several imaging modalities (usually ultrasound and mammography, sometimes MRI) are used to evaluate indeterminate breast masses in combination with clinical and demographic information such as patient age, suspicious palpability, disease history, and family medical history [5, 6]. Breast ultrasound in particular has been under intense evaluation in recent years: it has shown potential to identify cancers that are initially missed on mammography, but it also leads to more false-positive findings [7]. The lack of integration of contextualizing clinical and demographic information and of different imaging modalities into AI-based diagnostic algorithms (especially in breast cancer diagnosis) may restrict the current performance of those diagnostic models. Although this knowledge gap has important implications for clinical practice, it has not yet been addressed systematically.

In this study, we compared the diagnostic performance of routine breast cancer diagnosis to breast ultrasound interpretations by humans or AI-based algorithms which were trained either on unimodal information (ultrasound features) or on multi-modal information (clinical and demographic information in addition to ultrasound features) to classify breast masses. We hypothesized that both humans and AI-based algorithms can improve their performance when considering multi-modal instead of unimodal information. For our analysis, we used data of an international multicenter trial that evaluated the use of a new ultrasound technique compared to traditional B-mode breast ultrasound [8].

Material and methods

Patient recruitment and selection

Patients were recruited as part of an international multicenter trial (NCT02638935). The trial was conducted at 12 study sites across 7 countries (Austria, France, Germany, Japan, Portugal, the Netherlands, and the USA) from February 2016 to March 2019. Women aged 18 years or older who presented with an indeterminate breast mass measuring ≥ 0.5 cm and ≤ 5 cm in largest diameter on 2D B-mode ultrasound were enrolled. Only one mass per patient was included. As required by the parent trial, all patients underwent histopathological confirmation.

Design and definitions

In the clinical routine, a breast mass was classified as (potentially) benign or malignant after evaluating different imaging modalities (mammography, 2D B-mode ultrasound, and/or MRI, as applicable in clinical routine) alongside additional demographic and clinical information about the patients’ age, disease history, and family medical history. Three physicians specialized in ultrasound diagnosis from separate study sites performed a second read of all ultrasound images, without access to any clinical information on patients. The three ultrasound experts, who had 10 to 30 years of experience in breast cancer diagnosis, consisted of one radiology professor, one professor specialized in breast diagnosis (head of breast diagnosis), and one senior physician specialized in breast diagnosis (head of breast diagnosis).

The risk of malignancy was evaluated according to the American College of Radiology (ACR) BI-RADS criteria and a BI-RADS score was assigned for all patients in the clinical routine and by the ultrasound experts. BI-RADS assigns risk categories to breast masses: BI-RADS III is assigned for patients with a risk of malignancy > 0% but ≤ 2%, BI-RADS IV for > 2% but ≤ 95%, and BI-RADS V for > 95% risk of malignancy. To further refine this broad risk assessment, a continuous likelihood score of malignancy was assigned for all patients in addition to the BI-RADS score. Of the single variables that were considered to evaluate the risk of malignancy in the clinical routine, the single BI-RADS descriptors of the ultrasound evaluation, patient age, and palpability of the lesion were specifically documented for this trial.

For comparison, we developed and validated two machine learning (ML) algorithms trained on unimodal information (ultrasound features generated by the ultrasound experts, see Table 1) to classify breast masses. The same ML algorithms were subsequently trained on multi-modal information (clinical and demographic information in addition to ultrasound features). The full list of variables is shown in Table 1.

Table 1 Distribution of baseline and outcome variables in the whole cohort and in the development and validation datasets

Following ACR BI-RADS guidelines, we considered a breast mass to be classified as malignant when the risk of malignancy was equal to or above 2% (BI-RADS IV or V). All patients underwent histopathologic confirmation, against which the diagnostic predictions were compared.
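The risk bands and the dichotomization described above can be sketched as follows. This is a minimal illustration, not the trial's code: the function names are hypothetical, the band edges are the ones quoted in the text, and the treatment of a risk of exactly 0% or exactly 2% follows those band edges as an assumption.

```python
def birads_category(risk: float) -> str:
    """Map a likelihood of malignancy in [0, 1] to the BI-RADS bands quoted in the text."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must lie in [0, 1]")
    if risk == 0.0:
        return "below III"  # assumption: the text defines no band for a risk of exactly 0%
    if risk <= 0.02:
        return "III"  # > 0% but <= 2%
    if risk <= 0.95:
        return "IV"   # > 2% but <= 95%
    return "V"        # > 95%


def called_malignant(risk: float) -> bool:
    """Dichotomize: BI-RADS IV or V counts as a positive (malignant) call."""
    return birads_category(risk) in {"IV", "V"}


# Three invented likelihood scores, one per band.
for risk in (0.01, 0.30, 0.99):
    print(risk, birads_category(risk), called_malignant(risk))
```

Because BI-RADS IV spans 2% to 95%, the dichotomized call discards most of the ranking information carried by the continuous likelihood score.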

Algorithm development

The choice of algorithms and their reporting were informed by guidelines on how to use ML in medicine [9], how to report findings of diagnostic tests [10], and how to report multivariable prediction models [11], as well as by previously published research by our group [12,13,14,15]. We developed and validated two algorithms to predict malignancy of a breast mass:

1. Logistic regression (LR) with elastic net penalty: We chose this algorithm because of its ability to attenuate the influence of certain predictors on the model, leading to greater generalizability to new datasets [16, 17]. This algorithm is limited to identifying linear relations between the predictor variables and the outcome.
2. Extreme gradient boosting (XGBoost) tree: Gradient boosting refers to a machine learning technique in which the final prediction model consists of an ensemble of several, stepwise built models [18]. Gradient boosting is commonly applied to decision trees which results in an ensemble model combining the prediction of several trees. We chose this algorithm because of its ability to identify more complex, non-linear patterns in data while still being interpretable.
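For orientation, the two model families can be written compactly. The notation below is generic textbook notation rather than the article's own. The first expression is the elastic net objective for logistic regression in the parameterization used by reference [17], where the tuning parameters λ (penalty strength) and α (mix of lasso and ridge) are symbols, not values reported here. The second is a boosted tree ensemble with M trees f_m, learning rate η, and logistic link σ, all of which are assumed names.

```latex
% Elastic net logistic regression: penalized negative log-likelihood
\hat{\beta}_0, \hat{\beta}
  = \arg\min_{\beta_0,\,\beta}
    \left[
      -\frac{1}{n} \sum_{i=1}^{n}
        \Bigl( y_i \,(\beta_0 + x_i^{\top}\beta)
               - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr)
      + \lambda \Bigl( \tfrac{1-\alpha}{2}\,\lVert \beta \rVert_2^2
                       + \alpha\,\lVert \beta \rVert_1 \Bigr)
    \right]

% Gradient-boosted trees: stepwise additive ensemble
\hat{p}(x) = \sigma\!\left( \sum_{m=1}^{M} \eta\, f_m(x) \right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

The penalty term shrinks coefficients toward zero, which is the attenuation of predictor influence mentioned in item 1. The sum over trees is the stepwise built ensemble mentioned in item 2.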

Algorithms were trained and tuned on the development set using tenfold cross-validation; a hypergrid-search was used for hyperparameter tuning (see Supplementary Appendix for optimal hyperparameters, the results of the internal testing, and data preparation steps). The final model was then externally validated using an independent dataset. As this was a large international multicenter trial, we selected one trial site as an independent validation dataset on which the final model was (externally) validated. Guidelines for multivariable risk prediction models recommend validation of such a model in a dataset of at least 100 events [11]. Only one trial site had at least 100 events and was thus used as the validation set (study site 1 of the parent trial) [8]. The other 11 trial sites were used as the development set.
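The site-based split can be illustrated with a short sketch. It assumes each record is a pair of a site identifier and a malignancy flag. The function name and the toy counts are hypothetical and do not reproduce the trial's per-site numbers.

```python
from collections import Counter


def split_by_site(records, min_events=100):
    """Pick the single site with at least `min_events` malignant cases as the
    external validation set; all remaining sites form the development set.

    records: iterable of (site_id, is_malignant) pairs.
    """
    events = Counter()
    sites = set()
    for site, malignant in records:
        sites.add(site)
        events[site] += int(malignant)
    eligible = sorted(s for s in sites if events[s] >= min_events)
    if len(eligible) != 1:
        raise ValueError(f"expected exactly one eligible site, found {eligible}")
    validation_site = eligible[0]
    return validation_site, sorted(sites - {validation_site})


# Toy data (invented): only site 1 reaches 100 events.
toy = (
    [(1, True)] * 126 + [(1, False)] * 247
    + [(2, True)] * 40 + [(2, False)] * 90
    + [(3, True)] * 25 + [(3, False)] * 60
)
validation_site, development_sites = split_by_site(toy)
print(validation_site, development_sites)
```

Holding out a whole site rather than a random sample means the validation patients come from a centre the model never saw, which is what makes the validation external.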

We provide a more detailed description of all algorithms and the algorithm development as well as a detailed evaluation of our study according to the abovementioned guidelines [9,10,11] in the online Supplementary Appendix.

Statistical analysis

We compared the distribution of baseline and outcome variables between the development and validation sets using descriptive statistics: absolute and relative frequencies with chi-square tests for categorical data, and means and standard deviations with t-tests for continuous data.

To assess the diagnostic performance of the clinical routine, the ultrasound experts, and the unimodal and multi-modal ultrasound ML algorithms in classifying breast masses, we calculated the area under the receiver-operating characteristic curve (AUC) with accompanying 95% confidence intervals for each approach, using 2000 bootstrap replicates that were drawn from the validation dataset and stratified for the outcome variable (malignant/benign). We conducted subgroup analyses to compare the AUC of the unimodal and multi-modal ultrasound ML models in the external validation set across different age groups (< 50 years, ≥ 50 years) and across different histopathologic subgroups (malignant vs. benign).
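A stratified bootstrap for the AUC can be sketched in plain Python as below. The article does not state which interval construction was used, so the percentile interval here is an assumption, and the function names and scores are invented for illustration.

```python
import random


def auc(scores_pos, scores_neg):
    """Mann-Whitney estimate of the AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def stratified_bootstrap_ci(scores_pos, scores_neg, n_boot=2000, level=0.95, seed=0):
    """Percentile CI; malignant and benign cases are resampled separately so
    every replicate keeps the original class sizes (stratification)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        boot_pos = rng.choices(scores_pos, k=len(scores_pos))
        boot_neg = rng.choices(scores_neg, k=len(scores_neg))
        stats.append(auc(boot_pos, boot_neg))
    stats.sort()
    lower_idx = round((1 - level) / 2 * n_boot)
    upper_idx = round((1 + level) / 2 * n_boot) - 1
    return stats[lower_idx], stats[upper_idx]


# Toy predicted risks (invented).
malignant_scores = [0.9, 0.8, 0.7, 0.4]
benign_scores = [0.6, 0.5, 0.3, 0.2, 0.1]
point = auc(malignant_scores, benign_scores)
ci = stratified_bootstrap_ci(malignant_scores, benign_scores)
print(point, ci)
```

Resampling within each outcome class guarantees that no replicate ends up without malignant or without benign cases, in which case the AUC would be undefined.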

Additionally, we compared sensitivity, specificity, and negative- and positive-predictive values to the gold standard of histopathologic evaluation and against each other; we computed 95% Clopper-Pearson confidence intervals.
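The threshold metrics and an exact Clopper-Pearson interval can be computed without external libraries, as in the sketch below. The interval is found by bisection on the binomial distribution function. All names and the confusion-matrix counts are hypothetical and are not taken from Table 2.

```python
from math import exp, lgamma, log


def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    total = 0.0
    for i in range(k + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_term)
    return min(total, 1.0)


def _bisect_increasing(f, target, iterations=60):
    """Solve f(p) = target on (0, 1) for a function f that increases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, level=0.95):
    """Exact two-sided interval for a proportion k/n."""
    alpha = 1 - level
    # Lower bound: P(X >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else _bisect_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2
    upper = 1.0 if k == n else _bisect_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


# Invented counts: 90 of 100 malignant masses detected.
metrics = diagnostic_metrics(tp=90, fp=40, fn=10, tn=60)
print(metrics["sensitivity"], clopper_pearson(90, 100))
```

The same interval function applies to each of the four metrics, since each is a simple proportion with its own numerator and denominator.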

Calibration of the ML models was evaluated using calibration plots (observed vs. predicted probabilities [19]) and Spiegelhalter’s Z statistic [20].
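Spiegelhalter's Z compares observed outcomes with predicted probabilities and is approximately standard normal when the model is well calibrated. The sketch below uses the usual closed form of the statistic with a two-sided normal p value. The function name and the toy inputs are illustrative assumptions.

```python
from math import erfc, sqrt


def spiegelhalter_z(outcomes, probabilities):
    """Spiegelhalter's Z statistic and a two-sided p value.

    outcomes: 0/1 labels; probabilities: predicted risks strictly useful when
    not all equal to 0.5 (the variance term is zero in that case).
    """
    numerator = sum((y - p) * (1 - 2 * p) for y, p in zip(outcomes, probabilities))
    variance = sum((1 - 2 * p) ** 2 * p * (1 - p) for p in probabilities)
    if variance == 0.0:
        raise ValueError("statistic is undefined when the variance term is zero")
    z = numerator / sqrt(variance)
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Invented example: predicted risk 0.25 for everyone.
z_good, p_good = spiegelhalter_z([1, 0, 0, 0], [0.25] * 4)  # 1 in 4 events observed
z_bad, p_bad = spiegelhalter_z([1, 1, 1, 1], [0.25] * 4)    # 4 in 4 events observed
print(z_good, p_good, z_bad, p_bad)
```

A large absolute Z (small p value) signals that predicted probabilities are systematically off, which complements the visual impression from a calibration plot.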

No multiplicity adjustments against type-I-error inflation were performed. Thus, these analyses are of descriptive nature. All p values must be interpreted descriptively and have no confirmatory value. Analysis was conducted using R software, Version 3.6.1 (the “caret” package of R was used for the model development).

Ethical considerations

The trial was approved by all respective ethical committees and all participants gave their written informed consent. The research reported in this article complies with the Declaration of Helsinki.

Results

Patient recruitment

A total of 1294 women were enrolled. Six were excluded from the analysis because no data on the pathologic evaluation was available. The remaining 1288 underwent full clinical breast evaluations including clinical examinations and multi-modal imaging (B-mode breast ultrasound, mammography, MRI as applicable in clinical routine) followed by histopathologic evaluation of the mass.

Baseline demographic and clinical characteristics

Table 1 shows the distribution of baseline and outcome variables for the whole cohort and in the development and external validation datasets that were used for the algorithm development and validation. In the whole cohort, the mean age was 46 years (standard deviation 16.0) and 368 of 1288 breast masses (28.6%) were malignant as confirmed by histopathology. When comparing the development (n = 915) and the external validation (n = 373) datasets, the validation set had a significantly higher proportion of histopathologically malignant masses (33.8% vs. 26.4%, p = 0.010) and a higher proportion of masses with clinically suspicious palpability (63.5% vs. 45.9%, p < 0.001), as well as some variations in the tissue composition, mass margins, echo pattern, and posterior features in the B-mode ultrasound images.
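As a plausibility check on the reported comparison of malignancy rates, a 2×2 chi-square test can be worked through by hand. The counts below are back-calculated from the reported percentages (33.8% of 373 and 26.4% of 915), which is an assumption. The use of Yates' continuity correction is also an assumption, since the article does not state it.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d, yates=True):
    """Chi-square test for a 2x2 table [[a, b], [c, d]] with 1 degree of freedom."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)  # continuity correction
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(stat / 2))  # chi-square survival function for 1 df
    return stat, p_value


# Reconstructed counts (assumption): malignant / benign per dataset.
val_malignant, val_benign = 126, 373 - 126
dev_malignant, dev_benign = 242, 915 - 242
stat, p_value = chi2_2x2(val_malignant, val_benign, dev_malignant, dev_benign)
print(round(stat, 2), round(p_value, 3))
```

The reconstructed malignant counts add up to the 368 malignant masses reported for the whole cohort, and under these assumptions the p value lands close to the reported 0.010.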

Diagnostic performance evaluation

Diagnostic performance metrics of the clinical routine, of the three ultrasound experts, of the unimodal ultrasound ML algorithms, and of the multi-modal ultrasound ML algorithms in the validation set are shown in Table 2. The AUC was 0.95 (95% CI 0.93 to 0.97) for the clinical routine, 0.90 (95% CI 0.87 to 0.93) for the multi-modal ultrasound LR with elastic net penalty algorithm, 0.89 (95% CI 0.85 to 0.92) for the multi-modal ultrasound XGBoost tree algorithm, 0.83 (95% CI 0.78 to 0.87) for the unimodal ultrasound LR with elastic net penalty algorithm, 0.82 (95% CI 0.77 to 0.86) for the unimodal ultrasound XGBoost tree algorithm, and 0.82 (95% CI 0.77 to 0.87), 0.82 (95% CI 0.77 to 0.87), and 0.84 (95% CI 0.79 to 0.89) for ultrasound experts 1–3, respectively.

Table 2 Diagnostic performance of routine clinical breast diagnosis, the three ultrasound experts, the unimodal ultrasound machine learning algorithms, and the multi-modal ultrasound machine learning algorithms in the validation set

Figure 1 summarizes the comparison in diagnostic performance between the different approaches: the performance of the unimodal ultrasound ML algorithms did not differ significantly from the performance of the ultrasound experts (p = 0.361 to 0.935); the multi-modal ultrasound ML algorithms performed significantly better than all ultrasound experts and the unimodal algorithms (all p < 0.001); the clinical routine breast diagnosis performed significantly better than all other approaches (all p < 0.01). Corresponding ROC curves are illustrated in Fig. 2.

Calibration plots of the ML models are illustrated in Supplemental Fig. 1 and indicate well-calibrated models, which was confirmed by Spiegelhalter’s Z (good calibration in 3 out of 4 models, see Supplementary Appendix). The unimodal LR with elastic net penalty showed an impaired calibration for mid-range probabilities of malignancy.

Insights into model predictions and traditional multivariable logistic regression

Predictive coefficients of the unimodal and multi-modal LR with elastic net penalty are listed in Table 3. For the multi-modal algorithm, patient age was the most important predictor of malignancy (regularized β = 7.60, 95% CI 7.53 to 7.73), followed by spiculated margins (regularized β = 1.10, 95% CI 0.21 to 1.99), a non-parallel orientation of the mass (regularized β = 0.88, 95% CI 0.52 to 1.24), clinically suspicious palpability (regularized β = 0.84, 95% CI 0.56 to 1.28), and an irregular shape (regularized β = 0.54, 95% CI 0.0 to 1.08).

Table 3 Predictive coefficients of the uni- and multi-modal logistic regression with elastic net penalty algorithms

Figure 3 illustrates insights into the variable importance for the predictions made by the XGBoost tree via Shapley Additive Explanations (SHAP) values.

For comparison, odds ratios of a traditional multivariable logistic regression are listed in Table 4.

Table 4 Traditional multivariable logistic regression

Subgroup analyses

We evaluated the diagnostic performance of the multi-modal ultrasound ML models in the external validation set across different patient subgroups (see Supplemental Table 1). The algorithms performed equally well across different age groups (< 50 years and ≥ 50 years, p > 0.05). The algorithms showed higher performance among patients with benign compared to malignant histopathology (p < 0.05). Detailed AUC values are listed in Supplemental Table 1.

Further analyses

Table 5 shows the diagnostic performance of the clinical routine and of the three ultrasound experts in the whole cohort of 1288 patients. Their performance in the whole cohort and in the validation set did not differ significantly.

Table 5 Diagnostic performance of the clinical routine and of the three ultrasound experts in the whole cohort (n = 1288)

Exemplary images

Ultrasound images of two exemplary patients are shown in Fig. 4 and illustrate the importance of contextualizing demographic and clinical patient information.

Discussion

In this study, we compared the diagnostic performance of routine breast cancer diagnosis to breast ultrasound interpretations by humans or AI-based algorithms which were trained either on unimodal information (ultrasound features) or on multi-modal information (clinical and demographic information in addition to ultrasound features) to classify breast masses. Our classification algorithms showed equivalent or better performance compared to human readers in the classification of breast masses on ultrasound images. We show that beyond-human performance on imaging classification tasks does not necessarily yield state-of-the-art diagnostic decisions when compared to physicians who can evaluate multiple imaging sources alongside other relevant demographic and clinical information. We demonstrate that AI-based algorithms, like humans, can improve diagnostic accuracy of breast cancer classification by considering image data in combination with data on individuals’ demographics and clinical status. Contextualizing clinical and demographic information is a key element in the diagnostic pathway — even when imaging interpretation is optimized or enhanced by AI-based algorithms, there is an inherent limitation in accuracy of using only one imaging modality for breast cancer diagnosis. Further work is warranted to develop and evaluate individualized diagnostic models which combine imaging with comprehensive clinical and demographic data to better represent the diagnostic pathway of routine clinical breast diagnosis.

In interpreting our findings, some points should be further discussed. First, even the most advanced AI-based imaging algorithms might be limited when evaluating only images of one imaging modality without contextualizing clinical and demographic information. This becomes evident when looking at the two exemplary patients whose ultrasound images are illustrated in Fig. 4. Moreover, a recent systematic review on AI-based image analysis identified 9 studies in the field of breast imaging [3]. All 9 studies reported that the developed algorithm showed a diagnostic performance comparable to that of human experts. However, all compared the performance against human image readers and not against full diagnostic evaluations in the clinical routine, all algorithms were trained on unimodal imaging information (7 ultrasound, 2 mammography), and only 3 were externally validated [21,22,23,24,25,26,27,28,29]. Analyzing contextualizing patient information for complex risk assessments by AI-based algorithms has yielded promising results in other fields [13,14,15]. Thus, the lack of integration of contextualizing clinical and demographic information and of different imaging modalities into AI-based diagnostic algorithms (especially in breast cancer diagnosis) may restrict the current performance of those models. In addition, the common claim that some of these models have already achieved a diagnostic performance similar to human experts could give clinicians a false sense of security when using image algorithms that have not yet been (prospectively) compared against clinical routine performance. Moreover, AI-based algorithms in the field of breast imaging are often compared to the categorical BI-RADS assessment. As AI-based algorithms produce a continuous risk of malignancy as output, this may inherently lead to a higher performance when comparing AI-based algorithms with BI-RADS categories. In our study, a continuous likelihood score of malignancy was assigned for all patients in addition to the BI-RADS score. While this approach is not validated and may still lead to bias towards higher performance for AI-based algorithms, it may enable a fairer comparison between AI-based algorithms and the BI-RADS categories assigned by humans.
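The effect of comparing a continuous score with a categorical one can be seen in a toy example. The scores below are invented and the category cut-offs are the BI-RADS bands quoted in the Methods. The sketch only shows that coarsening creates ties; it does not show that coarsening always lowers the AUC, since ties can in principle also remove wrongly ordered pairs.

```python
def auc(scores_pos, scores_neg):
    """Mann-Whitney estimate of the AUC, counting ties as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def to_category(risk):
    """Coarsen a risk into an ordinal BI-RADS-like category (3, 4 or 5)."""
    if risk <= 0.02:
        return 3
    if risk <= 0.95:
        return 4
    return 5


# Invented likelihood scores for four malignant and four benign masses.
malignant = [0.97, 0.60, 0.30, 0.10]
benign = [0.50, 0.20, 0.05, 0.01]

auc_continuous = auc(malignant, benign)
auc_categorical = auc([to_category(r) for r in malignant],
                      [to_category(r) for r in benign])
print(auc_continuous, auc_categorical)
```

In this toy case most masses fall into the broad middle category, so many malignant-benign pairs become ties and the categorical AUC is lower than the continuous one even though both derive from the same underlying scores.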

Second, our multi-modal ultrasound ML algorithms were trained on image features as well as clinical and demographic information, but the amount of documented, explainable information was limited to ultrasound features, patient age, and clinically suspicious palpability. Further work is warranted to develop (more reliable) diagnostic models which combine imaging with comprehensive clinical and demographic data to fully represent the clinical reality (see commonly considered variables according to the National Comprehensive Cancer Network guideline for breast cancer screening [5]). Moreover, current research evaluates the feasibility of automated breast ultrasound and its combination with digital breast tomosynthesis which may further advance (automated) multi-modal breast image analysis in the future [30, 31].

Third, relying on traditional group-level associations may contribute to the ongoing discussion about high false-positive rates in breast diagnosis [32], which was also observed in our study (46% specificity for the clinical routine assessment in the whole cohort, Table 5). Individualized predictions by complex risk models may help improve diagnostic accuracy to avoid physical and psychological distress for patients and reduce treatment burden for providers and healthcare systems.

Fourth, algorithms for medical image analysis or classification are developed either by using hand-crafted image features that are analyzed by ML algorithms or by using deep learning techniques that do not require prior feature extraction [1]. In our study, we used the first approach. Although deep learning techniques have shown great potential for automated medical image analysis in the past decade, they commonly do not outperform humans in image detection or classification tasks [3]. In fact, for some classification tasks, analyzing hand-crafted image features proved superior to deep learning approaches in small- to medium-sized datasets [33]. As the aim of our present analysis was to demonstrate the inherent limitations of developing AI-based algorithms on unimodal instead of multi-modal information and of comparing their performance against image readers instead of clinical routine decisions, we do not expect the choice of feature-based machine learning instead of deep learning algorithms to limit our findings.

Fifth, our ultrasound experts performed a second read of all ultrasound images instead of performing the examination themselves. Although some evidence suggests that interpreting static images rather than dynamic videos does not impair diagnostic performance, this may have introduced some bias in our study and may have influenced the performance of the ultrasound experts [34].

Conclusions

We show that beyond-human performance on imaging classification tasks does not necessarily yield state-of-the-art diagnostic decisions when compared to physicians who can evaluate multiple imaging sources alongside other relevant demographic and clinical information. AI-based algorithms that are not developed on multi-modal routine information (imaging, demographic, and clinical information) and that are subsequently not compared to the performance of this clinical routine may not represent state-of-the-art diagnostic performance. Confidence in AI-based algorithms that rely on solely one imaging modality may result in a misleading sense of security among clinicians. Further work is warranted to develop and evaluate individualized diagnostic models which combine imaging with comprehensive clinical and demographic data to better represent the diagnostic pathway of routine clinical breast diagnosis.

Abbreviations

ACR: American College of Radiology
AUC: Area under the curve
BI-RADS: Breast Imaging Reporting and Data System
CI: Confidence interval
LR: Logistic regression
ML: Machine learning
ROC: Receiver-operating characteristic
XGBoost Tree: Extreme Gradient Boosting Tree

References

1. Hosny A, Parmar C, Quackenbush J, Schwartz LH, Aerts HJWL (2018) Artificial intelligence in radiology. Nat Rev Cancer 18(8):500–510. https://doi.org/10.1038/s41568-018-0016-5
2. McDonald RJ, Schwartz KM, Eckel LJ et al (2015) The effects of changes in utilization and technological advancements of cross-sectional imaging on radiologist workload. Acad Radiol 22(9):1191–1198. https://doi.org/10.1016/j.acra.2015.05.007
3. Liu X, Faes L, Kale AU et al (2019) A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health 1(6):e271–e297. https://doi.org/10.1016/S2589-7500(19)30123-2
4. American College of Radiology. Subject: (Docket No. FDA-2019-N-5592) “Public Workshop - Evolving Role of Artificial Intelligence in Radiological Imaging;” Comments of the American College of Radiology. https://www.acr.org/-/media/ACR/NOINDEX/Advocacy/acr_rsna_comments_fda-ai-evolvingrole-ws_6-30-2020.pdf. Published 2020. Accessed 3 Apr 2021
5. National Comprehensive Cancer Network (2020) Breast cancer screening and diagnosis. Harborside Press
6. Wöckel A, Festl J, Stüber T et al (2018) Interdisciplinary screening, diagnosis, therapy and follow-up of breast cancer. Guideline of the DGGG and the DKG (S3-Level, AWMF Registry Number 032/045OL, December 2017) - Part 1 with Recommendations for the Screening, Diagnosis and Therapy of Breast Cancer. Geburtshilfe Frauenheilkd 78(10):927–948. https://doi.org/10.1055/a-0646-4522
7. Yang L, Wang S, Zhang L et al (2020) Performance of ultrasonography screening for breast cancer: a systematic review and meta-analysis. BMC Cancer 20(1):499. https://doi.org/10.1186/s12885-020-06992-1
8. Golatta M, Pfob A, Büsch C et al (2021) The potential of shear wave elastography to reduce unnecessary biopsies in breast cancer diagnosis: an international, diagnostic, multicenter trial. Ultraschall Med. https://doi.org/10.1055/A-1543-6156
9. Liu Y, Chen PHC, Krause J, Peng L (2020) How to read articles that use machine learning: users’ guides to the medical literature. JAMA 322(18):1806–1816. https://doi.org/10.1001/jama.2019.16489
10. Cohen JF, Korevaar DA, Altman DG et al (2016) STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open 6(11):e012799. https://doi.org/10.1136/bmjopen-2016-012799
11. Collins GS, Reitsma JB, Altman DG, Moons KGM (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med 162(1):55–63. https://doi.org/10.7326/M14-0697
12. Sidey-Gibbons JAM, Sidey-Gibbons CJ (2019) Machine learning in medicine: a practical introduction. BMC Med Res Methodol 19(1):1–18. https://doi.org/10.1186/s12874-019-0681-4
13. Pfob A, Sidey-Gibbons C, Lee H-B et al (2021) Identification of breast cancer patients with pathologic complete response in the breast after neoadjuvant systemic treatment by an intelligent vacuum-assisted biopsy. Eur J Cancer 143:134–146. https://doi.org/10.1016/j.ejca.2020.11.00
14. Pfob A, Mehrara BJ, Nelson JA, Wilkins EG, Pusic AL, Sidey-Gibbons C (2021) Towards patient-centered decision-making in breast cancer surgery. Ann Surg. https://doi.org/10.1097/SLA.0000000000004862
15. Sidey-Gibbons C, Pfob A, Asaad M et al (2021) Development of machine learning algorithms for the prediction of financial toxicity in localized breast cancer following surgical treatment. JCO Clin Cancer Inform 5(5):338–347. https://doi.org/10.1200/CCI.20.00088
16. Tibshirani R (1997) The lasso method for variable selection in the Cox model. Stat Med 16(4):385–395. https://doi.org/10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3
17. Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
18. Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232. https://doi.org/10.1214/aos/1013203451
19. Harrell FE, Lee KL, Mark DB (1996) Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 15(4):361–387. https://doi.org/10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4
20. Spiegelhalter DJ (1986) Probabilistic prediction in patient management and clinical trials. Stat Med 5(5):421–433. https://doi.org/10.1002/sim.4780050506
21. Gargouri Ben Ayed N, Dammak Masmoudi A, Sellami D, Abid R (2015) New developments in the diagnostic procedures to reduce prospective biopsies breast. In: 2015 International Conference on Advances in Biomedical Engineering, ICABME 2015. Institute of Electrical and Electronics Engineers Inc.; 2015:205–208. https://doi.org/10.1109/ICABME.2015.7323288
22. Becker AS, Mueller M, Stoffel E, Marcon M, Ghafoor S, Boss A (2018) Classification of breast cancer in ultrasound imaging using a generic deep learning analysis software: a pilot study. Br J Radiol 91(1083). https://doi.org/10.1259/bjr.20170576
23. Lin CM, Hou YL, Chen TY, Chen KH (2014) Breast nodules computer-aided diagnostic system design using fuzzy cerebellar model neural networks. IEEE Trans Fuzzy Syst 22(3):693–699. https://doi.org/10.1109/TFUZZ.2013.2269149
Article Google Scholar
Kim SM, Han H, Park JM et al (2012) A comparison of logistic regression analysis and an artificial neural network using the BI-RADS lexicon for ultrasonography in conjunction with introbserver variability. J Digit Imaging 25(5):599–606. https://doi.org/10.1007/s10278-012-9457-7
Article PubMed PubMed Central Google Scholar
Fujioka T, Kubota K, Mori M et al (2019) Distinction between benign and malignant breast masses at breast ultrasound using deep learning method with convolutional neural network. Jpn J Radiol. 37(6):466–472. https://doi.org/10.1007/s11604-019-00831-5
Article PubMed Google Scholar
Choi JS, Han BK, Ko ES et al (2019) Effect of a deep learning framework-based computer-aided diagnosis system on the diagnostic performance of radiologists in differentiating between malignant and benign masses on breast ultrasonography. Korean J Radiol 20(5):749–758. https://doi.org/10.3348/kjr.2018.0530
Article PubMed PubMed Central Google Scholar
Byra M, Galperin M, Ojeda-Fournier H et al (2019) Breast mass classification in sonography with transfer learning using a deep convolutional neural network and color conversion. Med Phys 46(2):746–755. https://doi.org/10.1002/mp.13361
Article PubMed Google Scholar
Becker AS, Marcon M, Ghafoor S, Wurnig MC, Frauenfelder T, Boss A (2017) Deep learning in mammography diagnostic accuracy of a multipurpose image analysis software in the detection of breast cancer. Invest Radiol 52(7):434–440. https://doi.org/10.1097/RLI.0000000000000358
Article PubMed Google Scholar
Stoffel E, Becker AS, Wurnig MC et al (2018) Distinction between phyllodes tumor and fibroadenoma in breast ultrasound using deep learning image analysis. Eur J Radiol Open 5:165–170. https://doi.org/10.1016/j.ejro.2018.09.002
Article PubMed PubMed Central Google Scholar
Golatta M, Franz D, Harcos A et al (2013) Interobserver reliability of automated breast volume scanner (ABVS) interpretation and agreement of ABVS findings with hand held breast ultrasound (HHUS), mammography and pathology results. Eur J Radiol 82(8):e332–e336. https://doi.org/10.1016/j.ejrad.2013.03.005
Article PubMed Google Scholar
Schäfgen B, Juskic M, Radicke M et al (2020) Evaluation of the FUSION-X-US-II prototype to combine automated breast ultrasound and tomosynthesis. Eur Radiol. https://doi.org/10.1007/s00330-020-07573-3
Article PubMed PubMed Central Google Scholar
Le MT, Mothersill CE, Seymour CB, Mcneill FE (2016) Is the false-positive rate inmammography in North America too high? Br J Radiol. 89(1065):20160045. https://doi.org/10.1259/bjr.20160045
Article PubMed PubMed Central Google Scholar
Lin W, Hasenstab K, Moura Cunha G, Schwartzman A (2020) Comparison of handcrafted features and convolutional neural networks for liver MR image adequacy assessment. Sci Rep 10(1):1–11. https://doi.org/10.1038/s41598-020-77264-y
Article CAS Google Scholar
Youk JH, Jung I, Yoon JH, et al. Comparison of inter-observer variability and diagnostic performance of the Fifth Edition of BI-RADS for breast ultrasound of static versus video images. Ultrasound Med Biol. 2016;42(9):2083–2088. https://doi.org/10.1016/j.ultrasmedbio.2016.05.006

Download references

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

University Breast Unit, Department of Obstetrics and Gynecology, Heidelberg University Hospital, Im Neuenheimer Feld 440, 69120, Heidelberg, Germany
André Pfob, Sarah Fastner, Christina Gomez, André Hennigs, Juliane Nees, Fabian Riedel, Benedikt Schaefgen, Anne Stieber, Riku Togawa, Joerg Heil & Michael Golatta
MD Anderson Center for INSPiRED Cancer Care (Integrated Systems for Patient-Reported Data), The University of Texas MD Anderson Cancer Center, Houston, TX, USA
André Pfob, Chris Sidey-Gibbons, Sheng-Chieh Lu & Cai Xu
Department of Symptom Research, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
Chris Sidey-Gibbons, Sheng-Chieh Lu & Cai Xu
Department of Radiology, Northeast Ohio Medical University, Ravenna, OH, USA
Richard G. Barr
Department of Gynecology and Obstetrics, University of Marburg, Marburg, Germany
Volker Duda
Department of Gynecology and Obstetrics, University of Greifswald, Greifswald, Germany
Zaher Alwafai & Ralf Ohlinger
Department of Radiology, Institut Gustave Roussy, Villejuif Cedex, France
Corinne Balleyguier
Department of Radiology, University Hospital Munich-Grosshadern, Munich, Germany
Dirk-André Clevert
Department of Radiology, University of Coimbra, Coimbra, Portugal
Manuela Goncalo
Department of Gynecology and Obstetrics, University of Tuebingen, Tuebingen, Germany
Ines Gruber & Markus Hahn
Department of Biomedical Imaging and Image-Guided Therapy, Medical University of Vienna, Vienna, Austria
Panagiotis Kapetas
Department of Radiology, Jeroen Bosch Hospital, ’s-Hertogenbosch, The Netherlands
Matthieu Rutten
Radboud University Medical Center, Nijmegen, The Netherlands
Matthieu Rutten
National Center for Tumor Diseases, Heidelberg University Hospital, Heidelberg, Germany
Maximilian Schuessler
Department of Radiology, Sagara Hospital, Kagoshima, Japan
Mitsuhiro Tozaki
Department of Gynecology and Obstetrics, Breast Cancer Center, Klinikum Bielefeld Mitte GmbH, Bielefeld, Germany
Sebastian Wojcinski
Institute of Biometry and Clinical Epidemiology, Charité – Universitätsmedizin Berlin, Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany
Geraldine Rauch

Corresponding author

Correspondence to Michael Golatta.

Ethics declarations

Guarantor

The scientific guarantor of this publication is Michael Golatta.

Conflict of interest

The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.

Statistics and biometry

Several of the authors have significant statistical expertise (Geraldine Rauch and Chris Sidey-Gibbons).

Informed consent

Written informed consent was obtained from all subjects (patients) in this study.

Ethical approval

Institutional Review Board approval was obtained.

Study subjects or cohorts overlap

Some study subjects or cohorts have been previously reported in:

Golatta et al: “The Potential of Shear Wave Elastography to reduce unnecessary Biopsies in Breast Cancer Diagnosis: An international, diagnostic, multicenter Trial.” Accepted June 28th, 2021.

Methodology

• prospective
• diagnostic or prognostic study
• multicenter study

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 83 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Pfob, A., Sidey-Gibbons, C., Barr, R.G. et al. The importance of multi-modal imaging and clinical information for humans and AI-based algorithms to classify breast masses (INSPiRED 003): an international, multicenter analysis. Eur Radiol 32, 4101–4115 (2022). https://doi.org/10.1007/s00330-021-08519-z

Received: 08 July 2021
Revised: 14 September 2021
Accepted: 17 October 2021
Published: 17 February 2022
Issue Date: June 2022
DOI: https://doi.org/10.1007/s00330-021-08519-z
