Background

Artificial intelligence (AI) is a broad term referring to the field of computer science that develops algorithms mimicking human cognitive functions such as learning, perception, problem-solving and decision-making. AI encompasses various approaches, including machine learning (ML) and deep learning. It comprises a range of technologies and techniques, including algorithmic decision-making (ADM) ([9]: 1). ADM refers to the process of using these algorithms to gather, process, model and use input data to make or support decisions. Feedback from these decisions can then be used to improve the system ([2]: 612). An ADM system can take various forms depending on how it is framed and presented to the user or decision subject. It may be a simple algorithm that has been known and used for decades, such as a classification tree [37], or a more complex system, such as a recommender or an AI that provides recommendations to human decision-makers, nudges its users in a certain direction or performs fully automated decision-making without human involvement ([2]: 613). We specify AI-related algorithmic decision-making systems (AI-related ADM) as decision support systems that either apply AI (relying on ML models) or have been developed with the help of AI.

Recent advances in AI have resulted in the development of more complex and sophisticated systems that can outperform humans in certain tasks. For example, in the field of computer vision, systems like DeepMind’s AlphaFold have revolutionised protein structure prediction, solving a decades-old challenge in biology by accurately predicting 3D protein structures [18]. Additionally, AI innovations have transformed financial services, with machine learning models now being used to predict market trends, optimise trading strategies and enhance fraud detection [12]. Furthermore, generative AI has demonstrated remarkable capabilities in generating human-like text and performing a wide range of language-related tasks with unprecedented accuracy [13]. Recently, ChatGPT was evaluated for its clinical reasoning ability by testing its performance on questions from the United States Medical Licensing Examination, where it scored at or near the passing threshold on all three exams without any special training or reinforcement [21].

These advances in AI appear to have enormous potential to transform many fields and industries, which raises the question: will AI do the same in healthcare?

In clinical trials, AI systems have already shown potential to help clinicians make better diagnoses [3, 22], help personalise medicine and monitor patient care [6, 16] and contribute to drug development [7]. However, successful application in practice remains limited ([30]: 77), and our work aims to reveal potential issues responsible for this gap between research and practice.

By searching PubMed for the term ‘artificial intelligence’, we found over 2000 systematic reviews and meta-analyses published in the last 10 years, with a yearly increasing trend. These include several reviews conducted in the area of AI in healthcare that provide an overview of the current state of AI technologies in specific clinical areas, including AI systems for breast cancer diagnosis in screening programmes [8], ovarian cancer [38], early detection of skin cancer [17], COVID-19 and other pneumonia [15], prediction of preterm birth [1] or diabetes management [19]. Other reviews have focused on comparing clinicians and AI systems in terms of their performance to show their capabilities in a clinical setting [24, 27, 34].

Although these reviews are crucial to the further development of AI systems, they offer little insight into whether patients actually benefit from their use by medical professionals. Indeed, these studies focus on the analytical performance of these systems rather than on healthcare-related metrics. In most of the studies mentioned here, the underlying algorithms have been evaluated using a variety of parameters, such as the F1 score, balanced accuracy, the false positive rate and the area under the receiver operating characteristic curve (AUROC). However, measures of a system’s accuracy often yield non-replicable results ([25]: 4) and do not necessarily indicate clinical efficiency ([20]: 1); the AUROC does not necessarily indicate clinical applicability ([10]: 935); and, in fact, none of these measures reflects beneficial change in patient care ([4]: 1727, [33]: 1).
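To make the distinction concrete, the following minimal Python sketch (with invented confusion-matrix counts, purely for illustration) computes two of the analytical metrics named above. Both values can look excellent while saying nothing about beneficial change in patient care.

```python
# Illustrative only: hypothetical confusion-matrix counts for an
# imagined diagnostic classifier (not taken from any cited study).

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def balanced_accuracy(tp, fp, fn, tn):
    """Mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# 80 true positives, 20 false positives, 10 false negatives,
# 890 true negatives.
tp, fp, fn, tn = 80, 20, 10, 890

print(f"F1 score:          {f1_score(tp, fp, fn):.3f}")
print(f"Balanced accuracy: {balanced_accuracy(tp, fp, fn, tn):.3f}")
```

Even a classifier with such high analytical scores may leave mortality, length of stay or quality of life unchanged, which is precisely why this review restricts itself to patient-relevant outcomes.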

To summarise, as with any other new technology introduced into healthcare, the clinical effectiveness and safety of AI compared to the standard of care must be evaluated through properly designed studies to ensure patient safety and maximise benefits while minimising any unintended harm ([31]: 328). Therefore, a critical analysis of patient-relevant outcomes is needed, especially the benefits and harms of decisions informed by or made by AI systems.

To this end, this review goes beyond previous studies in several ways. First, we study clinical AI systems that enable algorithmic decision-making (AI-related ADM) in general and therefore do not limit ourselves to selected clinical problems. In particular, we focus on machine learning systems that infer rules from observations. Although we omit rule-based systems, we apply the term AI throughout our work because it is often incorrectly and redundantly used for ML and deep learning in the literature we study. Second, we focus on studies that report patient-relevant outcomes that, according to the German Institute for Quality and Efficiency in Healthcare ([14]: 44), describe how patients feel, how they can perform their functions and activities or whether they survive. These may include, for example, mortality, morbidity (with regard to complaints and complications), length of hospital stay, readmission, time to intervention and health-related quality of life. Third, we focus only on studies that compare medical professionals supported by AI-related ADM systems with medical professionals without AI-related ADM systems (standard care). By doing so, this review provides an overview of the current literature on clinical AI-related ADM systems, summarises the empirical evidence on their benefits and harms for patients and highlights research gaps that need to be addressed in future studies.

Objectives

The aim of this review is to systematically assess the current evidence on patient-relevant benefits and harms of ADM systems which are developed or used with AI (AI-related ADM) to support medical professionals compared to medical professionals without this support (standard care).

  1. Are there studies that compare the patient-relevant effectiveness of medical professionals supported by AI-related ADM with that of medical professionals without AI-related ADM?

  2. Do these studies show adequate methodological quality, and are their findings generalisable?

  3. Can AI-related ADM systems help medical professionals make better decisions in terms of benefits and harms for patients?

Methods/design

In accordance with the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P) statement [26], the study protocol for this systematic review is registered on the International Prospective Register of Systematic Reviews (PROSPERO) database (CRD42023412156). If necessary, post-registration changes to the protocol will be detailed under the PROSPERO record with an accompanying rationale.

We will follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [29] and the Methodological Expectations of Cochrane Intervention Reviews (MECIR) standards [11].

Searches

We will search systematically using English free text terms in title/abstract, Medical Subject Headings (MeSH) terms and Embase Subject Headings (Emtree) fields for various forms of keywords related to ‘artificial intelligence’ and relevant subcategories of computer generated and processed decision-making algorithms, ‘medical professionals’ and keywords describing effectiveness parameters and outcomes as well as preferred study types. Based on the block building approach, keywords and terms are combined using the Boolean operators AND and OR and progressively checked for relevant hits.
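As a rough illustration of the block building approach, the Python sketch below OR-combines keywords within each block and AND-combines the blocks. The terms and field tags shown are invented placeholders, not the actual search string, which is reported in Table 2.

```python
# Placeholder keyword blocks; the real blocks are given in Table 2.
blocks = {
    "intervention": ["artificial intelligence", "machine learning",
                     "clinical decision support"],
    "population":   ["physician*", "clinician*", "nurse*"],
    "outcomes":     ["mortality", "morbidity", "quality of life"],
}

def build_query(blocks):
    """OR within each block, AND across blocks (block building)."""
    groups = [
        "(" + " OR ".join(f'"{term}"[tiab]' for term in terms) + ")"
        for terms in blocks.values()
    ]
    return " AND ".join(groups)

print(build_query(blocks))
```

Checking each block's hits separately before AND-combining them, as described above, makes it easier to see which block is over- or under-restricting the result set.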

Databases to be used for searches

MEDLINE and PubMed (via PubMed), Embase (via Elsevier) and Institute of Electrical and Electronics Engineers (IEEE) Xplore will be searched for peer-reviewed articles as well as ClinicalTrials.gov and ICTRP (via CENTRAL) for ongoing trials and protocols.

To reduce potential publication bias, additional studies will be identified by contacting authors of included studies, contacting experts in the field and screening the reference lists of relevant studies. Grey literature searches will be conducted in Google Scholar, using the keywords from the systematic search in different combinations, as well as their German equivalents. Google Scholar will be searched up to the 10th page of hits. The detailed search strategy for each database will be reported under the PROSPERO record once the searches have been conducted.

Search strategy

We developed our search strategy using the PICOS scheme (Table 1).

Table 1 PICOS scheme

While doing preliminary searches for basic literature in MEDLINE and PubMed (via PubMed), we noticed that study authors from different scientific fields (e.g. computer scientists) used different terms for the intervention outcomes we were looking for. In addition, some studies were not indexed appropriately in PubMed, which complicated our initial search strategy. We therefore created and tested the blocks consecutively, expanding and narrowing the search strategy to gather the best results from each block. To verify the direction of the search strategy, we used fundamental literature, such as Choudhury and Asan [5], Park et al. [31] and Nagendran et al. [27], as test sets, making sure the results of our search had common ground with these studies.

The resulting search string for MEDLINE and PubMed, broken into individual blocks, can be found in Table 2 and forms the basis for the other databases.

Table 2 Search string blocks for MEDLINE and PubMed (via PubMed)

Types of studies to be included

For the systematic search, peer-reviewed interventional and observational studies published in German or English within the 10 years preceding the date of the search will be considered. For the grey literature search, scientific reports published in German or English within the same period will be considered. To extract potentially relevant studies from (systematic) reviews and meta-analyses, secondary studies will be gathered and screened; however, secondary studies will not be included in the synthesis.

In contrast to studies of effectiveness and safety, pure efficacy studies (e.g. those focusing on algorithm accuracy) will be excluded, as these outcomes are not directly relevant for patients. Patient-relevant outcomes will be defined according to the IQEHC method paper [14]. In addition, studies that use AI systems beyond our scope, such as robotics (systems that support the implementation of decisions), will be excluded. Editorials, commentaries, letters and other informal publication types will be excluded as well.

We will provide a list of all references screened in full text including exclusion reasons in the appendix of the final study.

Participants

Our study focuses on human patients without restriction of age or sex. Therefore, the input data for the algorithms must include real human data, gathered either during routine care and saved for use in research or generated specifically for the individual study.

Intervention

Our study focuses on medical professionals utilising an AI-related ADM system to address a clinical problem.

In our working definition, a medical professional is a qualified individual who has the authority to perform necessary medical procedures within their professional scope of practice. Their goal is to improve, maintain or restore the health of individuals by examining, diagnosing, prognosticating and/or treating clinical problems. This may include medical doctors, registered nurses and other medical professionals. Clinical problems can encompass illnesses, injuries and physical or mental disorders, among other conditions.

In our working definition, an AI-related ADM system is a clinical decision support system that either applies AI in the sense of machine learning (ML, excluding rule-based systems) or has been developed with the help of ML. Clinical decision support models without any involvement of AI will be excluded.

Control

Medical professionals, as described in the working definition, address a clinical problem without the support of an AI-related ADM system (standard care).

Outcomes

Patient-relevant benefits and harms, according to the IQEHC method paper [14], will be gathered. These may include, for example, mortality, morbidity (with regard to complaints and complications), length of hospital stay, readmission, time to intervention and health-related quality of life.

Study types

We will collect both interventional and observational studies, which may encompass randomised controlled trials, cohort studies, case–control studies, randomised surveys, retrospective and prospective studies and phase studies, as well as non-inferiority or diagnostic studies.

Data extraction

Records arising from the literature search will be stored in the citation manager Citavi 6 (Swiss Academic Software). After removing duplicates, two reviewers will independently review all titles and abstracts via the browser application Rayyan [28]. Studies potentially meeting the inclusion criteria will then be screened in full text independently by two reviewers using Citavi 6. Disagreements over the eligibility of studies will be discussed and, if necessary, resolved by a third reviewer. Authors of the included studies will be contacted if clarification of their data or study methods is required. The PRISMA 2020 flow diagram [29] will be used to keep the study selection process transparent.

Using a standardised data collection form, two reviewers will extract data independently from the included studies and compare them for discrepancies. Missing data will be requested from study authors. Extracted data will include the country in which the study was conducted, setting, study design, observational period, patient-relevant outcomes, intervention, comparator, characteristics of the patient and medical professional populations and characteristics of the algorithm used. Additionally, studies will be classified by type of system, medical specialty or clinical area, prediction or classification goal of the AI-related ADM, supported decision, investigated benefits and harms, private or public study funding, applicable regulation (e.g. FDA, MDR), medical device classification (based on the risk and nature of the product) and whether the product is commercially available in its respective class (Table 3).

Table 3 Study data to be extracted
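The extraction items listed above can be sketched as a simple record structure. The following minimal Python illustration of the standardised data collection form is hypothetical; the field names and the example values are assumptions, not the form actually used in the review.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical sketch of one row of the standardised data collection
# form; field names mirror the items listed in the text.
@dataclass
class ExtractionRecord:
    study_id: str
    country: str
    setting: str
    study_design: str
    observational_period: str
    intervention: str
    comparator: str
    patient_relevant_outcomes: list = field(default_factory=list)
    system_type: str = ""
    medical_specialty: str = ""
    funding: str = ""               # private or public
    regulation: str = ""            # e.g. FDA, MDR
    device_classification: str = ""
    commercially_available: bool = False

# Invented example entry, for illustration only.
record = ExtractionRecord(
    study_id="example-study",
    country="DE",
    setting="tertiary hospital",
    study_design="RCT",
    observational_period="2019-2021",
    intervention="clinicians supported by an AI-related ADM system",
    comparator="clinicians without AI-related ADM (standard care)",
    patient_relevant_outcomes=["mortality", "length of hospital stay"],
)
print(asdict(record)["comparator"])
```

A fixed schema of this kind makes it straightforward for the two reviewers to compare their independent extractions field by field for discrepancies.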

Risk of bias and quality assessment

Risk of bias will be assessed using the revised Cochrane risk-of-bias tool for randomised trials (RoB 2) [36] and the Risk Of Bias In Non-randomised Studies of Interventions (ROBINS-I) tool [35]. Disagreements between the authors over the risk of bias in the included studies will be resolved by discussion or, if necessary, with the involvement of a third author. Transparent reporting of the included studies will be assessed through the Consolidated Standards of Reporting Trials extension for interventions involving Artificial Intelligence (CONSORT-AI) by Liu et al. [23]. The CONSORT-AI extension includes 14 new items considered sufficiently important for AI interventions to be routinely reported in addition to the core CONSORT items by Schulz et al. [32]. CONSORT-AI aims to improve transparency and completeness in the reporting of clinical trials of AI interventions. It will assist in understanding, interpreting and critically appraising the quality of clinical trial design and the risk of bias in the reported outcomes. We will assess studies conducted prior to the introduction of the CONSORT-AI guidelines in 2020 against these standards where possible. Although these studies may not fully meet the new criteria, applying the guidelines may still identify potential reporting gaps and ensures a consistent assessment framework across studies. We will discuss limitations related to this retrospective application to ensure a balanced and comprehensive analysis.

Data synthesis

Given the expected heterogeneity between studies across medical specialties in terms of outcome measures, study designs and interventions, we do not know whether performing a meta-analysis will be possible. However, a systematic narrative synthesis of the results will be provided, with an overview of the relevant effects for the outcomes and with information presented in text and tables to summarise and explain the characteristics and findings of the included studies. We will analyse the geographic distribution, study settings and medical specialties of the included studies. Additionally, we will examine funding sources and conduct a detailed risk of bias assessment. Compliance with reporting standards, such as CONSORT-AI and TRIPOD-AI, will be evaluated. We also plan to analyse patient demographics, including age, sex and race/ethnicity, as well as the involvement and training of medical professionals. ADM systems will be categorised by applicable regulation (e.g. FDA, MDR), medical device classification (based on the risk and nature of the product) and whether the product is commercially available in its respective class. Outcome analyses will focus on assessing both benefits and harms. Furthermore, we will analyse the validation of algorithms, considering both internal and external validation, and review the data availability statements to evaluate the accessibility of the data used for algorithm development. Studies with an unclear or high risk of bias will not be excluded, to avoid potential selection bias and to ensure that valuable findings, particularly in emerging areas, are not lost. By including them while clearly acknowledging and discussing their limitations, we aim to provide a more comprehensive overview of the available evidence. For this reason, our narrative synthesis will emphasise the qualitative aspects of the data and focus on identifying and describing trends, patterns and inconsistencies in the studies, rather than attempting to quantify effect sizes. This is consistent with the approach of recent reviews examining the methodological quality of machine learning systems in clinical settings (e.g. [27]).

Discussion

We expect a significant lack of suitable studies comparing healthcare professionals with and without AI-related ADM systems with regard to patient-relevant outcomes. We assume this is due to, first, the lack of approval regulations for AI systems; second, the prioritisation of technical and clinical parameters over patient-relevant outcomes in the design of studies; and, third, the prioritisation of AI for supporting clinical processes (e.g. administration). In addition, we expect that a large proportion of the identified studies will be of rather poor methodological quality and will provide results that are difficult to generalise. Although reporting guidelines such as the Consolidated Standards of Reporting Trials (CONSORT) statement [32] are well known and widely used in medical and public health research, they do not necessarily correspond to the novel protocol and study designs relevant for assessing the research questions at hand. The CONSORT extension for interventions involving artificial intelligence (CONSORT-AI) [23] may fill this gap, but this guideline is relatively new and not yet consistently applied.