1 Introduction

Passing the United States Medical Licensing Examination (USMLE) is mandatory for practicing medicine in the United States. The exam is a thorough assessment of physicians' knowledge and skills, providing a standardized measure of competence for both domestic and international medical graduates [1,2,3,4]. As such, the USMLE has become a critical benchmark for medical education and is increasingly used in research as a standard for testing the capabilities of various healthcare-focused artificial intelligence (AI) tools.

Advancements in natural language processing (NLP) have led to the development of large language models (LLMs) such as GPT-3 and GPT-4. These models can generate human-like text and perform various complex NLP tasks. LLMs are increasingly studied in healthcare for different applications, including aiding diagnosis, streamlining administrative tasks, and enhancing medical education [5,6,7,8]. It is critical to assess the performance of these models in a standardized manner, specifically in specialized fields like medicine [6, 9].

Research has extensively assessed the performance of LLMs, particularly GPT, across a variety of medical examinations. For instance, studies have evaluated these models on specialty exams such as the American Board of Family Medicine annual In-Training Exam (ITE) [10], as well as American Board exams in a variety of specialties such as cardiothoracic surgery [11], rhinology [12], anesthesiology [13] and orthopedics [14]. In each of these contexts, GPT-4 has demonstrated impressive results, often achieving high accuracy rates comparable to those of human test-takers.

Beyond the United States, LLMs have been evaluated in various international contexts. In the United Kingdom, GPT models were evaluated against the Membership of the Royal Colleges of Physicians (MRCP) written exams, where they showed potential in clinical decision-making [15, 16]. Studies in Germany [17], China [18], and Japan [19, 20] have similarly tested these models against respective national medical licensing exams, with findings often indicating that LLMs can perform at or near the level of medical students, further emphasizing their global applicability and effectiveness.

Additionally, GPT models have been explored for their ability to generate medical exam questions, a testament to their adaptability and the broad spectrum of their applicational potential in medical education [21, 22].

Given the important role of the USMLE in assessing medical competence, understanding LLMs' abilities on this test offers valuable insights into their clinical reasoning, potential applications, and limitations in healthcare. Thus, the aim of our study was to systematically review the literature on the performance of GPT on official USMLE questions, analyze its clinical reasoning capabilities, and determine the impact of various prompting methodologies on outcomes.

In the following sections, we first outline the materials and methods used in this systematic review, detailing our literature search strategy and eligibility criteria. Next, we present the results, highlighting the accuracy of different GPT models across various USMLE steps. We then discuss the implications of these findings, considering the potential applications of GPT models in medical education and practice. Finally, we address the limitations of our study and propose directions for future research.

2 Materials and methods

2.1 Literature search

A systematic literature search was conducted for studies on GPT models’ performance on the USMLE.

We searched the PubMed/MEDLINE database for articles published up to December 2023. Search keywords included “USMLE”, “United States Medical Licensing Examination”, “ChatGPT”, “Large language models” and “OpenAI”. We also searched the reference lists of relevant studies for additional eligible publications. Figure 1 presents a flow diagram of the screening and inclusion process.

Fig. 1

Flow diagram of the search and inclusion process in the study. The study followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. USMLE = United States Medical Licensing Examination. LLMs = Large Language Models

2.2 Eligibility criteria

We included full publications in English that evaluated the performance of GPT models on official USMLE questions. We excluded papers that evaluated unofficial sources of USMLE-like questions (e.g., MedQA).

2.3 Data sets

The USMLE is a three-step exam composed of multiple-choice questions, designed to assess a physician’s knowledge and skills required for medical practice. Step 1 evaluates the examinee's understanding of basic science principles related to health, disease, and therapeutic mechanisms, with questions requiring data interpretation, including pathology and pharmacotherapy. Step 2 Clinical Knowledge (CK) assesses the application of medical knowledge and clinical science in patient care, with a focus on disease prevention and health promotion. Step 3 tests the ability to apply medical and clinical knowledge in the unsupervised practice of medicine, particularly in patient management in ambulatory settings [1].

There are two official sources for USMLE questions—the USMLE Sample exam, which is freely available [23], and the NBME Self-Assessment, available for purchase on the NBME website [24]. Both include questions for Steps 1, 2CK and 3. The USMLE Sample exam includes 119, 120 and 137 questions for Steps 1, 2CK and 3, respectively. The NBME Self-Assessment includes 1197, 800 and 176 questions for Steps 1, 2CK and 3 [25].

2.4 Large language models

The large language models included in this study were all developed by OpenAI [26].

GPT-3 is an autoregressive model known for its ability to handle a variety of language tasks without extensive fine-tuning. GPT-3.5, a subsequent version, serves as the foundation for both ChatGPT and InstructGPT. ChatGPT is tailored to generate responses across diverse topics, while InstructGPT is designed to provide detailed answers to specific user prompts.

Both models, although sharing a foundational architecture, have been fine-tuned using different methodologies and datasets to cater to their respective purposes. GPT-4, though specifics are not fully disclosed, is recognized to have a larger scale than its predecessor GPT-3.5, indicating improvements in model parameters and training data scope [27,28,29].

2.5 Screening and synthesis

This review was reported according to the 2020 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [30].

3 Results

Data were extracted from six publications that evaluated the performance of GPT models on USMLE questions. The parameters evaluated in each publication are described in Table 1.

Table 1 Publications reporting on the performance of GPT models on USMLE questions

3.1 Large language models

When evaluated on USMLE questions, GPT-4 outperformed all other models, with accuracy rates of 80–88%, 81–89% and 81–90% on Step 1, Step 2CK and Step 3, respectively. When tested on 21 soft-skills questions drawn from the three Steps together, it achieved a 100% accuracy rate [31]. ChatGPT also performed relatively well, outperforming GPT-3, InstructGPT and GPT-3.5 with accuracy rates of 41–75%, 49–75% and 46–68% on Step 1, Step 2CK and Step 3, respectively (Tables 2, 3 and 4). On soft-skills questions, ChatGPT had an accuracy of 66.6%.

Table 2 GPT Models’ Accuracy (%) on USMLE Step 1
Table 3 GPT Models’ Accuracy (%) on USMLE Step 2CK
Table 4 GPT Models’ Accuracy (%) on USMLE Step 3

3.2 Questions with media elements

Some USMLE questions use media elements such as graphs, images, and charts (14.4% and 13% of questions in the Self-assessment and the Sample exam, respectively) [25]. While most studies excluded these questions [27, 32, 33], two studies included them in the evaluation. A collaborative study by Microsoft and OpenAI [25] found that while GPT-4 performs best on text-only questions, it still performs well on questions with media elements, with 68–79% accuracy, despite not being able to see the relevant images. This research was conducted before the release of the multimodal GPT-4 (GPT-4V), which can receive and analyze visual input [34]; it is reasonable to assume that the results might improve with GPT-4V. Another study reported a similar pattern with ChatGPT, showing better accuracy on text-only items than on items with non-text components [35].

3.3 Prompting methods

Different prompting methods were used to test the performance of the LLMs. Examples of prompts are shown in Fig. 2.

Fig. 2

Prompt templates used to assess USMLE questions. Elements between < > are replaced with question-specific data. USMLE = United States Medical Licensing Examination

Kung et al. [32] tested three prompting formats. In the Open-Ended format, answer choices were removed and variable lead-in interrogative phrases were added to mirror natural user queries. In the Multiple-Choice Single Answer without Forced Justification format, USMLE questions were reproduced verbatim. The third format, Multiple-Choice Single Answer with Forced Justification, required ChatGPT to provide a rationale for each answer choice.
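The exact templates appear in Fig. 2. As a rough illustration only, a minimal sketch of the three encodings might look like the following; the wording and helper functions are hypothetical and do not reproduce the prompts used by Kung et al.:

```python
# Hypothetical prompt templates illustrating the three encodings described by
# Kung et al.; the actual wording used in the study is shown in Fig. 2.

def open_ended(stem: str) -> str:
    """Open-Ended: answer choices removed, natural lead-in question appended."""
    return f"{stem}\n\nWhat is the most likely diagnosis, and what would you do next?"

def mc_single_answer(stem: str, choices: list[str]) -> str:
    """Multiple-Choice Single Answer without Forced Justification:
    the USMLE item is reproduced with its lettered answer choices."""
    lettered = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return f"{stem}\n{lettered}"

def mc_forced_justification(stem: str, choices: list[str]) -> str:
    """Multiple-Choice Single Answer with Forced Justification:
    the model must give a rationale for each answer choice."""
    return (mc_single_answer(stem, choices)
            + "\n\nChoose one answer and explain why each option is correct or incorrect.")
```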

Nori et al. [25] also used multiple-choice questions but tested both zero-shot and few-shot prompting. Zero-shot prompting requires the model to complete the task without any prior examples, using only the knowledge gained during pre-training. Few-shot prompting, by contrast, provides the model with a small number of examples of the task before execution; the model is expected to generalize from these examples and perform the task accurately.
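As a minimal sketch of the difference (the item formatting here is an illustrative assumption and does not reproduce Nori et al.'s prompts), a few-shot prompt simply prepends a handful of solved examples to the same question that would otherwise be asked zero-shot:

```python
# Sketch of zero-shot vs. few-shot prompt assembly for a multiple-choice item.
# The formatting is illustrative only.

def format_item(stem: str, choices: list[str], answer: str | None = None) -> str:
    """Render one item; include the answer only for solved (few-shot) examples."""
    lettered = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    block = f"Question: {stem}\n{lettered}\nAnswer:"
    return f"{block} {answer}" if answer is not None else block

def zero_shot_prompt(item: dict) -> str:
    # The model sees only the target question.
    return format_item(item["stem"], item["choices"])

def few_shot_prompt(examples: list[dict], item: dict) -> str:
    # k solved examples precede the target question (k = 5 in the 5-shot
    # condition reported above).
    solved = "\n\n".join(
        format_item(e["stem"], e["choices"], e["answer"]) for e in examples
    )
    return solved + "\n\n" + format_item(item["stem"], item["choices"])
```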

Overall, the performance of the models was slightly affected by the prompting method, with open-ended prompting showing better results than multiple choice, and 5-shot prompting giving better results than zero-shot.

3.4 Performance assessment

All papers assessed the accuracy of the models in answering USMLE questions. Two studies also included a qualitative assessment of the answers and explanations provided by the LLMs. Gilson et al. [27] assessed each answer for logical reasoning (identification of the logic behind the answer selection), use of information internal to the question (information provided directly within the question itself), and use of information external to the question (information not contained within the question). ChatGPT was reported to use information internal to the question in 97% of questions. The use of information external to the question was higher in correct answers (90–93%) than in incorrect answers (48–63%).

In addition, every incorrect answer was labeled with the reason for the error: a logical error (the response uses the pertinent information but does not translate it into the correct answer), an information error (the model did not identify the key information needed) or a statistical error (an arithmetic mistake).

Logical errors were the most common, found in 42% of incorrect answers [27]. Kung et al. [32] had two physician reviewers evaluate each output for concordance and insight. A high concordance of 94.6% was found across ChatGPT’s answers, and ChatGPT produced at least one significant insight in 88.9% of questions. The density of insight contained within the explanations was significantly higher in questions answered correctly than in those answered incorrectly.

3.5 Content-based evaluation

Two studies discussed performance on questions involving specific topics. Yaneva et al. [35] found that ChatGPT performed significantly worse on items relating to practice-based learning, including biostatistics, epidemiology, research ethics and regulatory issues.

Brin et al. [31] tested only questions involving communication skills, professionalism, and legal and ethical issues, and reported the superiority of GPT-4 on these topics.

3.6 Consistency

Several studies examined response consistency. Yaneva et al. reported intra-item inconsistency in ChatGPT, with variation noted in 20% of the USMLE Sample exam items when each question was replicated three times [35].

Brin et al. explored the models' tendency for self-revision by asking “are you sure?” after each response and found that ChatGPT altered 82.5% of its initial responses, whereas GPT-4 showed a 0% change rate [31]. Mihalache et al. entered the questions through two different internet browsers and observed consistent GPT-4 performance [33].

4 Discussion

This review provides a comparative analysis of GPT models’ performance on USMLE questions. While GPT-4 secured accuracy rates within the 80–90% range, ChatGPT also demonstrated competent results, outpacing the earlier models GPT-3, InstructGPT, and GPT-3.5.

The results of this review show that the main factor affecting performance is the inherent capability of the LLM. Other factors, including the prompting method, the inclusion of questions with media elements, and variability in question sets, played secondary roles.

This observation emphasizes the priority of advancing core AI model development to ensure better accuracy in complex sectors like healthcare.

Prompting is considered to play a significant role in shaping the performance of LLMs when answering queries [36, 37]. This review demonstrates that the way questions are structured can subtly influence the responses generated by these models. Notably, open-ended prompting has a slight edge over the standard multiple-choice format, suggesting that LLMs may process information somewhat differently depending on the context they are given. Moreover, the marginally better outcomes with 5-shot prompting compared with zero-shot hint at the LLMs’ capacity to adjust and produce informed answers when given a few guiding examples. Though these differences are subtle and may not dramatically change overall performance, they provide insight into how interactions with LLMs can be optimized.

The evaluation of AI consistency in medical knowledge assessment, as reflected in these studies, raises important questions about the reliability and application of AI in medical education and practice. The variability in ChatGPT’s responses, as seen in Yaneva et al.’s study [35], and its tendency to revise answers, noted by Brin et al. [31], point to inconsistency in AI-generated responses. Despite this inconsistency, Yaneva et al. note that ChatGPT appears equally confident whether its answer is correct or incorrect, limiting its use as a learning aid for medical students [35].

Looking at accuracy rates for Step 1 Sample exam questions, ChatGPT’s accuracy ranges from 36.1% to 69.6% [27, 32, 35]. The low rate of 36.1%, which falls below the passing level of 60%, was obtained when indeterminate responses were included in the calculation (described by the authors as responses in which the output is not one of the answer choices or states that not enough information is available) [32]. It is also noteworthy that different articles considered different numbers of questions to be text-only items, which affected the results. When tested on the same set of questions, there is less variability (66.7–69.9% accuracy for Step 1, text-only items, multiple-choice prompting) [35]. Meanwhile, GPT-4 showed higher consistency [31] and stable performance across different platforms [33]; under the same conditions described above, its accuracy rate is 80.7–88% [33, 35]. These findings emphasize the importance of continuous evaluation and improvement of AI tools to ensure not just accuracy but also stability and reliability, which are vital for their effective integration into medical training and assessment.
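For illustration only, with invented counts that are not data from any of the reviewed studies, the two reporting conventions differ simply in whether indeterminate outputs enter the denominator:

```python
# Illustrative arithmetic only: the counts below are invented and are not
# results from any of the reviewed studies.
correct, incorrect, indeterminate = 50, 40, 30  # hypothetical outputs on a 120-item set

# Indeterminate outputs counted as failures (they stay in the denominator).
acc_with_indeterminate = correct / (correct + incorrect + indeterminate)
# Indeterminate outputs dropped before computing accuracy.
acc_without_indeterminate = correct / (correct + incorrect)

print(f"included: {acc_with_indeterminate:.1%}")   # 41.7%
print(f"excluded: {acc_without_indeterminate:.1%}")  # 55.6%
```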

Two sets of official USMLE questions were used in the studies reviewed. Both GPT-3.5 and GPT-4 performed better on the Sample exam than on the Self-assessment. While the Sample exam is publicly accessible, the Self-assessment can only be obtained through purchase. This raises the possibility that the higher accuracy stems from the models’ prior exposure to Sample exam questions. Nori et al. [25] developed an algorithm to detect potential signs of data leakage or memorization effects, designed to ascertain whether specific data were likely included in a model’s training set. Notably, this method did not detect any evidence of training-data memorization in the official USMLE datasets, which include both the Self-assessment and the Sample exam. However, while the algorithm demonstrates high precision (positive predictive value, PPV), its recall (sensitivity) remains undetermined; therefore, the extent to which these models might have been exposed to the questions during training remains inconclusive.
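Although the details of Nori et al.'s algorithm are not reproduced here, a generic black-box memorization probe of this kind can be sketched as follows; the `generate` callable, the prefix split, and the similarity threshold are placeholder assumptions rather than the authors' implementation:

```python
# Generic black-box memorization probe (a sketch; not Nori et al.'s algorithm).
# Idea: feed the model the first part of an exam item and check whether its
# continuation reproduces the withheld remainder nearly verbatim, which would
# suggest the item appeared in the training data.

from difflib import SequenceMatcher

def looks_memorized(question: str, generate, prefix_fraction: float = 0.5,
                    threshold: float = 0.9) -> bool:
    cut = int(len(question) * prefix_fraction)
    prefix, held_out = question[:cut], question[cut:]
    # `generate` is a placeholder for any text-completion call (hypothetical).
    continuation = generate(prefix, max_chars=len(held_out))
    similarity = SequenceMatcher(None, continuation, held_out).ratio()
    return similarity >= threshold

# The asymmetry noted above follows naturally: a near-verbatim match is strong
# evidence of exposure (high precision), but a non-match proves little (unknown
# recall), since a model can have seen an item without reproducing it on demand.
```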

The proficiency of LLMs on a foundational examination such as the USMLE provides an indication of these models’ potential role in the medical domain. GPT-4’s ability to achieve high accuracy signifies the progression of AI’s capabilities in deciphering complex medical knowledge. Such advancements could be central in assisting healthcare professionals, improving diagnostic accuracy, and facilitating medical education.

For medical students, LLMs could serve as supplementary tools for studying and understanding complex medical topics [38]. These models can provide instant feedback on practice questions and foster deeper understanding through personalized learning experiences. By presenting varied question formats and explanations, LLMs can help students identify knowledge gaps and reinforce learning. While LLMs can offer interactive experiences that textbooks cannot, it is important to recognize their limitations and use them alongside traditional study methods rather than as a replacement [39,40,41,42].

LLMs’ apparent proficiency in clinical knowledge raises the question of the extent to which these models could be integrated into the clinical setting and replace conventional point-of-care tools. Currently, LLMs like GPT-4 are not reliable enough to replace existing clinical decision-making processes; however, they can augment existing tools and methods by providing preliminary insights, second opinions, and possible diagnoses [43,44,45,46,47,48].

As for patients, publicly available LLMs can provide immediate information on health-related questions, which introduces both opportunities and challenges given the risk of misinformation [49,50,51,52]. The integration of LLMs into medical practice should therefore be approached with caution: the USMLE’s textual nature might not encompass the entire scope of clinical expertise, where skills such as patient interaction [53], hands-on procedures, and ethical considerations play an important role.

In surveying the expansive literature on the application of LLMs in healthcare, we found only three studies that are directly comparable in their use of official question sets. This observation highlights the potential disparity in evaluation methods and emphasizes the need for standardized benchmarks in assessing LLMs’ medical proficiency. The USMLE stands as a primary metric for evaluating medical students and residents in the U.S., and its role as a benchmark for LLMs warrants careful consideration. While it offers a structured and recognized platform, it is crucial to consider whether such standardized tests can fully encapsulate the depth and breadth of LLMs’ capabilities in medical knowledge. Future research should explore alternative testing mechanisms, ensuring a comprehensive and multidimensional evaluation of LLMs in healthcare.

4.1 Limitations

This systematic review has several limitations. First, the studies reviewed focused primarily on multiple-choice questions, which, although a prevalent format in the USMLE and an accepted method for assessing medical knowledge among students and clinicians, may not fully capture the complexity of real-world medical scenarios. Actual clinical cases often present with complexities and subtleties that do not strictly align with textbook descriptions. Hence, when contemplating the applicability of LLMs in a clinical setting, it is vital to recognize and account for this disparity. Second, our review intentionally excluded studies that examined other datasets used for USMLE preparation, such as MedQA. This decision was made to maintain consistency in the comparison of question sets; however, there is a wide array of research that evaluates LLMs using diverse question sets that mimic USMLE questions and measure medical proficiency. We also excluded studies that tested other LLMs, such as Med-PaLM. The exclusion of these studies potentially limits the comprehensiveness of our insights into the capabilities of LLMs in the medical domain.

5 Conclusion

This systematic review aimed to assess the performance of GPT models on the USMLE, the official licensing examination for medical students and residents in the US. The six papers included in the review evaluated the performance of several GPT models on two official USMLE question sets. We found that both ChatGPT and GPT-4 achieved passing-level accuracy, with GPT-4 outperforming its predecessors and reaching accuracy rates of 80–90%. The analysis showed that while prompting methods, such as open-ended formats, can slightly influence performance, the model’s inherent capabilities remain the primary determinant of success. This review highlights the continuous improvement in performance with newer GPT models. While all included papers used the same question sets, LLM evaluation varied due to differences in prompting methods, the inclusion or exclusion of media elements, and the approach to answer analysis. The proficiency of GPT-4 in tackling USMLE questions suggests its potential use in both medical education and clinical practice. However, as this review is limited by the small number of published papers, further research is required, and ongoing assessment of LLMs against trusted benchmarks is essential to ensure their safe and effective integration into healthcare.