Introduction

The advent and ongoing improvement of large language models (LLMs) and artificial intelligence (AI)-generated content present many exciting avenues by which clinical and non-clinical workloads may be reduced or streamlined [1]. LLMs are classified as generative AIs, which are trained to learn patterns and relationships between words and apply them to produce human-like responses to a given prompt [2]. Popular LLMs developed since 2022 include OpenAI’s Generative Pre-Trained Transformer 4 (ChatGPT-4), Google’s Pathways Language Model 2 (PaLM 2) and Meta’s Large Language Model Meta AI (Llama 2).

While there exist numerous potential applications for AI in healthcare, the adaptive capabilities of generative AIs enable them to tailor the language, content, and style of their outputs, potentially aiding the communication of complex medical information to patients. Emerging evidence suggests that LLMs can produce empathetic responses to patients’ questions, which may even be preferred over physician-generated responses [3,4,5,6,7,8]. Additionally, the medical accuracy of ChatGPT-4 and other LLMs is a topic of active research across many disciplines. In the context of urology, studies have shown both strong [9, 10] and poor [11] performance in providing evidence-based answers to short questions and clinical vignettes. However, the potential utility of LLMs in generating extended-form information for patients with urological conditions or undergoing urological procedures remains to be explored. Furthermore, the existing literature focuses primarily on ChatGPT, leaving a gap in understanding how other novel LLMs perform in comparison.

This study aims to explore the ability of multiple mainstream LLMs (ChatGPT-4, PaLM 2, and Llama 2) to generate accurate patient information leaflets (PILs) on urological topics. In addition, given the growing importance of information accessibility for patient populations with varying degrees of education and literacy, we also assess the readability of the PILs generated by each LLM.

Methods

Patient information leaflet generation

This study was conducted in November 2023; ethical approval was granted by the Social Research Ethics Committee (SREC) of University College Cork Medical School. Four common surgeries and conditions were selected for this study: circumcision, nephrectomy, overactive bladder syndrome (OAB) and transurethral resection of the prostate (TURP). These were chosen to provide a fair representation of the medical and surgical aspects of urology, encompassing the spectrum from benign to malignant surgery, simple to complex procedures, and disease-specific information. Three mainstream LLMs were then selected for assessment on the basis of popularity and accessibility: OpenAI’s ChatGPT-4, Meta’s Llama 2, and Google’s PaLM 2 (Google Bard). A comprehensive prompt was written for each condition asking each LLM to generate a medically accurate PIL that was understandable to a layperson. The prompts were structured as follows: “Imagine you are a panel of experienced urologists and clinicians tasked to develop a patient information leaflet for (selected procedure/condition). Please ensure that the leaflet is medically accurate and based on current best practices/guidelines for urology. Please use clinical terminology while ensuring the leaflet is understandable to a layperson. Be reassuring and include images/diagrams where applicable”. Additional instructions within the prompt included using the headings provided in the associated leaflet produced by the European Association of Urology and including as much information as possible under each sub-heading. Examples of such instructions were “include all benefits, risks and potential complications of the procedure” and “include descriptions of what to expect both pre- and post-operatively and the steps patients can take to be active players in their care towards optimising patient outcomes”. Each prompt was tested multiple times in fresh sessions to assess variability in LLM output. As variability was minimal, a fresh session was then used and the first PIL generated per topic was recorded for subsequent scoring and readability analysis.
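For illustration, a prompt of this form could also be issued programmatically. The sketch below uses the OpenAI Python client with an assumed model identifier and a simplified version of the study prompt; the study itself interacted with each LLM through its public interface, so this is an illustrative assumption rather than the study’s method.

```python
# Illustrative sketch: sending a PIL-generation prompt via the OpenAI Python client.
# The study used each LLM's public interface; the model name, helper function and
# prompt wrapper below are assumptions for demonstration purposes only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_pil(topic: str) -> str:
    prompt = (
        f"Imagine you are a panel of experienced urologists and clinicians tasked "
        f"to develop a patient information leaflet for {topic}. Please ensure that "
        f"the leaflet is medically accurate and based on current best "
        f"practices/guidelines for urology. Please use clinical terminology while "
        f"ensuring the leaflet is understandable to a layperson. Be reassuring and "
        f"include images/diagrams where applicable."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Example: generate a leaflet for one of the four study topics
leaflet = generate_pil("transurethral resection of the prostate (TURP)")
```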

PIL quality scoring

PILs were evaluated across 20 criteria from a previously developed quality checklist [12], each rated on a 5-point Likert scale (0 = not applicable, 1 = strongly disagree, 2 = disagree, 3 = neither agree nor disagree, 4 = agree, 5 = strongly agree). The mean total quality checklist score was calculated by summing the 20 checklist item scores and dividing by 20.

PILs were copied, with the original formatting output by each LLM, into separate Microsoft Word documents and labelled as version A, B or C. PIL scoring was undertaken in a single-session quality consensus meeting by a panel blinded to which LLM generated each PIL. The panel consisted of clinicians with varying degrees of urology training: three interns, three junior residents, three senior residents, and one consultant urologist. Each PIL was read in turn, discussed, and rated by consensus of the panel: panel members provided a score for each checklist criterion (Table 1), followed by active discussion of the pertinent points. Where disagreement persisted, the average score was recorded.
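The score aggregation described above amounts to simple averaging. The sketch below, using hypothetical scores rather than study data, illustrates how a criterion-level consensus score and the mean total quality checklist score could be computed.

```python
# Sketch of the score aggregation described above: where panel members disagreed
# on a criterion, the average score was recorded, and the mean total quality score
# is the sum of the 20 criterion scores divided by 20.
# All scores below are hypothetical examples, not study data.
from statistics import mean


def consensus_score(panel_scores: list[float]) -> float:
    """Average of the panel's scores for one checklist criterion."""
    return mean(panel_scores)


def mean_total_quality(criterion_scores: list[float]) -> float:
    """Mean total quality checklist score across the 20 criteria."""
    assert len(criterion_scores) == 20
    return sum(criterion_scores) / 20


# Hypothetical example: one criterion on which three raters disagreed
item_score = consensus_score([3, 4, 4])          # -> 3.67

# Hypothetical example: a PIL's 20 consensus criterion scores
pil_scores = [4, 3, 5, 4, 3, 4, 2, 3, 4, 5, 3, 4, 4, 3, 2, 4, 3, 4, 3, 4]
overall = mean_total_quality(pil_scores)         # -> 3.55
```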

Table 1 Consensus scores for each LLM-generated PIL based on checklist quality criteria adapted from Sustersic et al. 2017 [12]

Readability assessment

Following the quality consensus meeting, and so as not to influence the panel’s scoring, each LLM-generated PIL was assessed for reading difficulty using the Average Reading Level Consensus Calculator. This online calculator (freely available at https://readabilityformulas.com/calculator-arlc-formula.php) takes the average of 7 popular readability formulas (Automated Readability Index, Flesch Reading Ease, Gunning Fog Index, Flesch-Kincaid Grade Level, Coleman-Liau Readability Index, SMOG Index, Linsear Write Readability Index) and produces a difficulty score expressed as grade levels ranging from extremely easy (first grade, ages 6–7) to extremely difficult (college graduate, ages 23+).
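As a rough, open-source approximation of this averaging approach (an assumption; the study relied on the online calculator itself), the grade-level formulas could be averaged using the textstat Python library, as sketched below. Note that the Flesch Reading Ease score is on a 0–100 scale rather than a grade level, so it is omitted here and would require conversion before inclusion.

```python
# Approximate, open-source version of the grade-level averaging described above,
# using the textstat library (an assumption; the study used the online ARLC
# calculator). Flesch Reading Ease is excluded because it is not a grade level.
import textstat


def approximate_average_grade(text: str) -> float:
    grade_levels = [
        textstat.automated_readability_index(text),
        textstat.gunning_fog(text),
        textstat.flesch_kincaid_grade(text),
        textstat.coleman_liau_index(text),
        textstat.smog_index(text),
        textstat.linsear_write_formula(text),
    ]
    return sum(grade_levels) / len(grade_levels)


# Example usage with a PIL saved as plain text (hypothetical filename)
# with open("palm2_turp_pil.txt") as f:
#     print(approximate_average_grade(f.read()))
```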

Statistical analysis

Statistical analysis was performed using GraphPad Prism for Windows, version 10. Descriptive statistics were used to summarise the data. A one-way ANOVA was performed for each urological topic to assess differences in quality scores between the PILs produced by each LLM.
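For illustration, an equivalent per-topic comparison could be run outside GraphPad Prism; the sketch below uses SciPy’s one-way ANOVA with hypothetical placeholder scores, not study data.

```python
# Sketch of the per-topic comparison: one-way ANOVA across the three LLMs'
# checklist scores for a single topic, using SciPy instead of GraphPad Prism
# (an assumption). The score lists are hypothetical placeholders.
from scipy import stats

chatgpt4_scores = [3, 4, 3, 2, 4]   # hypothetical checklist item scores (one topic)
palm2_scores = [4, 4, 3, 4, 5]
llama2_scores = [3, 4, 4, 3, 4]

f_stat, p_value = stats.f_oneway(chatgpt4_scores, palm2_scores, llama2_scores)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```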

Results

PIL quality scores across LLMs

PILs generated by PaLM 2 had the highest overall average quality score (3.58), followed by Llama 2 (3.34) and ChatGPT-4 (3.08). PaLM 2-generated PILs scored higher than Llama 2 and ChatGPT-4 PILs in all topics except TURP, for which Llama 2 achieved the highest mean score (Fig. 1). Of the four leaflets generated by PaLM 2 and by ChatGPT-4, the circumcision PILs achieved the highest mean quality scores (3.95 and 3.55, respectively). For Llama 2, the TURP and circumcision PILs achieved the joint-highest mean quality score (3.5) (Fig. 1). There were no statistically significant differences in quality scores between the PILs generated by each LLM across topics (Table 1).

Fig. 1

Mean total quality scores for each LLM-generated PIL. Error bars represent standard deviation. There were no statistically significant differences within topics by one-way ANOVA. Llama 2 = Meta, PaLM 2 = Google Bard

Medical accuracy of LLM-generated PILs

The medical accuracy of the PILs was assessed using scores from the evidence-based medicine criterion (item 1 of the quality checklist, Table 1). The OAB and circumcision PILs had no major errors and contained appropriate evidence-based information. Nephrectomy PILs scored the lowest across LLMs, with considerable errors and inaccurate information (Table 1). This included citing “kidney stones” (PaLM 2) or “an enlarged prostate” (Llama 2) as indications for nephrectomy.

In terms of overall LLM performance, the mean evidence-based medicine criterion scores for PaLM 2 and Llama 2 were identical across PILs at 3.25, outperforming ChatGPT-4’s mean score of 3 (Table 1).

Inclusion of images

PaLM 2 was the only LLM to include images in its outputs. These images were taken from online web pages and included topic-specific anatomical diagrams or graphics illustrating aspects of the surgical procedure. While they did not include images, ChatGPT-4 and Llama 2 indicated where they would have included a figure (e.g. “Insert diagram showing TURP procedure, prostate gland, urethra and bladder.”), and briefly described in text what that figure would be showing to the reader (e.g. “Diagram: Illustration of the male urinary system showing the location of the prostate gland and the narrowing of the urethra due to BPH”).

Readability assessment and word count

PILs generated by PaLM 2 were found to be the most readable, corresponding to a 9th grade reading level (ages 14–15). Llama 2 PILs were the most difficult to read, corresponding to an 11th grade level (ages 16–17). Llama 2 PILs were also consistently the longest, with two exceeding 1000 words each. The OAB Llama 2 PIL was the longest overall at 2037 words, almost a 3-fold increase in length compared to the next longest OAB PIL (Table 2).

Table 2 Average reading level consensus calculator scores and word count of PILs

Discussion

Integrating LLMs into healthcare settings holds the promise of improving how information is communicated to patients. LLMs such as ChatGPT-4, PaLM 2 (Google), and Llama 2 (Meta) have demonstrated capabilities in understanding, summarizing, and generating content. This study is the first to compare different LLMs in generating PILs in urology. Examining the LLM-generated PILs provides insights into the potential benefits, challenges, and necessary considerations for implementing LLMs in healthcare communication.

Our results reveal variations in the performance of different LLMs in generating PILs for urological topics. Among the three LLMs assessed, PaLM 2 emerged as the superior LLM, achieving the highest overall scores on the PIL checklist criteria. This suggests that PaLM 2’s outputs were generally perceived as more comprehensive and better aligned with the quality checklist compared to ChatGPT-4 and Llama 2. PaLM 2 was also the only LLM to incorporate images into its PILs, potentially enhancing understanding. However, the study also highlights an important caveat: the varying degrees of error in the medical accuracy of LLM-generated content. It is therefore crucial to acknowledge the need for clinician oversight to ensure the accuracy of information provided by LLMs, especially in the context of providing patient information.

A number of studies have focused on the capabilities of ChatGPT-4 in the context of patient education [10, 13], in addition to its limitations [14]. In our study, ChatGPT-4 generated PILs had the lowest quality ratings amongst the three LLMs assessed. This underscores the importance of empirical testing to evaluate LLMs in a given context as their performance may not always align with expectations based on their utility in other applications.

In addition to evaluating the quality of the PILs, we conducted a readability analysis of their content. All PILs generated by the 3 LLMs examined in our study exceeded the average literacy level of Americans, which is suggested to be at the 7th to 8th grade level (ages 12 to 14) [15]. This finding is commensurate with previous studies demonstrating that ChatGPT-4 responses had heightened complexity, surpassing optimal health communication readability [16], and that ChatGPT-4-generated materials were less accessible, with longer and more complex responses than traditional patient education materials [17]. Moreover, a recent report comparing 5 LLMs echoed our results, finding Google Bard (PaLM 2) to produce the most readable information [18], albeit still in excess of the 7th to 8th grade level, as in our study. Altogether, these data underscore that while LLMs can automate the generation of content, it is essential to balance comprehensive information with readability, and thus accessibility, for effective patient communication.

Interestingly, our study suggests that the complexity of medical topics may significantly influence the quality of generated PILs. The contrast in scores between PILs on circumcision and nephrectomy suggests a potential advantage in focusing on simpler subjects for effective PIL generation. Moreover, examining nephrectomy as a broad topic highlights that a more specific focus, such as a particular type of nephrectomy, may have yielded more targeted, accurate and informative content.

It is also prudent to consider the ease of use of each LLM interface. Both ChatGPT-4 and PaLM 2 had similarly accessible and simple online interfaces. However, Meta does not provide an online interface for Llama 2; instead, the model must be downloaded or accessed through a third-party online interface not produced by Meta, creating a barrier to entry for anyone seeking to use it. Understanding the user-friendliness and accessibility of these LLMs is crucial for practical implementation in healthcare.

While this study provides valuable insights, certain limitations should be acknowledged. Our evaluation was confined to a specific set of urological topics, and the findings relating to the quality of LLM-generated PILs may not be generalizable to other medical specialties. Additionally, ongoing advancements in LLMs may influence outcomes, such as the accuracy of information provided and the inclusion of images, necessitating periodic reassessment. Although not explicitly tested in our study, future investigations might benefit from refining prompt design by opting for smaller, iterative, and sequential prompts to potentially enhance PIL quality and readability. Moreover, diversifying the PIL evaluation checklist to include specific medical accuracy and readability criteria could provide a more comprehensive assessment. Finally, we acknowledge the absence of patient involvement in the rating of the LLM-generated PILs. Further studies would be well placed to include patient ratings of the acceptability of, and satisfaction with, LLM-generated PILs. Despite these limitations, our study sheds light on the potential role of LLMs in generating PILs in the urology setting, thereby alleviating aspects of the associated workload on healthcare professionals.

Conclusion

In conclusion, this study provides valuable insights into the potential of LLMs, specifically ChatGPT-4, PaLM 2, and Llama 2, in generating PILs in urology. While these LLMs demonstrate the capacity to automate the creation of PILs, thereby reducing the workload of healthcare professionals, caution is warranted. Clinician input remains indispensable for ensuring medical accuracy and aligning readability levels with the intended lay patient audience. As the integration of LLMs in healthcare progresses, collaborative approaches that leverage both AI and human expertise will likely define the future landscape of patient medical communication.