Artificial Intelligence Chatbots: How Accurate Is the Information?

By: Bristol B. Whiles, MD, University of Kansas Medical Center, Kansas City; Russell S. Terry, MD, University of Florida College of Medicine, Gainesville | Posted on: 05 Jan 2024

Within a week of its release in November 2022, OpenAI’s ChatGPT large language model (LLM) had already provided millions of internet users with unprecedented access to a transformative artificial intelligence (AI) tool.1 While advanced LLMs had existed for some time, ChatGPT’s novelty was its user-focused interface and conversational communication style: one no longer needed a background in computer programming to use the technology meaningfully. The ChatGPT chatbot platform is remarkable for its ability to provide confident-sounding answers to almost any query about almost any topic, including health care and urology. ChatGPT met the needs of our interconnected world, satisfying the persistent desire for instantaneous access to information.

Health care providers quickly began using this technology and testing its ability to perform rote tasks such as creating call schedules, drafting prior authorization letters to insurance companies, and formatting replies to messages in the electronic medical record. When a global cohort of 456 urologists was surveyed in April 2023 about their LLM use and experiences, 48% reported using LLMs in academic practice for tasks such as idea generation.2 Nearly 20% of respondents also reported using ChatGPT in clinical practice, primarily for patient education applications. Interestingly, while equal proportions of respondents (29.6%) reported trusting and not trusting LLMs to provide accurate information, 78% and 56% of respondents believed that LLMs could play an important role in academic and clinical practice, respectively. Given this uptake among urologists, our patients are also likely to begin using these AI tools. It is therefore important for us to understand the accuracy and limitations of current AI chatbot outputs, especially for urologic health care advice.

Table. Summary Table Containing Many of the Published Studies That Have Examined Large Language Models in Urology

| Article | Large language model | Urologic topic evaluated | Scoring system | Accuracy, appropriateness, or correctness evaluation | Other info |
| --- | --- | --- | --- | --- | --- |
| Caution! AI Bot Has Entered the Patient Chat: ChatGPT Has Limitations in Providing Accurate Urologic Healthcare Advice3 | ChatGPT (3.5); February 13, 2023 | Nononcology | Brief DISCERN; yes or no for appropriateness | 60% appropriate | 92.3% had ≥1 incorrect, misinterpreted, or nonfunctional citation; quality content in 54% of responses |
| Evaluating the Effectiveness of Artificial Intelligence Powered Large Language Models Application in Disseminating Appropriate and Readable Health Information in Urology4 | ChatGPT (3.5); February 28, 2023 | Oncology; nononcology; emergency | 5-point Likert scale for appropriateness; utilized scoring aspects from DISCERN and QUEST tools | 78% appropriate; 100% accurate; 61% comprehensive | College reading level (Flesch Reading Ease of 35.5 ± 10.2 and Flesch-Kincaid Reading Grade Level of 13.5 [SD = 1.74]) |
| Quality of Information and Appropriateness of ChatGPT Outputs for Urology Patients5 | ChatGPT (3.5); April-May 2023 | Oncology; nononcology; emergency | 5-point Likert scale for accuracy, comprehensiveness, or appropriateness; section 2 of the DISCERN tool for information quality | 52% appropriate; nononcology > oncology or emergency questions (59% vs 53% vs 11%; P = .03); poor quality via DISCERN assessment | College graduate reading level (Flesch Reading Ease of 18 [IQR = 21] and Flesch-Kincaid Reading Grade Level of 15.8 [IQR = 3]) |
| How Well Do Artificial Intelligence Chatbots Respond to the Top Search Queries About Urological Malignancies?6 | ChatGPT (3.5), Perplexity, Chat Sonic, Microsoft Bing AI; April 10, 2023 (all) | Oncology | DISCERN | No appropriateness assessment; 4 out of 5 for quality of information | Moderate understandability (PEMAT-P of 66.7%); actionability moderate to poor (40%) |
| Evaluating the Performance of ChatGPT in Answering Questions Related to Urolithiasis7 | ChatGPT (3.5); August 3, 2023 | Nephrolithiasis | 4-point Likert scale | 95% completely correct | — |
| Development of a Personalized Chat Model Based on the European Association of Urology Oncology Guidelines: Harnessing the Power of Generative Artificial Intelligence in Clinical Practice8 | Uro_Chat; ChatGPT (3.5 and 4.0) | Oncology | Yes or no for adequate response | 100% adequate | Custom model developed based on EAU guidelines; freely available online |
| ChatGPT and Most Frequent Urological Diseases: Analysing the Quality of Information and Potential Risks for Patients9 | ChatGPT (4.0); March 18, 2023 | Oncology; nononcology | DISCERN | Not assessed | — |
Abbreviations: EAU, European Association of Urology.

ChatGPT (version 3.5) is the most studied AI chatbot model in our field (Table). Common response attributes evaluated by research groups have included appropriateness, comprehensiveness, clarity, and reproducibility, assessed with a variety of scoring instruments across the different studies. Appropriateness, although defined differently across studies, is used as a surrogate end point combining assessments of response accuracy, comprehensiveness, and clarity. Across studies, 52% to 78% of ChatGPT responses were deemed appropriate.3-5 The most commonly reported reason for judging responses to be inappropriate was the absence of vital information; poor response clarity was the least common reason.4 Concerningly, the one study that examined response reproducibility within a single chatbot (ChatGPT version 3.5) by posing the same question in 3 independent chat instances found substantial response inconsistency: 25% of question sets produced responses with dissimilar appropriateness ratings.3 Multiple groups also noted that the LLM provided responses at a difficult reading level, rated as college level or higher.4-6 This body of research highlights that despite the hype and excitement about this new tool, the information currently provided by chatbots to urologic health care queries is generally of low quality and should not be considered reliable medical advice.
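For readers less familiar with the readability scores cited above, the following minimal Python sketch shows how Flesch Reading Ease and Flesch-Kincaid Grade Level are derived from sentence, word, and syllable counts. The syllable counter here is a rough vowel-group heuristic for illustration only; the cited studies relied on established readability tooling rather than this code.

```python
# Illustrative sketch of the standard Flesch readability formulas.
# The syllable heuristic is a crude approximation (assumption), not validated tooling.
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as the number of vowel groups (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for `text`."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return ease, grade


ease, grade = readability(
    "Ureteroscopy is a minimally invasive procedure used to treat kidney stones."
)
print(f"Flesch Reading Ease: {ease:.1f}, Flesch-Kincaid Grade Level: {grade:.1f}")
```

Lower Reading Ease scores and higher Grade Level scores both indicate harder text; the college-level scores reported in the Table correspond to Reading Ease values well below the 60 to 70 range typically recommended for patient-facing materials.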

Figure. The Chatbot will see you now. This image was created by the authors using ChatGPT 4 with DALL-E plugin, October 16, 2023 version.

A major limitation of the current body of research on LLMs in health care is that no validated assessment instrument yet exists for evaluating model responses in a standardized, reproducible fashion. Researchers so far have designed their own Likert scales or adapted existing evaluation tools that were originally designed for other purposes (eg, DISCERN, Brief DISCERN, QUEST). Given how rapidly LLMs have been adopted and how entrenched they have already become in society, as well as the potential for real harm from low-quality responses to serious medical questions, there is an urgent need for a standardized evaluation tool designed specifically to assess chatbot responses to medical questions. Such a tool will be critical for conducting and interpreting high-quality future research in this emerging field.
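To make concrete what such a purpose-built instrument might need to capture, the sketch below outlines a hypothetical evaluation record that combines Likert-style domain ratings with the response attributes discussed above. All field names, scales, and the composite "appropriate" rule are illustrative assumptions, not a proposal from the cited studies or a validated tool.

```python
# Hypothetical structure for a standardized chatbot-response evaluation record.
# Field names, scales, and thresholds below are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class ChatbotResponseRating:
    question: str               # patient-facing urologic question posed
    model: str                  # e.g., "ChatGPT 3.5", with the query date
    accuracy: int               # 1-5 Likert: factual correctness
    comprehensiveness: int      # 1-5 Likert: were vital points covered?
    clarity: int                # 1-5 Likert: understandable to a lay reader
    reading_grade_level: float  # e.g., Flesch-Kincaid grade of the response
    harmful_if_followed: bool   # could the advice plausibly cause harm?
    citations_verified: bool    # do cited sources exist and support the claims?

    def appropriate(self) -> bool:
        """One possible composite definition: adequate on all three Likert
        domains and no plausible harm if the advice were followed."""
        return (
            min(self.accuracy, self.comprehensiveness, self.clarity) >= 4
            and not self.harmful_if_followed
        )
```

Recording each domain separately, rather than a single global judgment, would allow future studies to report comparable end points even when their overall definitions of "appropriateness" differ.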

In summary, chatbot responses to urologic health care queries are sometimes appropriate. While they are often written in clear language, they frequently provide information which is either not factual or not comprehensive. Responses are also limited in their practicality by being written at a very high reading level and often lacking actionable instructions for their end users, who may be patients. Nonetheless, it is increasingly apparent that AI and chatbots are here to stay (Figure). We as a field can choose to ignore the proverbial elephant in the room, or we can choose to participate meaningfully in their evaluation and evolution to ensure that future iterations of these tools wield their considerable computational power for healing and not for harm.

  1. Roose K. The brilliance and weirdness of ChatGPT. New York Times. December 5, 2022:B1.
  2. Eppler M, Ganjavi C, Ramacciotti LS, et al. Awareness and use of ChatGPT and large language models: a prospective cross-sectional global survey in urology. Eur Urol. 2023;10.1016/j.eururo.2023.10.014.
  3. Whiles BB, Bird VG, Canales BK, DiBianco JM, Terry RS. Caution! AI bot has entered the patient chat: chatGPT has limitations in providing accurate urologic healthcare advice. Urology. 2023;180:278-284.
  4. Davis R, Eppler M, Ayo-Ajibola O, et al. Evaluating the effectiveness of artificial intelligence-powered large language models application in disseminating appropriate and readable health information in urology. J Urol. 2023;210(4):688-694.
  5. Cocci A, Pezzoli M, Lo Re M, et al. Quality of information and appropriateness of ChatGPT outputs for urology patients. Prostate Cancer Prostatic Dis. 2023;10.1038/s41391-023-00754-3.
  6. Musheyev D, Pan A, Loeb S, Kabarriti AE. How well do artificial intelligence chatbots respond to the top search queries about urological malignancies? Eur Urol. 2023;10.1016/j.eururo.2023.07.004.
  7. Cakir H, Caglar U, Yildiz O, Meric A, Ayranci A, Ozgor F. Evaluating the performance of ChatGPT in answering questions related to urolithiasis. Int Urol Nephrol. 2023;10.1007/s11255-023-03773-0.
  8. Khene Z-E, Bigot P, Mathieu R, Rouprêt M, Bensalah K. Development of a personalized chat model based on the European Association of Urology oncology guidelines: harnessing the power of generative artificial intelligence in clinical practice. Eur Urol Oncol. 2023;10.1016/j.euo.2023.06.009.
  9. Szczesniewski JJ, Tellez Fouz C, Ramos Alba A, Diaz Goizueta FJ, García Tello A, Llanes González L. ChatGPT and most frequent urological diseases: analysing the quality of information and potential risks for patients. World J Urol. 2023;41(11):3149-3153.
