
UPJ INSIGHT
AUA Committee Members Rate Artificial Intelligence–Generated Responses for Female Stress Urinary Incontinence

By: Annie Chen, MD, Houston Methodist Hospital, Texas; Jerril Jacob, MHA, University of Texas Health Houston; Kuemin Hwang, MD, Houston Methodist Hospital, Texas; Kathleen Kobashi, MD, Houston Methodist Hospital, Texas; Ricardo R. Gonzalez, MD, Houston Methodist Hospital, Texas | Posted on: 17 Jul 2024

Chen A, Jacob J, Hwang K, Kobashi K, Gonzalez RR. AUA Guideline Committee members determine quality of artificial intelligence–generated responses for female stress urinary incontinence. Urol Pract. 2024;11(4):693-698. doi:10.1097/UPJ.0000000000000577

Study Need and Importance

Stress urinary incontinence (SUI) affects millions of women worldwide. Given ChatGPT’s growing ubiquity, patients may turn to the platform for advice about SUI. If patients are to use ChatGPT for adjunctive medical counsel, the urologic community needs to critically evaluate its output. The objective of this study was to have experts in the field evaluate the quality of the clinical information about SUI that ChatGPT generates (Table).

What We Found

Using standardized questionnaires, AUA committee members, who are experts in the field, rated ChatGPT-produced responses on SUI as moderate to moderately high in quality, moderate in reliability, excellent in understandability, and poor in actionability. The reading level of the material was advanced; lowering it is one way to make generated responses more comprehensible.
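
The article does not state which readability tool was used to judge reading level. As a minimal sketch, assuming a standard formula such as the Flesch-Kincaid grade computed with the open-source textstat package (an illustrative choice, not the authors’ method), reading level can be estimated like this:

```python
# Minimal sketch: estimating the reading level of an AI-generated response.
# Assumes the open-source textstat package (pip install textstat); the
# Flesch-Kincaid metric is an illustrative choice, not the study's method.
import textstat

response = (
    "Stress urinary incontinence is the involuntary leakage of urine "
    "during activities that raise abdominal pressure, such as coughing, "
    "sneezing, or exercise."
)

print(f"Flesch-Kincaid grade level: {textstat.flesch_kincaid_grade(response):.1f}")
print(f"Flesch reading ease: {textstat.flesch_reading_ease(response):.1f}")
# Patient education materials are often targeted at roughly a 6th- to
# 8th-grade level; substantially higher grades read as "advanced."
```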

Limitations

Surveys were based on a single ChatGPT query, and responses to the same query vary from one time to another. Additionally, the study drew on a small sample of experts in the field, which may introduce expert bias. A sketch illustrating that run-to-run variability follows.
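
As a minimal sketch of that variability, the snippet below resubmits one prompt to the OpenAI chat API several times; the model name, prompt, and sampling settings are illustrative assumptions, not the study’s protocol:

```python
# Minimal sketch: sampling the same query repeatedly to observe response
# variability. Requires the openai package and an OPENAI_API_KEY; the
# model and prompt below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROMPT = "What is stress urinary incontinence and how is it treated?"

responses = []
for _ in range(3):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,      # nonzero temperature -> varying output
    )
    responses.append(completion.choices[0].message.content)

# Identical prompts can yield different wording and emphasis, which is
# why conclusions drawn from a single query may not generalize.
for i, text in enumerate(responses, 1):
    print(f"--- Response {i} ---\n{text[:200]}\n")
```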

Interpretation for Patient Care

Although the quality of the SUI information was rated highly, the authors recommend holding ChatGPT, at a minimum, to the highest possible standard of complete accuracy and reliability. To reach patients who want to obtain health information by conversing with natural language processors, the urologic and gynecologic communities should develop this technology and integrate existing patient handouts from major societies, including the AUA; the Society of Urodynamics, Female Pelvic Medicine & Urogenital Reconstruction; the American College of Obstetricians and Gynecologists; and the International Continence Society, to disseminate high-quality information to patients.

Table. Average Scores for DISCERN and Patient Education Materials Assessment Tool Standardized Surveys by Category

Category                                           Definition     Diagnosis      Management     Surgery specific   Overall
DISCERN, average                                                                                                   3.63
  Reliability
    Average (SD)                                   3.25 (1.7)     3.00 (1.6)     3.10 (1.60)    3.29 (1.56)        3.16
    Raw score average (SD)                         65 (5.2)       60 (13.8)      62.3 (9.3)     65.8 (9.8)         63.3
  Quality of treatment descriptions, if described
    Average (SD)                                   2.38 (1.6)     2.8 (1.3)      3.17 (1.13)    3.62 (1.33)        2.99
    Raw score average (SD)                         47.6 (5.9)     56.2 (8.2)     63.5 (8.7)     73.5 (15.8)        60.2
  Overall quality, score (%)                       4 (80)         3.3 (67)       3.6 (72)       4 (80)             3.73 (74.6)
PEMAT
  % Understandability (SD)                         95.2 (6.7)     88.1 (6.7)     92.9 (7.1)     81.4 (9.5)         89.4
  % Actionability (SD)                             14.6 (10.6)    18.8 (7.2)     19.3 (4.7)     19.4 (6.2)         18.0
Accuracy, average*                                 3.83           3.33           3.17           3.67               3.5

Abbreviations: PEMAT, Patient Education Materials Assessment Tool.
*Accuracy is the average score on a 4-point Likert scale.
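
For readers unfamiliar with how PEMAT percentages are derived, the sketch below shows the standard scoring arithmetic (the share of applicable items rated “agree”); the item ratings are fabricated for illustration, not data from the study:

```python
# Minimal sketch of PEMAT-style scoring: each item is rated agree (1) or
# disagree (0), not-applicable items are excluded, and the score is the
# percentage of applicable items rated agree. Ratings here are made up.
def pemat_score(ratings):
    """ratings: list of 1 (agree), 0 (disagree), or None (not applicable)."""
    applicable = [r for r in ratings if r is not None]
    if not applicable:
        raise ValueError("No applicable items to score.")
    return 100.0 * sum(applicable) / len(applicable)

understandability = [1, 1, 1, 0, 1, None, 1, 1]  # illustrative ratings
actionability = [0, 0, 1, None, 0]               # illustrative ratings

print(f"Understandability: {pemat_score(understandability):.1f}%")
print(f"Actionability: {pemat_score(actionability):.1f}%")
```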
