Optimization and Validation of ChatGPT and GPT-4 as a Reasoning Engine in Clinical Urology Practice

By: Mohamed Javid, MS, MRCS, Chengalpattu Medical College, India; Madhu Reddiboina, MS, RediMinds, Southfield, Michigan; Mahendra Bhandari, MD, MBA, Vattikuti Urology Institute, Henry Ford Hospital, Detroit, Michigan, and Vattikuti Foundation, Detroit, Michigan | Posted on: 06 Jul 2023

Large language models such as OpenAI’s ChatGPT (Chatbot Generative Pre-trained Transformer) and GPT-4 are emerging as powerful tools for real-time clinical decision support.¹,² This technology can understand, generate, and respond to natural language (natural language processing), thereby facilitating more meaningful human-machine interaction.¹

We report our initial experience of developing interactive strategies and generating prompts from archived retrospective clinical scenarios, with the aim of obtaining pointed answers to clinically relevant questions and helping ChatGPT match the standard of care. From our personal referral collection, we retrospectively archived the cases of 4 patients with urological cancers at different stages of disease who had been treated with a variety of options. All 4 patients had provided their records for a tertiary opinion and needed assistance from the authors in the shared decision-making process.

Scenario 1

A 75-year-old woman with an incidentally found localized right renal lesion underwent a successful radical nephrectomy, and histopathology revealed localized clear cell renal cell carcinoma (RCC). Adjuvant therapy with oral sunitinib was administered, but she developed severe throat ulceration and hematemesis due to gastric perforation. Despite all efforts to manage the complications, she ultimately died.

Scenario 2

A 39-year-old female patient presented with an incidentally detected well-defined echogenic renal lesion. Subsequent imaging suggested a Bosniak type IV right renal lesion and a partial nephrectomy was performed. Histological analysis and immunohistochemistry reported a localized chromophobe RCC.

Scenario 3

A 75-year-old male with multiple comorbidities and a very long history of urothelial cancer and prostate carcinoma underwent intravesical bacillus Calmette-Guérin therapy for high-grade pT2 urothelial bladder cancer. After 6 months, cystoscopy revealed recurrence and subsequent biopsy confirmed high-grade papillary and solid urothelial carcinomas at stage pT1.

Scenario 4

A 54-year-old female with a persistent cough and abdominal mass was diagnosed with sarcomatoid RCC, with lung, bone, and vertebral metastases. Treatment with pazopanib showed significant tumor reduction and decreased metabolic activity. Local radiation therapy and tablet zoledronic acid were administered, leading to recovery and alleviation of symptoms.

From the above scenarios, we generated probing questions for a rigorous interactive process with both ChatGPT and GPT-4, iterating the input prompts until we could extract answers to our specific questions. The final answers of the artificial intelligence (AI) tools were validated against the standard of care, guidelines, and peer-reviewed published literature. We also compared the responses from ChatGPT with those from GPT-4 (see Figure).

Figure. A schematic illustration of the workflow of the optimization and validation process.
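
The iterate-until-validated loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual tooling: `ask_model`, `is_acceptable`, and `refine` are hypothetical callables supplied by the user (for example, a wrapper around a chat interface, a clinician's check against guidelines, and a rule for rewording the prompt), and `max_rounds` is an arbitrary cap.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RefinementResult:
    accepted: bool                 # True if a validated answer was obtained
    final_prompt: str              # last prompt actually sent to the model
    final_answer: str              # answer returned for that prompt
    history: List[Tuple[str, str]] = field(default_factory=list)


def optimize_prompt(
    initial_prompt: str,
    ask_model: Callable[[str], str],      # hypothetical: sends a prompt, returns text
    is_acceptable: Callable[[str], bool], # hypothetical: validation against references
    refine: Callable[[str, str], str],    # hypothetical: builds the next prompt
    max_rounds: int = 5,                  # assumed cap, not taken from the article
) -> RefinementResult:
    """Re-ask with refined prompts until the answer passes validation."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    prompt = initial_prompt
    history: List[Tuple[str, str]] = []
    for _ in range(max_rounds):
        answer = ask_model(prompt)
        history.append((prompt, answer))
        if is_acceptable(answer):
            return RefinementResult(True, prompt, answer, history)
        # Use the unsatisfactory answer to reword the next prompt.
        prompt = refine(prompt, answer)

    last_prompt, last_answer = history[-1]
    return RefinementResult(False, last_prompt, last_answer, history)
```

Keeping the full `history` of prompt and answer pairs makes it possible to review afterward which wording finally produced a usable response.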

ChatGPT provided insightful and clinically relevant responses and even suggested follow-up protocols at each critical juncture within the case scenarios. Its responses were compared to established clinical practices to assess its credibility. Whenever a response deviated from usual practice or guidelines, ChatGPT was able to give a rational justification, and the interactive discussion resembled a discussion between knowledgeable humans.

Different questioning styles were tested to identify the most effective approach for eliciting suitable AI responses. Conversations with itemized lists seemed to be the most efficient way to obtain targeted answers. Despite efforts to reduce ChatGPT’s legally cautious responses like “I am not a doctor but can provide general suggestions based on the information given,” the issue persisted. Hence, users must exercise discretion and disregard inadvertently generated legally sensitive phrases, underscoring the need for ethical considerations when employing AI in health care settings.
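
As an illustration of the itemized questioning style, the following Python sketch assembles a scenario and a set of questions into a single numbered prompt. The wording of the template and the function name `build_itemized_prompt` are assumptions made for this example; the article does not reproduce the exact prompts that were used.

```python
from typing import Sequence


def build_itemized_prompt(scenario: str, questions: Sequence[str]) -> str:
    """Combine a clinical scenario and questions into one numbered prompt.

    The template text below is illustrative only, not the authors' wording.
    """
    if not questions:
        raise ValueError("at least one question is required")

    lines = [
        "Clinical scenario:",
        scenario.strip(),
        "",
        "Please answer each question separately, keeping the same numbering:",
    ]
    # Number the questions so each answer can be matched to its question.
    lines += [f"{i}. {q.strip()}" for i, q in enumerate(questions, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        build_itemized_prompt(
            "A 39-year-old woman with a localized chromophobe RCC "
            "treated by partial nephrectomy.",
            [
                "What follow-up imaging is appropriate?",
                "Is adjuvant therapy indicated?",
            ],
        )
    )
```

Numbering the questions lets each part of the response be checked against guidelines one item at a time.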

When we analyzed the AI’s ability to maintain the primary focus of a complex conversation while branching off into related subtopics, ChatGPT handled multiple subtopics seamlessly and returned to the central theme when guided by appropriate prompts. This is helpful when discussing multiple subtopics, such as different management options for a single patient.

ChatGPT’s versatility in identifying overlooked possibilities or alternative explanations for clinical presentations can complement human expertise and aid decision-making. For example, in scenario 1, ChatGPT suggested sunitinib-induced gastritis and complications, even without details on the patient’s gastric perforation. The literature confirms that this is a rare but known complication.³ Similarly, in scenario 3, ChatGPT proposed Lynch syndrome, a genetic disorder that results from inherited mutations in DNA mismatch repair genes and increases cancer risk, as a possible familial cancer syndrome. Although this was not a classical Lynch syndrome case satisfying contemporary diagnostic criteria, a literature review indicated that the possibility could not be dismissed without further information and investigation.⁴,⁵

Interestingly, ChatGPT refined its responses when presented with better reasoning or alternative explanations. For example, it initially suggested tumor markers for RCC evaluation, but after follow-up questions, it clarified that they might offer no benefit for that patient. This suggests that the AI tool draws on general knowledge about cancers rather than reasoning specific to the individual case.

ChatGPT predicted patient prognosis and recurrence using general statements or, when given sufficient data, specific tools such as nomograms, and we verified these predictions against the published tools.⁶,⁷ With appropriate prompts, ChatGPT can identify its data needs and deficits, enabling users to actively provide relevant information and further optimize its performance. Furthermore, genuine follow-up dialogue and careful selection of the most suitable prompts can clarify uncertainties and improve the overall utility of ChatGPT’s responses, which underlines the value of effective communication and iterative refinement in AI-assisted decision-making.
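
The idea of identifying data deficits before asking for a prognosis can be mirrored on the user's side with a simple completeness check. The sketch below is illustrative only: the function names are invented for this example, and the list of required fields is supplied by the caller rather than taken from any specific nomogram.

```python
from typing import Iterable, List, Mapping, Optional


def missing_inputs(
    record: Mapping[str, Optional[object]],
    required: Iterable[str],
) -> List[str]:
    """Return the required fields that are absent or empty in the record."""
    return [name for name in required if record.get(name) in (None, "")]


def data_request(missing: List[str]) -> str:
    """Phrase a follow-up message listing the inputs still needed."""
    if not missing:
        return "All required inputs are available."
    return "Before estimating prognosis, please provide: " + ", ".join(missing) + "."


if __name__ == "__main__":
    # Field names here are hypothetical placeholders chosen for the demo.
    patient = {"tumor_size_cm": 4.5, "histology": "", "symptoms": "incidental"}
    needed = ["tumor_size_cm", "histology", "pathological_stage", "symptoms"]
    print(data_request(missing_inputs(patient, needed)))
```

Supplying the missing items in the next prompt follows the same pattern the model itself uses when it asks for more information.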

We found GPT-4, the latest version, to be more logical and more capable of in-depth analysis than ChatGPT. Both tools draw on considerable general intelligence and repeatedly respond with the same disclaimers. This likely reflects the rigorous reinforcement learning from human feedback applied before GPT-4 was made available to the public.⁸

In our preliminary experience, the technology should be used carefully, as a complement to human knowledge rather than a competitor with it. Our findings can contribute to a deeper understanding of ChatGPT’s potential in various clinical contexts and of the AI system’s strengths and weaknesses, so that its utility in health care settings can be maximized. Large language models such as ChatGPT and GPT-4 are major breakthroughs in the application of AI to the clinical practice of medicine. Hence, we highly recommend that urologists use them liberally as complementary tools. These technologies need extensive human interaction to become highly validated augmented intelligence tools.

1. Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare. 2023;11(6):887.
2. Salvagno M, Taccone FS, Gerli AG. Can artificial intelligence help for scientific writing? Crit Care. 2023;27(1):75.
3. Porta C, Paglino C, Imarisio I, Bonomi L. Uncovering Pandora’s vase: the growing problem of new toxicities from novel anticancer agents. The case of sorafenib and sunitinib. Clin Exp Med. 2007;7(4):127-134.
4. Lindner AK, Schachtner G, Tulchiner G, et al. Lynch syndrome: its impact on urothelial carcinoma. Int J Mol Sci. 2021;22(2):531.
5. Lonati C, Simeone C, Suardi N, Spiess PE, Necchi A, Moschini M. Genitourinary manifestations of Lynch syndrome in the urological practice. Asian J Urol. 2022;9(4):443-450.
6. Marshall FF. Use of the University of California Los Angeles Integrated Staging System to predict survival in renal cell carcinoma: an international multicenter study. J Urol. 2005;173(5):1530-1530.
7. Kattan MW, Reuter V, Motzer R, Katz J, Russo P. A postoperative prognostic nomogram for renal cell carcinoma. J Urol. 2001;166(1):63-67.
8. OpenAI. GPT-4 System Card. March 23, 2023. Accessed April 10, 2023. https://cdn.openai.com/papers/gpt-4-system-card.pdf