What Machine Learning Offers in Elucidating Vesicoureteral Reflux

By: Mandy Rickard, MN, NP; Armando J. Lorenzo, MD, MSc, FRCSC, FAAP, FACS | Posted on: 01 Jun 2022

The management of vesicoureteral reflux (VUR) has evolved over recent years. Efforts in risk stratification, more selective use of antibiotic prophylaxis and less invasive means of surgical correction are examples of aspects of care that have changed dramatically.1,2 Innovation and critical thinking will undoubtedly continue to advance the approach to this common condition.

Interestingly, in contrast to these advances, one important element has remained largely unchanged: the grading of reflux. The 5-grade classification system was first described in the 1980s to standardize the evaluation of voiding cystourethrograms of patients enrolled in the International Reflux Study in Children, and it continues to be favored by clinicians and researchers worldwide.3

Grading systems are important tools. In the setting of reflux, they aim to standardize the description of severity, which has important prognostic and therapeutic connotations. In theory, when interpreting a voiding cystogram, determining a child has a specific grade of VUR should be a straightforward endeavor, irrespective of the provider or health care system.

Unfortunately, herein lies one of the grading system’s greatest weaknesses: reliability. Different groups have reported issues in this arena, particularly when including different specialties and providers with varying levels of expertise.4 Despite some controversies, one thing is clear. Inter- (and intra-) rater reliability is far from perfect, particularly in situations where there are “moderate” degrees of reflux (grades III and IV), thus creating issues with misclassification for a large number of patients.5 It is easy to see how this can generate discrepancies in management and make comparison of clinical series problematic. Although this is probably a multifactorial problem, the most likely culprit appears to be the subjective interpretation of terms such as dilation, tortuosity and calyceal blunting.
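As a rough illustration of how agreement between two raters can be quantified, the sketch below computes Cohen's kappa, a common chance-corrected agreement statistic. It is an assumption about a typical approach and not necessarily the statistic used in the cited studies; the function name and the toy grades are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters grading the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of cases")
    n = len(rater_a)
    # Proportion of cases on which the two raters gave the same grade.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's grade frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use a single grade")
    return (observed - expected) / (1 - expected)


# Toy grades (hypothetical, not study data): the raters disagree on one
# case that sits between grade III and grade IV.
grades_rater_1 = [3, 3, 4, 4]
grades_rater_2 = [3, 4, 4, 4]
kappa = cohens_kappa(grades_rater_1, grades_rater_2)
```

A kappa of 1 indicates perfect agreement, while 0 indicates agreement no better than chance.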

One potential solution is to devise a whole new grading system, with less ambiguous wording and, potentially, fewer categories. Although tempting, this strategy would discard a system that has stood the test of time and would make the extrapolation of data from previous research efforts challenging. As an alternative, we can explore ways to improve reliability. Enter machine learning.

Machine learning, as a subfield of artificial intelligence, employs correlations in images and metrics to build analytical models for different purposes, such as classification and prediction.6 As a proof of concept, we developed a novel machine learning-based model called qVUR, which employs 4 objective imaging features to determine reflux grade: proximal, maximum and distal ureter width, and a numerical construct that reflects ureter tortuosity (see Figure). By mathematically defining some of the subjective imaging parameters (“tortuosity” and “dilation”), we add precision and paradoxically make analyses easier.
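To make the idea of mathematically defining tortuosity and dilation concrete, here is a minimal sketch. It assumes the ureter has been traced as an ordered list of (x, y) centerline points, from the kidney end to the bladder end, with a width measured at each point, and it defines tortuosity as path length divided by straight-line distance. These definitions, and the names `tortuosity` and `ureter_features`, are illustrative assumptions and are not taken from the qVUR implementation.

```python
import math


def path_length(points):
    """Total length of the polyline through the ordered (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def tortuosity(points):
    """Path length divided by end-to-end distance (1.0 for a straight line)."""
    if len(points) < 2:
        raise ValueError("at least two centerline points are required")
    chord = math.dist(points[0], points[-1])
    if chord == 0:
        raise ValueError("first and last points must differ")
    return path_length(points) / chord


def ureter_features(centerline, widths):
    """Collect four numeric features from a traced ureter.

    Assumes widths[i] is the ureter width at centerline[i], ordered from
    the proximal (kidney) end to the distal (bladder) end.
    """
    if len(centerline) != len(widths):
        raise ValueError("one width is required per centerline point")
    return {
        "proximal_width": widths[0],
        "maximum_width": max(widths),
        "distal_width": widths[-1],
        "tortuosity": tortuosity(centerline),
    }


# Toy tracing (hypothetical values in arbitrary units, not patient data).
toy_centerline = [(0, 0), (0, 3), (4, 3)]
toy_widths = [2.0, 5.0, 3.0]
features = ureter_features(toy_centerline, toy_widths)
```

Under this definition a perfectly straight ureter has a tortuosity of 1.0, and the value grows as the traced path bends away from the straight line between its ends.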

In our initial study,7 these features predicted low (I-III) versus high (IV-V) VUR grade with good performance. Currently, we are refining our model to predict individual reflux grades by analyzing images from more than 1,000 cases from multiple centers across North America. Our preliminary data show promising results (including an area under the receiver operating characteristic curve [AUC] of 0.83). Perhaps more importantly, we found that the reliability of each VUR grade was 4-fold greater with qVUR compared to traditional clinician grading.8
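For readers less familiar with the AUC, it can be read as the probability that a randomly chosen high-grade case receives a higher model score than a randomly chosen low-grade case. The sketch below computes it directly from that definition; the function name and the scores are made up for illustration and are not study data.

```python
def pairwise_auc(scores_high, scores_low):
    """AUC as the fraction of (high, low) pairs ranked correctly.

    Ties between a high-grade and a low-grade score count as half a win.
    """
    if not scores_high or not scores_low:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for h in scores_high:
        for l in scores_low:
            if h > l:
                wins += 1.0
            elif h == l:
                wins += 0.5
    return wins / (len(scores_high) * len(scores_low))


# Hypothetical model scores (probability of high-grade reflux).
toy_high_grade = [0.9, 0.7, 0.6]
toy_low_grade = [0.8, 0.65, 0.2, 0.1]
toy_auc = pairwise_auc(toy_high_grade, toy_low_grade)
```

An AUC of 0.5 corresponds to scores that rank cases no better than chance, and 1.0 to scores that separate the two groups perfectly.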

Figure. Overview of qVUR workflow, where user-generated inputs are fed into a trained decision tree to determine reflux grade. Norm., normal.

We believe this approach can lead to a more refined, modern way of grading reflux. The judicious use of computational aids along with other imaging parameters (such as the timing of reflux detection during the micturition cycle) and other clinical data (demographics, mode of presentation) can not only aid in delivering personalized care, but also make the approach more homogeneous and studies more generalizable. By grading reliably and removing subjective assessments, we can build stronger, more robust data sets. If we deploy these technologies in a fair and equitable way, families and less experienced providers may be empowered to independently explore options and have more meaningful discussions with specialists. In other words, we may be able to truly “speak the same language.” More objective and reproducible grading may also open the door to better correlation with emerging imaging strategies (particularly contrast ultrasound), potentially allowing us to apply treatment algorithms for patients with diagnoses established with different modalities. Akin to other conditions in urology, well-crafted and well-deployed machine learning strategies hold the promise of becoming strong clinical tools and not just a passing fad.

Unfortunately, there are important limitations and challenges ahead. Two critical ones are ease of use and interpretability. Models need to be user-friendly and available to busy clinicians on demand. To explore ways to implement our model, we deployed a Web application where users can upload screenshots and obtain an immediate output with minimal additional work (https://akhondker.shinyapps.io/qVUR/). Also, this novel methodology often relies on patterns determined by a computer, which can create a “black box” effect (ie, not knowing why the model classified each image the way it did). Thus, models should use features that are physiologically sound and rely, at least in part, on clinician input. There is also the risk of introducing tools that are flawed or biased, which can have disastrous consequences. Diligent and regular reassessment of the data sets (including any new information entered to refine the model) is required, perhaps more than with traditional statistical strategies.9 Lastly, there are questions of bioethics, access to technology and acceptance by the academic community. These are not insurmountable but clearly deserve attention as we move forward.

Ultimately, reflux is important because it can lead to adverse outcomes (pyelonephritis, renal scarring, hypertension, renal insufficiency).10,11 Reflux grading should be part of a systematic assessment that translates into disease severity and impact on the individual child. In the future, one can imagine a model where image analysis seamlessly and directly predicts which patients are likely to have spontaneous resolution, who is at risk for developing renal scarring and who would benefit from interventions (antibiotic prophylaxis or surgery). Artificial intelligence holds great promise in helping us achieve this goal.

  1. Wang H-HS, Gbadegesin RA, Foreman JW et al: Efficacy of antibiotic prophylaxis in children with vesicoureteral reflux: systematic review and meta-analysis. J Urol 2015; 193: 963.
  2. Blais A-S, Bolduc S and Moore K: Vesicoureteral reflux: from prophylaxis to surgery. Can Urol Assoc J 2017; 11: S13.
  3. Lebowitz RL, Olbing H, Parkkulainen KV et al: International system of radiographic grading of vesicoureteric reflux. Pediatr Radiol 1985; 15: 105.
  4. Metcalfe CB, MacNeily AE and Afshar K: Reliability assessment of international grading system for vesicoureteral reflux. J Urol 2012; 188: 1490.
  5. Schaeffer AJ, Greenfield SP, Ivanova A et al: Reliability of grading of vesicoureteral reflux and other findings on voiding cystourethrography. J Pediatr Urol 2017; 13: 192.
  6. Chen J, Remulla D, Nguyen JH et al: Current status of artificial intelligence applications in urology and their potential to influence clinical practice. BJU Int 2019; 124: 567.
  7. Khondker A, Kwong JCC, Rickard M et al: A machine learning-based approach for quantitative grading of vesicoureteral reflux from voiding cystourethrograms: methods and proof of concept. J Pediatr Urol 2022; 18: 78.e1.
  8. Khondker A, Kwong JCC, Yadav P et al: Moving towards quantitative vesicoureteral reflux grading using machine learning. Presented at fall congress of Societies for Pediatric Urology, Miami, Florida, December 2-5, 2021.
  9. Kwong J, McLoughlin L, Haider M et al: Standardized reporting of machine learning applications in urology: the STREAM-URO framework. Eur Urol Focus 2021; 7: 672.
  10. Tekgül S, Dogan HS, Hoebeke P et al: EAU Guidelines on Paediatric Urology. Arnhem, The Netherlands: European Association of Urology 2016; pp 290–323.
  11. Peters CA, Skoog SJ, Arant BS et al: Summary of the AUA guideline on management of primary vesicoureteral reflux in children. J Urol 2010; 184: 1134.