Research shows artificial intelligence models rely on shortcuts for detecting COVID-19.
Relying on shortcuts for work isn’t always a great idea – and that holds true for artificial intelligence (AI) with chest X-rays and COVID-19, as well.
New research from the University of Washington (UW) shows that AI, like humans, has a tendency to lean on shortcuts for disease detection with these scans. If the tools are deployed clinically, investigators said, the result could be diagnostic errors that affect real patients.
The team, led by Paul G. Allen School of Computer Science & Engineering doctoral students Alex DeGrave, who is also a medical student in the UW Medical Scientist Training Program, and Joseph Janizek, also a UW medical student, showed that rather than learning from actual medical pathology and clinically significant indicators, algorithms used during the pandemic relied on text markers and patient positioning specific to each dataset to predict whether someone was COVID-19-positive.
The team published their results May 31 in Nature Machine Intelligence.
“A physician would generally expect a finding of COVID-19 from an X-ray to be based on specific patterns in the image that reflect disease processes,” DeGrave said. “But, rather than relying on those patterns, a system using shortcut learning might, for example, judge that someone is elderly and, thus, infer that they are more likely to have the disease because it is more common in older patients. The shortcut is not wrong per se, but the association is unexpected and not transparent. And, that could lead to an inappropriate diagnosis.”
By depending on these shortcuts, the team said, the algorithms are far less likely to work outside the setting where they were trained and tested. Models that use shortcuts have a greater chance of failing when deployed at a different facility, opening the door to potentially incorrect diagnoses and improper treatment.
It’s the lack of transparency – the “black box” phenomenon associated with AI – that contributes heavily to the problem, the team said. Suspecting it existed in the models touted as detecting COVID-19, they conducted their own training and testing to assess the models’ trustworthiness. Specifically, they focused on “worst-case confounding,” a situation that gives rise to shortcuts and arises when an AI tool lacks sufficient training data to learn the underlying pathology of a disease.
“Worst-case confounding is what allows an AI system to just learn to recognize datasets instead of learning any true disease pathology,” Janizek said. “It’s what happens when all of the COVID-19-positive cases come from a single dataset while all of the negative cases are in another. And, while researchers have come up with techniques to mitigate associations like this in cases where those associations are less severe, these techniques don’t work in situations where you have a perfect association between an outcome, such as COVID-19 status, and a factor like the data source.”
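Janizek’s point about a perfect association can be made concrete with a toy example. In this sketch (entirely synthetic data, not the study’s code or datasets), a data-source marker is perfectly aligned with COVID-19 status in the training setting, so “classifying” by the marker alone looks flawless internally, yet drops to chance at an outside hospital where that alignment breaks:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Internal data: worst-case confounding. Every positive case comes from one
# dataset and every negative from another, so a source marker (e.g. text
# burned into the image) perfectly matches the label. (Hypothetical setup.)
y_internal = rng.integers(0, 2, n)
marker_internal = y_internal.astype(float)   # marker == label, by construction

# External data: at a new hospital, the marker is unrelated to disease status.
y_external = rng.integers(0, 2, n)
marker_external = rng.integers(0, 2, n).astype(float)

# A "model" that just thresholds the marker is perfect internally ...
internal_acc = float(np.mean((marker_internal > 0.5) == y_internal))
# ... and no better than a coin flip externally.
external_acc = float(np.mean((marker_external > 0.5) == y_external))

print(internal_acc)   # 1.0
print(external_acc)   # close to 0.5
```

Because the association is perfect, no amount of internal testing can distinguish this shortcut from genuine disease detection, which is why the mitigation techniques Janizek mentions fail here.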
For their study, the team trained several deep convolutional neural networks on X-ray images from a dataset that replicated the approach used in existing papers. In a two-step process, they tested each model’s performance on an internal set of images from the initial dataset that had been held out of the training data, and then on a second, external dataset meant to represent a new hospital system.
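That two-step evaluation can be sketched end to end on synthetic data. The snippet below is an illustration only, with a simple logistic-regression model standing in for a deep CNN and invented features standing in for X-ray content; it trains on confounded data, then scores an internal held-out set and an external set where the source marker no longer tracks the label:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n, confounded, rng):
    """Synthetic stand-in for chest X-ray features (hypothetical).
    Feature 0: weak, noisy pathology signal. Feature 1: a data-source
    marker that equals the label only in the confounded setting."""
    y = rng.integers(0, 2, n).astype(float)
    pathology = y + rng.normal(0.0, 2.0, n)
    marker = 2 * y - 1 if confounded else rng.choice([-1.0, 1.0], n)
    return np.column_stack([pathology, marker]), y

X_train, y_train = make_data(2000, confounded=True, rng=rng)
X_internal, y_internal = make_data(500, confounded=True, rng=rng)   # step 1
X_external, y_external = make_data(500, confounded=False, rng=rng)  # step 2

# Logistic regression fit by gradient descent (stand-in for a deep CNN).
w = np.zeros(2)
for _ in range(3000):
    p = 1 / (1 + np.exp(-X_train @ w))
    w -= 0.5 * X_train.T @ (p - y_train) / len(y_train)

def accuracy(X, y):
    return float(np.mean((X @ w > 0) == y))

internal = accuracy(X_internal, y_internal)
external = accuracy(X_external, y_external)
gap = internal - external   # the "generalization gap"
print(round(internal, 2), round(external, 2))   # near-perfect vs. near-chance
```

The model looks excellent on the internal test precisely because the shortcut still works there; only the external test exposes the gap.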
Their analysis showed that once models were removed from their original setting – where they showed high performance – they floundered. In outside environments, accuracy fell by half. DeGrave and Janizek’s team called this the “generalization gap,” asserting that confounding factors were behind the models’ predictive success with the initial dataset.
The team also applied explainable AI techniques in an attempt to pinpoint image features that played the largest role in each model’s predictions. They had similar results when they tried to reduce confounding by training the models with a second dataset that included both positive and negative COVID-19 images pulled from similar sources.
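One common explainable-AI technique of the kind the team applied is gradient-times-input attribution. In the hypothetical sketch below, a linear model trained on confounded synthetic data assigns its largest attributions to the data-source marker rather than the weak pathology cue, flagging the shortcut:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Three +/-1 features standing in for image content (hypothetical):
# 0 = weak pathology cue (agrees with the label 70% of the time),
# 1 = dataset text marker (agrees 100% of the time: worst-case confounding),
# 2 = irrelevant noise.
y = rng.integers(0, 2, n).astype(float)
label_sign = 2 * y - 1
pathology = np.where(rng.random(n) < 0.7, label_sign, -label_sign)
marker = label_sign.copy()
noise = rng.choice([-1.0, 1.0], n)
X = np.column_stack([pathology, marker, noise])

# Logistic regression trained by gradient descent (stand-in for a CNN).
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / n

# Gradient-times-input attribution: for a linear logit the contribution of
# feature j to a prediction is w_j * x_j, so mean |w_j * x_j| ranks influence.
attribution = np.abs(X * w).mean(axis=0)
top_feature = int(np.argmax(attribution))
print(top_feature)   # 1 -- the marker, not the pathology, drives predictions
```

In real images the analogue is a saliency map that highlights dataset-specific text or edges of the frame instead of lung tissue, which is the kind of evidence the team used to diagnose shortcut learning.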
These results, the team said, shed light on the degree to which high-performance medical AI systems can depend on undesirable, unreliable shortcuts for disease detection. They also contradict the belief that confounding isn’t as much of a danger when datasets come from similar sources. Fortunately, DeGrave said, it is unlikely that the models tested in this study were deployed clinically, largely because providers still use RT-PCR for COVID-19 diagnosis rather than chest X-rays.
Overall, said senior author Su-In Lee, Ph.D., associate professor in the Allen School, the findings from this study underscore the fundamental role explainable AI will play in ensuring the safety and efficacy of these models in medical decision-making.
“Our findings point to the importance of applying explainable AI techniques to rigorously audit medical AI systems,” she said. “If you look at a handful of X-rays, the AI system might appear to behave well. Problems only become clear once you look at many images. Until we have methods to more efficiently audit these systems using a greater sample size, a more systematic application of explainable AI could help researchers avoid some of the pitfalls we identified with the COVID-19 models.”