
New Literature Review Finds ChatGPT Effective in Radiology in 84 Percent of Studies


While noting a variety of pitfalls with the chatbot ranging from hallucinations to improper citations, the review authors found the use of ChatGPT in radiology demonstrated “high performance” in 37 out of 44 studies.

One study noted that ChatGPT had an 88.9 percent accuracy rate in determining appropriate imaging for breast pain. Across five studies, the median agreement between ChatGPT and reference standards, such as guidelines or radiologist decisions, was 83.6 percent. The use of GPT-4 reportedly offers enhanced capabilities for responding to higher-order thinking questions in radiology.

These are some of the findings from a systematic review, recently published in Diagnostic and Interventional Imaging, of 44 studies evaluating the use of ChatGPT in radiology. Researchers examined findings from studies looking at the use of the chatbot for adjunctive support in decision making, structuring radiology reports, generating radiology reports, improving patient communication, performance on radiology board exams, and as a standalone tool, according to the study.

Overall, the review authors pointed out that 37 out of the 44 reviewed studies noted “high performance” of ChatGPT in radiology applications and the remaining seven studies found “lower performance.”


Seventy percent of studies (14/20) found that adjunctive use of ChatGPT provided significant improvement in radiologist decision making. The researchers also noted that 100 percent of studies examining the use of ChatGPT with radiology reports cited significant benefits in structuring and simplifying the reports (8/8) as well as generating radiology reports (4/4). Five out of six studies suggested that ChatGPT enhanced patient communication.

“The findings suggest that ChatGPT shows promise in 84.1% of the studies and has the potential to significantly contribute to five broad clinical areas of radiology, including providing diagnostic and clinical decision support, transforming, simplifying and generating radiology reports, patient communication and outcomes, and performance on radiology board examinations,” wrote lead review author Pedram Keshavarz, M.D., a postdoctoral research fellow affiliated with the Department of Radiological Sciences at the David Geffen School of Medicine at the University of California, Los Angeles, and colleagues.

The researchers noted that 11 of the 44 reviewed studies compared ChatGPT 3.5 with GPT-4, with over 90 percent of these studies reporting enhanced capabilities for GPT-4 in addressing questions that require more advanced reasoning.

Research from 2023 noted a 60 percent accuracy rate for ChatGPT in responding to higher-order radiology board examination-type questions, and a 21 percentage point improvement (to 81 percent) when the same questions were answered with GPT-4.

“When comparing ChatGPT versions (v3.5 vs. v4), ChatGPTv4 showed a superior contextual understanding of radiology-specific terms and imaging descriptions. Further studies have pointed to ChatGPTv4’s potential in generating structured radiology reports, providing detailed report explanations and benefits,” pointed out Keshavarz and colleagues.

Three Key Takeaways

  1. Diagnostic support and decision making. ChatGPT demonstrated high performance in approximately 84 percent of the reviewed studies. More specifically, the chatbot demonstrated a high accuracy rate of 88.9 percent in determining appropriate imaging for breast pain and a median agreement of 83.6 percent with reference standards. Adjunctive use of ChatGPT significantly improved radiologist decision making in 70 percent of the studies examining this application, suggesting its potential as a diagnostic support tool.
  2. Radiology reporting. ChatGPT significantly benefits the structuring, simplification, and generation of radiology reports, as noted in all studies examining its use with radiology reports. This suggests its utility in enhancing efficiency and clarity in reporting processes, potentially improving workflow and communication.
  3. Limitations in subspecialty radiology. While ChatGPT demonstrates promise as a supplementary tool in radiology decision-making, its standalone use reveals significant limitations, particularly in subspecialty areas such as interventional radiology. One study noted a 40 percent accuracy rate in answering basic questions related to interventional radiology, indicating its struggles with specialized knowledge domains. Furthermore, ChatGPT was outperformed by neuroradiologists in another study, underscoring the importance of human expertise in complex and nuanced radiological interpretations. These findings caution against over-reliance on ChatGPT and emphasize the necessity of human expertise, especially in subspecialty areas.

While the review authors suggested that standalone automation of radiology tasks with ChatGPT is technically feasible, they noted the potential for significant errors, particularly in subspecialty areas of radiology. One of the reviewed studies revealed a 40 percent accuracy rate for ChatGPT in answering basic questions related to interventional radiology. Other research showed that ChatGPT was significantly outperformed by neuroradiologists.

“These findings show ChatGPT's role as a supplementary tool in clinical decision-making rather than a replacement for experienced professionals,” maintained Keshavarz and colleagues.

(Editor’s note: For related content, see “Can GPT-4 Improve Accuracy in Radiology Reports?,” “What New Research Reveals About ChatGPT and Ultrasound Detection of Thyroid Nodules” and “Can ChatGPT Pass a Radiology Board Exam?”)

Noting that all of the studies cited limitations with ChatGPT, ranging from hallucinations and fictitious references to privacy concerns, the study authors emphasized that none of the reviewed studies suggested the use of ChatGPT findings without radiologist review.

“Our study highlights (the) critical need for careful verification of the factual accuracy and relevance of responses from LLMs (large language models) when used in clinical settings, where incorrect information could disrupt medical operations or lead to detrimental outcomes, regardless of the response's confidence,” added Keshavarz and colleagues.
