Can ChatGPT adequately address common questions from patients about medical imaging?
With this question in mind, researchers recently examined the viability of the generative pre-trained language model (GPLM) to provide accurate and relevant answers to 22 imaging questions related to safety, imaging procedures, terminology, the radiology report, and other topics identified as being important to patients.
To assess consistency, the researchers posed each question to ChatGPT (version 3.5, OpenAI) three times, comparing responses given with no additional prompt against responses given with a modifying prompt that emphasized accuracy and readability for the average person, according to the study, which was recently published in the Journal of the American College of Radiology.
The researchers found no significant difference in accuracy between unprompted ChatGPT responses (82.6 percent) and responses to the modifying prompt (86.7 percent). For unprompted responses, the study authors noted consistency 71.6 percent of the time, a figure that increased to 86.4 percent with the modifying prompt.
“Automating the development of patient health educational materials and providing on-demand access to medical questions holds great promise to improve patient access to health information,” wrote study co-author Alessandro Furlan, M.D., an associate professor of radiology and chief of the Abdominal Imaging Section at the University of Pittsburgh Medical Center (UPMC), and colleagues.
While the researchers found that 98.5 percent of unprompted responses and 98.9 percent of responses to modifying prompts were at least partially relevant to the posed questions, they noted that complete relevance only occurred with 66.7 percent of unprompted ChatGPT responses and 79.6 percent of responses to modifying prompts.
The researchers acknowledged that the lowest percentages for full relevance were seen with safety-related questions such as “What are the risks of MRI during pregnancy?” and “Do X-rays or CT scans cause cancer?” Only 50 percent of the unprompted responses to safety questions were deemed fully relevant, with an increase to 64.6 percent when there was a modifying prompt.
“While the accuracy, consistency, and relevance of the ChatGPT responses to imaging-related questions are impressive for a GPLM, they are imperfect,” noted Furlan and colleagues. “By clinical standards, the frequency of inaccurate statements that we observed precludes its use without careful human supervision or review.”
Three Key Takeaways
- ChatGPT's accuracy and consistency. Researchers found that ChatGPT (version 3.5) answered medical imaging questions with over 80 percent accuracy and demonstrated a 71.6 percent consistency rate for unprompted responses, rising to 86.4 percent with modifying prompts.
- Relevance of responses. While most responses were at least partially relevant to the questions, complete relevance was lower, with 66.7 percent of unprompted responses and 79.6 percent of prompted responses considered fully relevant. Safety-related questions had the lowest full relevance percentages.
- Readability issues. The readability of ChatGPT responses was a concern, as none of the responses were at or below an eighth-grade reading level. The high complexity of responses could hinder patient access to health information.
Using Flesch-Kincaid readability testing, the researchers noted no significant differences in reading level between prompted and unprompted ChatGPT responses. The study authors pointed out that none of the responses were at or below an eighth-grade reading level. They added that only 30 percent of unprompted ChatGPT responses and 41 percent of prompted responses were below a 12th-grade reading level.
“The ability to understand health information presented to patients is crucial for their capacity to make informed medical decisions,” maintained Furlan and colleagues. “As it currently stands, the high complexity of the responses clouds the promise of true patient access to health information.”
In regard to study limitations, the authors conceded that the rapid evolution of ChatGPT is likely to affect the platform's effectiveness in answering common patient questions about medical imaging. Pointing out that the questions used to assess ChatGPT were written by radiologists, the researchers noted these questions may not reflect the variability with which patients might ask similarly themed questions. The study authors also noted a lack of clarity as to how ChatGPT would handle questions posed in languages other than English.