In what may be the first study to compare the use of large language models across multiple areas of clinical screening, researchers suggested that ChatGPT-4 and Bard may play a beneficial role in radiology decision-making for the assessment of common cancers such as breast cancer and lung cancer.
For the study, recently published in Academic Radiology, researchers examined the use of prompt engineering to enhance the accuracy of the large language models relating to the appropriate use of imaging for conditions including breast cancer, ovarian cancer, colorectal cancer, and lung cancer.
Employing the American College of Radiology (ACR) Appropriateness Criteria, the researchers compared the performance of ChatGPT-4 (OpenAI) and Bard (Google) with open-ended (OE) prompts and more specific select-all-that-apply (SATA) prompts.
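To make the distinction between the two prompt formats concrete, the sketch below constructs a hypothetical example of each. The clinical scenario and answer options are invented for illustration; the study's actual prompt wording is not reproduced in this article.

```python
# Hypothetical illustration of the two prompt formats compared in the study.
# The scenario text and imaging options below are assumptions for
# illustration only, not the study's actual prompts.

SCENARIO = ("A 45-year-old woman at average risk presents for "
            "routine breast cancer screening.")

# Open-ended (OE) prompt: the model must name the appropriate imaging itself.
oe_prompt = (f"{SCENARIO}\n"
             "What is the most appropriate imaging study, per the "
             "ACR Appropriateness Criteria?")

# Select-all-that-apply (SATA) prompt: the model chooses from
# predefined options supplied in the prompt.
options = ["Digital mammography", "Breast MRI without contrast",
           "Ultrasound breast", "CT chest with contrast"]
sata_prompt = (f"{SCENARIO}\n"
               "Select all imaging studies that are usually appropriate:\n"
               + "\n".join(f"{i + 1}. {opt}"
                           for i, opt in enumerate(options)))

print(oe_prompt)
print()
print(sata_prompt)
```

The practical difference is that the OE format asks the model to generate an answer freely, while the SATA format constrains it to a fixed answer set, which is the distinction the researchers' scoring comparison turns on.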
For breast cancer screening, the researchers noted fairly similar accuracy between ChatGPT-4 and Bard with both prompt types. ChatGPT-4 had an average OE prompt score of 1.82 (out of 2) in comparison to 1.89 for Bard. With SATA prompts, Bard demonstrated 82 percent accuracy and ChatGPT-4 offered 85 percent accuracy, according to the study authors.
“We observed that ChatGPT-4 and Google Bard displayed impressive accuracy in making radiologic clinical decisions when prompted in either OE or SATA formats,” wrote study co-author Young H. Kim, M.D., Ph.D., who is affiliated with the University of Massachusetts Chan Medical School in Worcester, Mass., and colleagues.
However, the study authors did note a few differences between the large language models.
• Average scoring with the predefined options of SATA prompts showed that ChatGPT-4 outperformed Bard across all cancer types in the study, with the difference being most pronounced for ovarian cancer screening. Overall, the researchers pointed out an 83 percent average accuracy score for ChatGPT-4 in comparison to 70 percent for Bard.
• For average OE prompt scoring, ChatGPT-4 outperformed Bard for lung cancer and ovarian cancer screening while Bard was slightly better for breast and colorectal cancer screening, according to the study.
• Assessment of the large language models for ovarian cancer screening revealed the largest difference between the two models. For ovarian cancer screening, Bard had an OE prompt score of 0.50 (out of 2) in comparison to 1.50 for ChatGPT-4. The researchers also noted 41 percent accuracy for Bard on SATA prompts in contrast to 70 percent for ChatGPT-4.
Three Key Takeaways
- Comparable accuracy in breast cancer screening. The study found that both ChatGPT-4 and Google Bard demonstrated impressive accuracy in making radiologic clinical decisions for breast cancer screening. The accuracy scores were fairly similar between the two models, with ChatGPT-4 scoring 1.82 (out of 2) on average with open-ended prompts, compared to Bard's score of 1.89. For select-all-that-apply (SATA) prompts, Bard achieved 82 percent accuracy, while ChatGPT-4 offered a slightly higher accuracy of 85 percent.
- Differential performance across cancer types. The study observed differences in the performance of ChatGPT-4 and Bard across different cancer types. Notably, ChatGPT-4 outperformed Bard in average scoring with predefined options in SATA prompts across all cancer imaging, with a more significant difference in ovarian cancer screening. ChatGPT-4 achieved an 83 percent average accuracy score, while Bard scored 70 percent. For ovarian cancer screening specifically, ChatGPT-4 had a higher accuracy in both open-ended and SATA prompts compared to Bard.
- Effectiveness of prompt engineering. The researchers highlighted the importance of prompt engineering in improving the accuracy of responses from large language models (LLMs). While both open-ended (OE) and select-all-that-apply (SATA) prompts were used, the study found that prompt engineering was more effective at enhancing performance in the OE format for both ChatGPT-4 and Bard.
Additionally, while OE prompts improved the performance of both large language models, the study authors did not see similar benefits with the use of SATA prompts. While acknowledging the potential for bias in the training data toward OE prompts, the researchers said the flexibility of OE prompts may make them preferable to SATA prompts.
“ … Our findings support the idea of implementing (prompt engineering) in an OE format to improve the accuracy of the responses in unique clinical settings, such as when imaging modalities are not provided or when clinicians are unable to list all the possible imaging modalities for a given scenario,” added Kim and colleagues.
In regard to study limitations, the study authors conceded that scoring of the LLM responses is subjective and noted there were only two scorers for the study. The researchers also cautioned against broad extrapolation of the findings, given that the study was limited to assessment of the LLMs in screening for four types of cancer and to clinical guidelines established by the ACR.