March 13, 2024

What New Research Reveals About ChatGPT and Ultrasound Detection of Thyroid Nodules

In a comparison of image-to-text large language models (LLMs), ChatGPT 4.0 offered a 95 percent sensitivity rate and an 83 percent AUC, results comparable to those of two senior radiologists and one junior radiologist who used an LLM to differentiate between malignant and benign thyroid nodules on ultrasound.

Emerging ultrasound (US) research suggests that large language models (LLMs) such as ChatGPT 4.0, when paired with image-to-text interpretation, may offer assessment comparable to that of radiologists working with LLMs in diagnosing thyroid nodules.

For the retrospective study, recently published in Radiology, researchers compared ChatGPT 3.5 (OpenAI), ChatGPT 4.0 (OpenAI) and Gemini (Google, formerly known as Bard) in the assessment of thyroid nodules on 1,161 ultrasound images from a total of 725 patients (mean age of 42.2 years). The study authors also compared three approaches: LLM assessment with image-to-text interpretation, a combination of LLM and human assessment (human-LLM), and a convolutional neural network (CNN) analysis, with the CNN trained on over 18,000 images.

For the image-to-text approach, the researchers noted a 79 percent intra-LLM agreement for Gemini, which was 8 percentage points higher than that of ChatGPT 4.0 (71 percent) and 30 percentage points higher than that of ChatGPT 3.5 (49 percent). The study authors also pointed out a 75 percent inter-LLM agreement on image-to-text interpretation between Gemini and ChatGPT 4.0.

For predicting benign versus malignant thyroid nodules on ultrasound, Gemini and ChatGPT had comparable AUCs for image-to-text interpretation and human-LLM assessments by the four reviewing radiologists, according to the study. However, the researchers noted differences between the two LLMs with respect to sensitivity and specificity rates.

With the image-to-text approach, ChatGPT 4.0 offered a 95 percent sensitivity rate in comparison to 87 percent for Gemini. In the Human-LLM interaction model, the use of ChatGPT led to 12 percent, 17 percent, and 7 percent higher sensitivity for three of the four readers (two junior radiologists and one senior radiologist) in comparison to Gemini. However, the researchers said the use of Gemini resulted in higher specificity for image-to-text assessment (75 percent vs. 71 percent) and for human-LLM interactions for all readers (including a 12 percent increase for one junior radiologist).
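For readers less familiar with these metrics, the following minimal Python sketch shows how sensitivity, specificity, and accuracy are conventionally derived from a reader's confusion counts. The class name, function names, and counts are hypothetical and chosen only for illustration. They are not data from the study, and the sketch does not reproduce the authors' analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for one reader or model, judged against pathologic findings."""
    true_positive: int   # malignant nodules called malignant
    false_negative: int  # malignant nodules called benign
    true_negative: int   # benign nodules called benign
    false_positive: int  # benign nodules called malignant


def sensitivity(c: ConfusionCounts) -> float:
    """Share of malignant nodules that were correctly flagged."""
    return c.true_positive / (c.true_positive + c.false_negative)


def specificity(c: ConfusionCounts) -> float:
    """Share of benign nodules that were correctly cleared."""
    return c.true_negative / (c.true_negative + c.false_positive)


def accuracy(c: ConfusionCounts) -> float:
    """Share of all nodules that were classified correctly."""
    total = (c.true_positive + c.false_negative
             + c.true_negative + c.false_positive)
    return (c.true_positive + c.true_negative) / total


# Hypothetical counts for illustration only (NOT taken from the study).
example = ConfusionCounts(
    true_positive=95, false_negative=5,
    true_negative=71, false_positive=29,
)

print(f"sensitivity: {sensitivity(example):.2f}")
print(f"specificity: {specificity(example):.2f}")
print(f"accuracy:    {accuracy(example):.2f}")
```

A model that rarely misses malignant nodules but flags more benign ones will score high on sensitivity and lower on specificity, which is the kind of trade-off the researchers described between ChatGPT 4.0 and Gemini.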

“This study affirms — to our knowledge for the first time — the feasibility of LLMs in handling the reasoning questions associated with medical diagnosis using the reference standard of pathologic findings within the structured domain of US imaging–based diagnosis,” wrote senior author Wei Wang, M.D., Ph.D., who is affiliated with the Department of Medical Ultrasonics and the Ultrasomics Artificial Intelligence X-Laboratory within the Institute of Diagnostic and Interventional Ultrasound at the First Affiliated Hospital of Sun Yat-Sen University in Guangzhou, China, and colleagues.

The study authors emphasized that LLMs are dependent upon human interpretation. While the researchers found that the performance of image-to-text interpretation with LLMs was generally comparable to or higher than that of a human-LLM combination approach, they noted that one senior reader had four percent higher AUC, accuracy, sensitivity, and specificity rates with the human-LLM approach in comparison to image-to-text interpretation with Gemini.

“This emphasizes the ongoing indispensable role of human expertise, despite artificial intelligence advances in medical imaging and diagnostics,” added Wang and colleagues.

Three Key Takeaways

Comparative performance. The study indicates that large language models (LLMs), specifically ChatGPT 4.0 and Gemini, integrated with image-to-text interpretation show comparable performance to the combination of radiologists and LLMs in diagnosing thyroid nodules on ultrasound. This suggests that LLMs can play a significant role in medical image interpretation, potentially reducing the need for a combined assessment involving human experts.

Sensitivity and specificity differences. While Gemini and ChatGPT demonstrated comparable area under the curve (AUC) for predicting benign versus malignant thyroid nodules, there were differences in sensitivity and specificity rates. ChatGPT 4.0 exhibited higher sensitivity rates, particularly in the image-to-text approach, making it potentially more effective in identifying true positive cases. However, Gemini showed higher specificity, indicating a better ability to correctly identify true negative cases.

Incorporating radiologist expertise and AI integration. Noting that one senior reader exhibited higher sensitivity, accuracy and specificity with the human-LLM model in comparison to the image-to-text capability of Gemini, the study authors emphasized the importance of radiologist experience. Based on some of the study findings, the researchers also noted that LLMs may play a role in improving the diagnostic consistency of radiologists with less experience.

The CNN model did offer a higher AUC (88 percent) as well as higher accuracy (89 percent vs. 84 percent and 82 percent) and specificity rates (81 percent) than image-to-text approaches with ChatGPT 4.0 and Gemini.

However, noting similar sensitivity between ChatGPT’s image-to-text approach and the CNN model (95 percent), the study authors suggested that LLMs offer enhanced transparency into diagnostic decisions that may bolster the consistency of less experienced radiologists.

“Particularly beneficial for junior doctors, using LLMs enhances symptom recognition and diagnosis understanding, representing a promising avenue for incorporating artificial intelligence into medical diagnosis,” posited Wang and colleagues.

(Editor’s note: For related content, see “Can ChatGPT and Bard Bolster Decision-Making for Cancer Screening in Radiology?,” “Pediatric Thyroid Nodules on Ultrasound: Deep Learning Model and TI-RADS Show Higher Sensitivity than Radiologist Assessment” and “CT Update: FDA Changes Course on Post-ICM Thyroid Monitoring in Young Children.”)

In regard to study limitations, the researchers acknowledged that assessment of the complex analysis capabilities of the LLMs included in the study may have been limited given that diagnosis with those models was geared to TI-RADS with limited signs. The study authors also conceded that the adopted voting mechanism may not provide an accurate assessment of the error rate.
