
Clinical Applications of LLMs in Radiology: Key Takeaways from RSNA 2025

Reviewing an array of study presentations from the recent RSNA meeting, this author discusses insights from emerging research, ranging from the use of large language models (LLMs) to extract information from radiology reports to the potential impact of LLM assistance in differential diagnosis for brain MRIs.

Many of us are still in the reflective stage after RSNA 2025 as we attempt to summarize all the fascinating AI-focused sessions, panels, and hallway discussions. Throughout the seven imaging informatics sessions this year, a clear message became apparent: the discussions surrounding LLMs seemed broader and more clinically grounded than in previous years.

As a co-moderator of a scientific session focused on clinical applications of LLMs in radiology, I had the opportunity to see several real-world studies presented side by side. Accordingly, I would like to share a few takeaways from this session for radiologists and imaging leaders.

How To Mine Value from the Complex Reality of Radiology Reports

One of the presentations highlighted a familiar truth. While radiology reports remain one of the richest sources of clinically annotated information, manual extraction of diagnoses or measurements from them remains inefficient and inconsistent.

In a multi-institutional effort involving six centers, including the Mayo Clinic, the University of California-San Francisco (UCSF), Mass General Brigham, the University of California-Irvine (UC Irvine), the Moffitt Cancer Center, and Emory University, a team of investigators evaluated whether LLMs can reliably extract key diagnostic labels across multiple modalities and conditions.1 They used radiology reports covering five diagnostic categories: liver metastasis on abdominal CT, subarachnoid hemorrhage on brain CT, pneumonia on chest radiographs, cervical spine fracture on CT, and glioma progression on brain MRI, with diagnoses manually verified by a radiologist.
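For readers curious what such extraction looks like in practice, below is a minimal sketch of LLM-based label extraction from a single report, written against the OpenAI Python client. The prompt wording, label schema, and model choice are illustrative assumptions, not the study's actual pipeline, and sending reports to a cloud API raises the privacy concerns discussed later in this article.

```python
# Minimal sketch of structured label extraction from a radiology report.
# The prompt, schema, and model are assumptions for illustration only.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

report = "IMPRESSION: New hepatic lesions concerning for metastatic disease."

prompt = (
    "Read the radiology report and answer in JSON: "
    '{"liver_metastasis": "present" | "absent" | "indeterminate"}\n\n'
    f"Report:\n{report}"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # forces parseable JSON
    temperature=0,  # deterministic output aids reproducibility
)
print(json.loads(response.choices[0].message.content))
# e.g. {"liver_metastasis": "present"}
```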

Several findings from this presentation stood out to me. Instruction-tuned larger models outperformed smaller chat-style variants. Even after harmonization, performance varied across centers. Models performed worse on pneumonia than on fractures or liver metastasis.

Another underappreciated operational nuance highlighted during this discussion is that a center’s lower performance often reflects more variable radiology report structure, not differences in clinical expertise.

The Emory University center differed from the others in that its reports were from earlier dates and came from a larger group of radiologists than at the other centers.

Making the Case for Small Local Multitask Models

The next presentation, by researchers from Johns Hopkins University, challenged the assumption that larger language models are always better. The presenters argued that there are core barriers to LLM adoption in the clinical setting.2

For example, most cloud-based LLMs require sensitive patient data to be sent outside the hospital environment. Additionally, the data used to train these models is often opaque. Finally, the prompt engineering necessary to get reliable output from LLMs is not intuitive for clinicians.

(Editor's note: For additional coverage of the RSNA conference, click here.)

The study authors proposed small language models (SLMs) that can be trained to perform the tasks most relevant to a given hospital, using data from that environment. For example, the OPT-350M model can be fine-tuned with low-rank adaptation (LoRA) on data from the local hospital setting. Their work evaluated the performance of the fine-tuned SLM on three tasks: medical report labeling, DICOM metadata harmonization, and impression generation from findings.
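As a rough illustration of the approach, the sketch below attaches LoRA adapters to OPT-350M using the Hugging Face transformers and peft libraries. The target modules, rank, and other hyperparameters are assumptions for demonstration, not the presenters' actual configuration.

```python
# Sketch: attaching LoRA adapters to OPT-350M for local fine-tuning.
# Hyperparameters below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Low-rank update matrices are injected into the attention projections;
# only these small adapters are trained, which is why such a job can
# run on a CPU rather than requiring a GPU or cloud compute.
lora_config = LoraConfig(
    r=8,                               # rank of the update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of weights
```

From here, the adapted model can be trained with a standard transformers Trainer loop on local report data.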

One of the most interesting findings from this work was that both single-task and multi-task SLMs outperformed LLMs such as GPT-4o. Additionally, such SLMs can be trained and deployed on a central processing unit (CPU) without the need for an expensive graphics processing unit (GPU) or cloud compute.

Brain MRIs, the ‘Expertise Paradox’ and Human-AI Interactions

An interesting study by researchers from the Technical University of Munich (TU Munich) evaluated how LLM assistance influences performance in brain MRI differential diagnosis in readers with varying experience.3

They collected differential diagnoses from four neurology/neurosurgery residents, four radiology residents, and four neuroradiologists. They found that the LLM's absolute performance improved as reader experience increased. However, the readers' gain from LLM assistance diminished with increasing expertise.

For example, the neurology/neurosurgery residents produced poor image descriptions and benefited significantly from LLM assistance. Radiology residents benefited moderately, but there was almost no benefit for neuroradiologists. The presenters suggested this could be due to a ceiling effect in neuroradiologists' performance, which LLMs cannot further improve. Given that junior readers benefit more than expert neuroradiologists, guardrails are needed to avoid introducing anchoring bias, such as showing AI predictions only after readers have recorded their own assessments.

This talk also highlighted the importance of evaluating performance from a human-AI perspective, exploring metrics such as diagnostic agreement, time to diagnosis, and clinically relevant error types alongside model accuracy.
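To make that concrete, here is a small, hypothetical sketch of such an evaluation. The diagnoses and read times are mock data, and chance-corrected agreement (Cohen's kappa) stands in for whatever agreement statistic a given study actually uses.

```python
# Hypothetical human-AI evaluation beyond raw accuracy.
# All diagnoses and timings below are mock data for illustration.
import statistics
from sklearn.metrics import cohen_kappa_score

assisted  = ["glioma", "glioma", "abscess", "glioma"]    # reader + LLM
reference = ["glioma", "glioma", "abscess", "lymphoma"]  # ground truth

# Chance-corrected diagnostic agreement with the reference standard.
kappa = cohen_kappa_score(assisted, reference)

# Time to diagnosis, unassisted vs. LLM-assisted (seconds, mock values).
t_unassisted = [210, 185, 240, 200]
t_assisted = [150, 160, 170, 155]

print(f"kappa = {kappa:.2f}")
print(f"median time: {statistics.median(t_unassisted):.0f}s unassisted "
      f"vs. {statistics.median(t_assisted):.0f}s assisted")
```

Clinically relevant error types (e.g., a missed tumor versus a mislabeled benign finding) would then be tallied separately, since they are not captured by a single accuracy number.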

What One Study Revealed About LLMs and Decision Support in Oncology

The final presentation of the session evaluated LLMs in a high-stakes setting: drafting National Comprehensive Cancer Network (NCCN)-aligned multidisciplinary team tumor board management plans for pancreatic ductal adenocarcinoma (PDAC).4

The goal of this research was to compare the performance of closed-source models such as GPT-4o with that of open-weight models such as DeepSeek-V3. The researchers found that the closed-source model sometimes declined to provide a response: while DeepSeek-V3 completed all cases, GPT-4o fell slightly short of full completion. Discordance rates were also significantly higher for GPT-4o than for DeepSeek-V3.

From my perspective, the most important lesson came from the error analysis. The authors showed that GPT-4o tended toward overtreatment recommendations, such as suggesting surgery for unresectable disease, while DeepSeek-V3's errors followed more conservative miss patterns, which might be easier to catch with human guardrails.

Overall, the researchers proposed local or deployable models as feasible copilots for time-intensive tumor board workflows, provided they are deployed with explicit guidelines, rationale, and clinical review.

Where the Field Seems to Be Headed

As we look at all these studies and findings together, they suggest what the next phase of radiology LLM adoption may look like. LLM adoption in radiology is entering a more realistic phase with fewer grand claims about capabilities and more specific, testable workflow improvements. A reasonable near-term path may encompass the following tenets.

• Data and documentation quality matter. LLM performance is not purely a model property but also reflects institutional reporting practices and category-specific ambiguity.

• Smaller local models are emerging as contenders for focused tasks that prioritize privacy, cost control, and maintainability.

• Human–AI interaction design may be the real differentiator. Early studies suggest the marginal impact of LLMs in interpretive workflows may vary by experience level.

• Safety-first evaluation is non-negotiable. For LLM applications in the clinic, especially with management recommendations in oncology, guardrails, error analysis, and transparent reasoning are critical.

Final Notes

The central question is no longer whether language models can be useful in radiology. The more urgent questions are where LLMs should be used, under what constraints, and how to engineer workflows that harness their benefits without introducing new risks.

Mr. Shanker is an AI and medical imaging researcher. He is a co-founder of Rad-Lab.ai, a startup company focusing on clinical AI solutions for imaging.

References

  1. Moassefi M, Houshmand S, Faghani S, et al. Engineering prompts, extracting diagnoses: a multi-institutional assessment of LLMs in radiology. Presented at the Radiological Society of North America (RSNA) annual meeting, November 30-December 4, 2025, Chicago.
  2. Zheng G, Kamel P, Jacobs MA, Braverman V, Parekh V. One SLM is all you need: adaptive, privacy-preserving small language models for multi-task clinical assistance. Presented at the Radiological Society of North America (RSNA) annual meeting, November 30-December 4, 2025, Chicago.
  3. Schramm S, Le Guellec B, Ziegelmayer S, et al. The expertise paradox: who benefits from LLM-assisted brain MRI differential diagnosis? Presented at the Radiological Society of North America (RSNA) annual meeting, November 30-December 4, 2025, Chicago.
  4. Jajodia A, Gupta K, Latinovich MF, Patlas MN, Elbanna KY. Optimizing clinical decision-making in pancreatic cancer: the role of GPT-4o and DeepSeek V3 large language models. Presented at the Radiological Society of North America (RSNA) annual meeting, November 30-December 4, 2025, Chicago.
