March 24, 2026

Can Radiologists Identify Fake Radiographs Created with AI?

Author(s): Jeff Hall

Only 41 percent of radiologists spotted radiographs that were generated by the large language model GPT-4o, according to newly published research.

New multicenter research suggests that many radiologists may not be able to identify synthetic radiographs created by large language models (LLMs) such as GPT-4o.

For the three-phase study, recently published in Radiology, researchers assessed the ability of 17 radiologists and multimodal LLMs to differentiate between authentic radiographs and synthetic radiographs created by GPT-4o (OpenAI).

Initially blinded to the purpose of the study, only seven of the 17 radiologists (41 percent) were able to spot the AI-created radiographs, according to the study authors.

After the radiologists were informed about the presence of AI-generated radiographs, the researchers found that they differentiated between authentic and synthetic radiographs with accuracy rates of only 74.8 percent and 70 percent, respectively, in two separate datasets that contained AI-generated radiographs.

“These findings underscore the need for clinician training and dedicated tools to mitigate the risks of deepfake radiographs. Misuse could include fraudulent images used in insurance claims, litigation, or Munchausen syndrome, and the fact that one in five AI-generated radiographs escaped expert detection in our study highlights this vulnerability. As LLM image synthesis advances, safeguards will become increasingly critical,” noted lead study author Michael Tordjman, M.D., M.S., who is affiliated with the Department of Diagnostic, Molecular and Interventional Radiology at Mount Sinai Hospital in New York, N.Y.

The study authors also evaluated the ability of LLMs to differentiate between authentic and synthetic radiographs. In the dataset that included GPT-4o-generated images, GPT-4o and GPT-5 provided higher accuracy rates at 85.1 percent and 82.5 percent, respectively, in comparison to Gemini 2.5 Pro (56.5 percent) and Llama 4 Maverick (59.1 percent). However, in a separate dataset with RoentGen-created radiographs, the GPT-4o accuracy rate declined to 75.5 percent, according to the study.

Three Key Takeaways

• Radiologists have limited ability to detect AI-generated radiographs. When blinded to the study's purpose, only 41 percent of radiologists identified synthetic images, and even after they were told that AI-generated images were present, accuracy remained modest (~70–75 percent), highlighting a meaningful vulnerability in routine image interpretation.

• AI models are imperfect at detecting synthetic radiographs. While leading multimodal LLMs achieved higher detection accuracy, none reliably identified all synthetic images. Researchers also noted that the performance of GPT-4o dropped when evaluating images generated by other systems, underscoring the lack of generalizability.

• Emerging clinical and medicolegal risk necessitates safeguards. Subtle artifacts (e.g., overly smooth bones with uniform cortical thickness, uniform noise, unnatural soft tissue texture) may provide clues, but inconsistency in detection reinforces the need for radiologist training, validation tools, and workflow safeguards to mitigate risks such as fraud, misdiagnosis, and misuse of synthetic imaging.

“LLMs likewise also showed imperfect performance in distinguishing AI-generated versus real images. Even GPT-4o, which was used to generate the synthetic images, failed to reliably recognize its own outputs. None of the tested LLMs detected all synthetic radiographs,” added Tordjman and colleagues.

The reviewing radiologists noted the most common distinguishing features in the synthetic radiographs included overly smooth bones with uniform cortical thickness (47 percent of the cases); uniform grain or noise (35 percent of the cases); and subtly unnatural soft tissue texture (30 percent of the cases).

(Editor’s note: For related content, see “Generative Vision Language Model for Chest X-Ray Gets FDA’s Breakthrough Device Designation,” “FDA Clears Emerging AI-Powered CXR Software from Qure.ai” and “The Inflection Point for AI in Radiology: Emerging Insights for 2026.”)

In regard to study limitations, the authors conceded the evolving nature of LLMs such as GPT-4o and that future updates may address artifact patterns noted on the LLM-generated images in the current study. The researchers also acknowledged that GPT-4o was utilized to both generate and detect synthetic radiographs in the first dataset.

