In what may currently be the largest study comparing artificial intelligence (AI) software to radiologist interpretation for digital mammography (DM) and digital breast tomosynthesis (DBT), AI offered comparable negative predictive value (NPV) but significantly higher recall rates and more false-positive results.
For the retrospective study, recently published in the American Journal of Roentgenology, researchers assessed the performance of AI software (Transpara v1.7.1, ScreenPoint Medical) in comparison to radiologist evaluation for 26,693 DM screening exams and 4,824 DBT screening exams. The study authors evaluated two diagnostic thresholds for the AI software: an elevated risk threshold and an intermediate/elevated risk threshold.
The researchers found that the AI software at both thresholds offered comparable NPV for DM (99.8 percent at the elevated risk threshold and 99.9 percent at the intermediate/elevated risk threshold) versus radiologist interpretation (99.9 percent). There were similar results in the DBT cohort, with both AI thresholds having a 99.8 percent NPV in comparison to 99.9 percent for radiologist assessment, according to the study authors.
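For context, NPV reflects the proportion of exams classified as negative that are truly cancer-free: NPV = TN / (TN + FN), where TN is the number of true negatives and FN the number of false negatives. A 99.9 percent NPV therefore corresponds to roughly one missed cancer per 1,000 exams called negative.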
“The high NPV of AI, along with the high proportions of mammograms that AI classified as low risk (58.2% for DM, 68.1% for DBT), suggests utility of AI to allow radiologists to streamline their interpretation of negative examinations while prioritizing more complex cases. Such an approach could substantially improve workflow efficiency, reduce interpretation fatigue, and better allocate healthcare resources,” wrote lead study author Iris E. Chen, M.D., who is affiliated with the Department of Radiology at the University of California, Los Angeles (UCLA), and colleagues.
The study authors pointed out that the cancers missed by AI were commonly smaller and did not have microscopic nodal disease.
“Overall, given the characteristics of the small number of AI-missed cancers, radiologists can likely trust low-risk AI results during their interpretations without risking patient harm in the context of an annual screening program,” added Chen and colleagues.
However, while AI at the intermediate/elevated risk threshold offered the highest sensitivity for DM (94 percent) and DBT (89.2 percent), radiologist interpretation provided significantly higher specificity (93.3 percent for DM and 93.7 percent for DBT) than both AI thresholds.
Recall rates for radiologist assessment were significantly lower for DM and DBT than for AI evaluation, according to the study authors. For DM, the recall rate for radiologist reading was 7.2 percent in comparison to 14 percent for AI at the elevated risk threshold and 41.8 percent for AI at the intermediate/elevated risk threshold.
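Applied to the 26,693 DM exams in the study, those rates translate roughly as follows (a back-of-the-envelope estimate based on the reported percentages, not figures reported by the study authors):

- Radiologist reading: 0.072 × 26,693 ≈ 1,922 recalls
- AI at the elevated risk threshold: 0.14 × 26,693 ≈ 3,737 recalls
- AI at the intermediate/elevated risk threshold: 0.418 × 26,693 ≈ 11,158 recalls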
Three Key Takeaways
- High negative predictive value (NPV). AI software demonstrated comparable NPV to radiologists (approximately 99.8 to 99.9 percent) for both digital mammography (DM) and digital breast tomosynthesis (DBT), supporting potential utility in safely triaging low-risk exams.
- Workflow efficiency potential. With over half of exams classified as low risk by AI (58.2 percent DM, 68.1 percent DBT), AI could help streamline interpretation, reduce radiologist fatigue, and optimize resource allocation without compromising patient safety in annual screening settings.
- Higher false positives and recalls. Despite high sensitivity at the intermediate/elevated risk threshold, AI had significantly higher recall rates and more false positives than radiologists, raising concerns about unnecessary recalls and automation bias if radiologists over-rely on AI output.
The researchers also noted more than double the number of false-positive results for DM with AI at the intermediate/elevated risk threshold in comparison to the elevated risk threshold (7,365 vs. 3,625).
“ … Recall rates would increase markedly if interpreting radiologists were to feel compelled to recall all examinations flagged as intermediate risk by AI. Even if radiologists were to be selective regarding the reporting of intermediate-risk AI results, automation bias — the tendency for humans to defer to computers over their own expertise — could contribute to many potentially unnecessary recalls,” pointed out Chen and colleagues.
(Editor’s note: For related content, see “Can AI Assessment of Microcalcifications on Mammography Improve Differentiation of DCIS and Invasive Ductal Carcinoma?,” “Large Mammography Study Affirms Value of AI in Breast Cancer Detection” and “Reducing Mammography Workload by Nearly 40 Percent? What a New Hybrid AI Study Reveals.”)
Beyond the inherent limitations of a single-center retrospective study, the authors acknowledged the absence of an assessment of AI as an adjunct to radiologist interpretation, the lack of lesion-level evaluation, and the use of a single mammography vendor for all exams.