The challenge of accurately interpreting mammograms represents a critical need in breast cancer screening, and researchers are now investigating whether large language models can assist with this complex task. Qiang Li, Shansong Wang, and Mingzhe Hu, all from the Winship Cancer Institute at Emory University School of Medicine, alongside their colleagues, systematically evaluated the latest iteration of OpenAI’s language model, GPT-5, against its predecessor, GPT-4o, and human experts in assessing mammographic images. Their work demonstrates that GPT-5 achieves the highest performance among the GPT family across multiple public datasets, showing promising results in tasks such as identifying breast density, detecting abnormalities, and classifying malignancy. However, the team’s findings also reveal that, despite significant improvements over GPT-4o, GPT-5 still falls short of human expert accuracy, highlighting the need for further refinement before these models can reliably support high-stakes clinical decision-making in mammography.
GPT-5 Benchmarked for Mammography Analysis
The research investigates how large language models, specifically GPT-5, can be applied to medical image analysis, focusing on four public mammography datasets: EMBED, InBreast, CMMD, and CBIS-DDSM. The evaluated tasks, which include BI-RADS density assessment, abnormality detection, and malignancy classification, are critical components of breast cancer diagnosis and screening. On the EMBED dataset, GPT-5 achieves 56.8% accuracy in density assessment, 52.5% in distortion analysis, 64.5% in mass classification, 63.5% in calcification classification, and 52.8% in malignancy classification. Across the four datasets, GPT-5 achieved BI-RADS accuracy ranging from 36.9% to 69.3%, abnormality detection rates from 32.3% to 66.0%, and malignancy classification accuracy from 35.0% to 58.2%. While these results represent a significant improvement over previous models, GPT-5 still lags behind human experts, achieving a sensitivity of 63.5% and a specificity of 52.3%.
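The sensitivity and specificity figures quoted above are standard binary-classification metrics computed from per-image predictions against ground-truth malignancy labels. A minimal sketch of how they are derived (the function name and label encoding are illustrative, not taken from the paper's code):

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute sensitivity (true-positive rate) and specificity
    (true-negative rate) for binary labels, where 1 = malignant."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

Sensitivity measures how many truly malignant cases the model catches, while specificity measures how many benign cases it correctly clears; the gap to expert readers on both axes is what motivates the paper's caution about clinical deployment.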
Visual Question Answering for Mammogram Interpretation
Researchers employed a novel approach to assess the capabilities of large language models, specifically GPT-5, in interpreting mammograms and assisting with breast cancer screening. Rather than asking the model to freely describe images, they constructed a visual question answering (VQA) system based on structured data from four publicly available mammography datasets: EMBED, InBreast, CMMD, and CBIS-DDSM. This method transforms image interpretation into a task of answering specific, clinically relevant questions, such as determining BI-RADS density or identifying lesion types. To achieve this, the researchers harmonized the metadata and annotations from each dataset into a standardized format, enabling the automatic generation of questions linked to definitive clinical labels. This process ensured a direct correspondence between each question, its correct answer, and established medical findings, minimizing ambiguity and enhancing the reproducibility of the evaluation.
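The pipeline described above, turning harmonized metadata into closed-ended questions with definitive answers, can be sketched as follows. The record schema, field names, and question templates here are hypothetical stand-ins, assumed only to illustrate the general pattern rather than the paper's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical harmonized record after merging the four datasets'
# metadata into one schema; field names are illustrative.
@dataclass
class MammoRecord:
    image_id: str
    density: str      # BI-RADS density category: "A".."D"
    finding: str      # e.g. "mass", "calcification", "none"
    malignant: bool

# Each clinical task maps to a closed-ended question template plus a
# function that extracts the ground-truth answer from the record,
# guaranteeing a definitive label for every generated item.
TEMPLATES = {
    "density": (
        "What is the BI-RADS breast density category (A, B, C, or D)?",
        lambda r: r.density,
    ),
    "abnormality": (
        "What type of abnormality, if any, is present "
        "(mass, calcification, or none)?",
        lambda r: r.finding,
    ),
    "malignancy": (
        "Is the finding likely malignant? Answer yes or no.",
        lambda r: "yes" if r.malignant else "no",
    ),
}

def generate_vqa_items(record):
    """Expand one harmonized record into (image_id, question, answer) triples."""
    return [
        (record.image_id, question, label_fn(record))
        for question, label_fn in TEMPLATES.values()
    ]
```

Because every answer is read directly from verified annotations rather than free-text descriptions, model responses can be scored automatically and the benchmark remains reproducible across model versions.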
GPT-5 Leads GPT Family on Mammogram Interpretation Tasks
Researchers have been investigating the potential of large language models, specifically the GPT-5 family, to assist in the interpretation of mammograms, a crucial step in breast cancer screening. The results demonstrate that GPT-5 consistently outperformed earlier GPT versions, achieving the highest scores among the models tested on tasks such as identifying masses and calcifications, and classifying malignancy. Despite not yet matching expert performance, the substantial gains observed when comparing GPT-4o to GPT-5 suggest a promising trajectory for the application of general language models in mammography. The research highlights the potential for these models to assist radiologists, though further development and targeted training are necessary before they can be reliably used in high-stakes clinical settings.
GPT-5 Performance on Mammography Image Analysis
This study evaluated the performance of large language models, specifically the GPT-5 family and GPT-4o, on clinically relevant tasks involving mammography images. Results demonstrate that GPT-5 consistently outperformed GPT-4o, achieving the highest scores among the GPT variants in classifying density, distortion, masses, calcifications, and malignancy. Despite falling short of expert-level accuracy, the significant performance improvement from GPT-4o to GPT-5 suggests that generalist multimodal large language models have the potential to assist with mammography visual question answering and ultimately improve patient care.
👉 More information
🗞 Is ChatGPT-5 Ready for Mammogram VQA?
🧠 ArXiv: https://arxiv.org/abs/2508.11628
