LLM Benchmark Curated from 13,735 Chest Radiographs Advances Cardiothoracic Disease Detection

Researchers have created a rigorously validated benchmark dataset of chest radiographs to accelerate the development of reliable, clinically useful large language models (LLMs) for cardiothoracic disease. Led by Yishu Wei from Weill Cornell Medicine, Adam E. Flanders from Thomas Jefferson University, and Errol Colak from St. Michael’s Hospital/Unity Health Toronto, among others, the new resource, dubbed REVEAL-CXR, comprises 200 chest radiographic studies meticulously labelled with 12 key benchmark findings. This work addresses a critical need for high-quality, expert-curated datasets to move beyond LLM performance on simple exams and towards real-world diagnostic application, using an innovative AI-assisted labelling procedure to enhance radiologist efficiency and ensure comprehensive, accurate annotations. The publicly available dataset, hosted by RSNA, promises to standardise evaluation and drive progress in artificial intelligence for chest imaging, ultimately improving patient care.

REVEAL-CXR dataset for robust AI evaluation

Scientists have unveiled a new benchmark dataset designed to rigorously evaluate the performance of artificial intelligence in medical image analysis, specifically for chest radiography. The research, led by a collaborative team of seventeen radiologists from ten institutions across four countries, addresses a critical gap in the field: the lack of high-quality, expertly curated datasets needed to assess the clinical utility of large language models (LLMs). This work establishes the REVEAL-CXR dataset, comprising 200 chest radiographic studies meticulously verified by multiple radiologists, and introduces an innovative AI-assisted labelling procedure to enhance efficiency and minimise omissions. The team began with a substantial collection of 13,735 deidentified chest radiographs and corresponding reports sourced from the Medical Imaging and Data Resource Center (MIDRC).
GPT-4o was employed to extract abnormal findings directly from these reports, which were then mapped to a defined set of 12 benchmark labels using a locally hosted LLM, Phi-4-Reasoning. This initial AI-driven step streamlined labelling and identified 1,000 studies for detailed expert review, carefully selected to ensure clinical relevance and a diverse range of difficulty levels. The sampling algorithm prioritised studies with complex presentations, ensuring the benchmark captures challenging cases crucial for robust model evaluation. Seventeen chest radiologists then assessed each radiograph, marking their agreement with the AI-suggested labels as “Agree all”, “Agree mostly”, or “Disagree”.
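
To make the first step of this pipeline concrete, the sketch below shows one way free-text findings could be pulled from a report with GPT-4o via the OpenAI Python client; the prompt wording and the line-per-finding output format are illustrative assumptions rather than the workgroup’s actual implementation.

```python
# Minimal sketch (assumption): extracting abnormal findings from a chest radiograph
# report with GPT-4o. Prompt and output format are invented for illustration and are
# not the authors' exact pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def extract_findings(report_text: str) -> list[str]:
    """Ask GPT-4o to list every abnormal finding mentioned in the report."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": ("You are a radiology assistant. List every abnormal finding "
                         "mentioned in the report, one per line. If there are none, "
                         "reply with the single word 'none'.")},
            {"role": "user", "content": report_text},
        ],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    if text.lower() == "none":
        return []
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]


# Example call (hypothetical report text):
# extract_findings("Enlarged cardiac silhouette. Small right pleural effusion. No pneumothorax.")
```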

Each image underwent evaluation by three independent experts, with a final label assigned only if at least two radiologists selected “Agree all”. This rigorous validation process yielded a refined set of 381 radiographs, from which the final 200, comprising 100 released for immediate use and 100 reserved as a holdout dataset for independent evaluation by the Radiological Society of North America (RSNA), were selected, prioritising those with less common or multiple findings. Experiments demonstrated strong inter-rater reliability, with a Cohen’s κ of 0.622 (95% CI 0.590, 0.651) across all agreement categories. Most conditions exhibited even higher agreement, with Cohen’s κ values exceeding 0.75 (ranging from 0.744 to 0.809), except for airspace opacity (κ 0.484; 95% CI 0.440, 0.524). The publicly available REVEAL-CXR dataset, accessible at https://imaging.rsna.org, promises to accelerate the development and validation of clinically useful LLM tools for cardiothoracic disease, paving the way for improved diagnostic accuracy and patient care.
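
For readers unfamiliar with the agreement statistic quoted above, the short sketch below computes Cohen’s κ between two raters’ category choices using scikit-learn; the rating vectors are invented purely for illustration.

```python
# Minimal sketch: Cohen's kappa between two raters over the three agreement categories.
# The example ratings below are made up for illustration only.
from sklearn.metrics import cohen_kappa_score

rater_a = ["Agree all", "Agree mostly", "Agree all", "Disagree", "Agree all"]
rater_b = ["Agree all", "Agree all",    "Agree all", "Disagree", "Agree mostly"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.3f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```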

Radiology Dataset Creation Using LLM Labelling

Scientists developed a novel benchmark dataset and AI-assisted labelling procedure to rigorously evaluate multimodal large language models in radiology. The research team harnessed a total of 13,735 deidentified chest radiographs and corresponding reports sourced from the Medical Imaging and Data Resource Center (MIDRC) to construct this valuable resource. Initially, the team employed GPT-4o to automatically extract abnormal findings directly from the radiology reports, effectively pre-processing the data for subsequent analysis and annotation. These extracted findings were then mapped onto a defined set of 12 benchmark labels using a locally hosted large language model, Phi-4-Reasoning, streamlining the initial labelling process.
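
As a rough illustration of this mapping step, the sketch below assumes Phi-4-Reasoning is served behind a local OpenAI-compatible endpoint (for example via vLLM); the endpoint URL, model identifier, prompt, and the single label shown are placeholders rather than details taken from the paper.

```python
# Minimal sketch (assumption): mapping free-text findings onto a fixed benchmark label
# set with a locally hosted LLM exposed through an OpenAI-compatible API. URL, model
# name, prompt, and label list are placeholders, not the authors' configuration.
from openai import OpenAI

BENCHMARK_LABELS = ["airspace opacity"]  # plus the remaining 11 benchmark labels (not enumerated here)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # local server


def map_to_benchmark_labels(findings: list[str]) -> list[str]:
    """Ask the local model which benchmark labels the extracted findings correspond to."""
    prompt = (
        "Benchmark labels: " + "; ".join(BENCHMARK_LABELS) + "\n"
        "Report findings: " + "; ".join(findings) + "\n"
        "Return the matching benchmark labels as a comma-separated list, or 'none'."
    )
    response = client.chat.completions.create(
        model="phi-4-reasoning",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    if text.lower() == "none":
        return []
    return [label.strip() for label in text.split(",")]
```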

Researchers then sampled 1,000 studies from the larger dataset, guided by the AI-suggested benchmark labels, ensuring clinical relevance and a diverse range of difficulty levels were represented. Seventeen chest radiologists participated in the expert review process, assessing the correctness of the LLM-generated labels by selecting “Agree all”, “Agree mostly”, or “Disagree” for each case. Each chest radiograph underwent evaluation by three independent experts, bolstering the reliability and validity of the final annotations. This meticulous approach resulted in 381 radiographs receiving “Agree all” from at least two radiologists, forming the foundation for the curated benchmark.
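
The consensus rule described above is straightforward to express in code. The sketch below assumes each study’s three ratings arrive as a simple list of strings, which is an assumption about data layout rather than a description of the authors’ tooling.

```python
# Minimal sketch: keep a study only if at least two of its three reviewers chose "Agree all".
# The input layout (a dict mapping study IDs to three rating strings) is assumed.
def reaches_consensus(ratings: list[str]) -> bool:
    """True if at least 2 of the 3 radiologists fully agreed with the AI-suggested labels."""
    return sum(r.strip().lower() == "agree all" for r in ratings) >= 2


studies = {
    "study_001": ["Agree all", "Agree all", "Agree mostly"],
    "study_002": ["Agree all", "Disagree", "Agree mostly"],
}
consensus_set = [sid for sid, ratings in studies.items() if reaches_consensus(ratings)]
print(consensus_set)  # -> ['study_001']
```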

Subsequently, the study prioritised 200 radiographs, specifically those exhibiting less common or multiple findings, and divided them into a 100-study released dataset and a 100-study holdout dataset for independent model evaluation by the Radiological Society of North America (RSNA). This methodology enabled the creation of a publicly available benchmark, accessible at https://imaging.rsna.org, with each radiograph rigorously verified by three radiologists. Furthermore, the team pioneered an AI-assisted labelling procedure designed to enhance radiologist efficiency, minimise omissions, and facilitate a semi-collaborative labelling environment. The work achieved a Cohen’s κ of 0.622 (95% CI 0.590, 0.651) for agreement categories among the experts, demonstrating substantial inter-rater reliability. At the individual abnormality level, most conditions exhibited a Cohen’s κ above 0.75 (ranging from 0.744 to 0.809), with the exception of airspace opacity (κ 0.484; 95% CI 0.440, 0.524). This carefully constructed benchmark, comprising 200 chest radiographic studies each with one to four labels, provides a robust platform for assessing the performance of AI models in real-world clinical scenarios and is expected to accelerate progress in the field.
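
One plausible way to implement the prioritisation and the 100/100 split is sketched below; the ranking heuristic (rarest finding first, then number of findings) is an illustrative guess and not the selection algorithm reported by the authors.

```python
# Minimal sketch (assumption): rank consensus studies so that rare findings and
# multi-finding cases come first, then split the top 200 into released and holdout halves.
import random
from collections import Counter


def select_and_split(study_labels: dict[str, list[str]], n_total: int = 200, seed: int = 0):
    """Return (released, holdout) study-ID lists of equal size."""
    label_counts = Counter(l for labels in study_labels.values() for l in labels)

    def priority(sid: str):
        labels = study_labels[sid]
        rarity = min(label_counts[l] for l in labels)  # studies with rarer findings rank higher
        return (rarity, -len(labels))                  # then prefer studies with more findings

    ranked = sorted(study_labels, key=priority)[:n_total]
    random.Random(seed).shuffle(ranked)                # randomise assignment to the two halves
    return ranked[: n_total // 2], ranked[n_total // 2:]
```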

REVEAL-CXR benchmark assesses AI abnormality detection in chest radiographs

Scientists have developed a new benchmark of 200 chest radiographic studies, each verified by three radiologists, to rigorously evaluate artificial intelligence (AI) models in clinical settings. The research, conducted by the RSNA AI Committee Large Language Model (LLM) Workgroup, addresses a critical need for high-quality, expert-curated datasets for assessing AI performance in abnormality detection on medical images. A total of 13,735 deidentified chest radiographs and corresponding reports from the MIDRC were utilised in the study, forming the foundation for the REVEAL-CXR benchmark. Experiments revealed that GPT-4o successfully extracted abnormal findings from the radiology reports, which were then mapped to 12 predefined benchmark labels using a locally hosted LLM, Phi-4-Reasoning.

From an initial pool of 1,000 studies sampled based on these AI-suggested labels, 381 radiographs achieved consensus, with at least two of three radiologists selecting “Agree all” regarding the correctness of the LLM-suggested labels. The team prioritised studies with less common or multiple finding labels, ultimately creating a released dataset of 100 radiographs and a reserved holdout dataset of 100 radiographs for independent evaluation by RSNA. Measurements showed a Cohen’s κ of 0.622 (95% CI 0.590, 0.651) for agreement among the expert radiologists across all categories. Detailed analysis at the individual abnormality level revealed that most conditions achieved a Cohen’s κ above 0.75, ranging from 0.744 to 0.809, demonstrating strong inter-rater reliability.

Airspace opacity was an exception, with a κ of 0.484 (95% CI 0.440, 0.524). The resulting benchmark comprises studies each containing one to four benchmark labels, offering a complex and realistic assessment platform for AI models. The work also delivers an AI-assisted labelling procedure designed to enhance radiologist efficiency, minimise omissions, and facilitate a collaborative environment. This procedure enabled the curation of a dataset that focuses on rare conditions and complex cases involving multiple diseases, providing particularly informative data points for model evaluation. The publicly available dataset, accessible at https://imaging.rsna.org, promises to accelerate the development and validation of clinically useful AI tools for chest radiography, ultimately improving patient care.
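
To show how such a multi-label benchmark might be consumed, the sketch below scores a model’s predicted labels against the one to four ground-truth labels per study; the label names and in-memory data structures are illustrative assumptions, not the benchmark’s distribution format.

```python
# Minimal sketch: per-study precision and recall of predicted labels against ground truth.
# Label names and data below are illustrative, not drawn from the released dataset.
def score_study(predicted: set[str], truth: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one study's predicted label set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


ground_truth = {"study_001": {"airspace opacity", "pleural effusion"}}
predictions = {"study_001": {"airspace opacity"}}

for sid, truth in ground_truth.items():
    p, r = score_study(predictions.get(sid, set()), truth)
    print(sid, f"precision={p:.2f}", f"recall={r:.2f}")
```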

RSNA LLM Dataset for AI Validation

Scientists have developed a new benchmark dataset of 200 chest radiographic studies, verified by multiple radiologists, to rigorously evaluate artificial intelligence (AI) models in clinical settings. This research addresses a critical gap in the field, as existing datasets often lack expert curation, comprehensive image information, or sufficient radiologist involvement, hindering accurate assessment of AI performance beyond board-style exams. The team created a dataset with 12 benchmark labels, utilising a novel AI-assisted labelling procedure to enhance radiologist efficiency and minimise omissions during the annotation process. The significance of this work lies in providing a high-quality, publicly available resource, the RSNA LLM Benchmark Dataset: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR), for the development and validation of clinically useful large language models (LLMs) in radiology.

The dataset’s focus on both image interpretation and report analysis, coupled with input from 17 international radiologists, aims to reduce bias and improve the representativeness of the labelled data. Inter-rater reliability, measured by Cohen’s κ, demonstrated substantial agreement among experts, particularly for most radiographic abnormalities, indicating the robustness of the labelling process. The authors acknowledge that the sample size of 200 studies is a limitation, although they maintain it is clinically relevant and captures a range of difficulty levels. Future research directions include expanding the dataset to encompass a wider range of imaging modalities and abnormalities, as well as exploring the potential of the AI-assisted labelling procedure to further streamline the annotation process and facilitate large-scale data curation. This benchmark represents a valuable step towards ensuring that AI tools in radiology are thoroughly evaluated and ultimately contribute to improved patient care.

👉 More information
🗞 RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)
🧠 ArXiv: https://arxiv.org/abs/2601.15129

Rohail T.


I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
