Large Language Models Achieve Precision with MIMIC-III’s Four Tables and Clinical Notes

Researchers are increasingly exploring how Large Language Models (LLMs) can unlock the wealth of information hidden within Electronic Health Records (EHRs). Juan Jose Rubio Jan of King’s College London and Jack Wu and Julia Ive of University College London demonstrate a novel approach to both querying structured data and extracting insights from unstructured clinical text using LLMs and Retrieval-Augmented Generation (RAG) pipelines. The work is significant because it tackles two core challenges in clinical data science (accessing and interpreting complex medical information) and proposes an automated evaluation framework to rigorously test LLM performance on a curated subset of the MIMIC-III database. Their findings reveal the considerable potential of LLMs to enhance precision and accuracy in clinical workflows, paving the way for more effective data-driven healthcare decisions.

The research team achieved precise querying of large structured datasets using programming tools such as Python with the Pandas library, alongside reliable extraction of semantic information from free-text health records via a Retrieval-Augmented Generation (RAG) pipeline, a combination previously challenging to implement with consistent accuracy. This breakthrough reveals the potential for LLMs to transform clinical workflows by automating complex data analysis and knowledge discovery from patient records. To rigorously test these capabilities, the researchers developed a flexible evaluation framework that automatically generates synthetic question and answer pairs tailored to the unique characteristics of each dataset and task.
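
As a concrete illustration of the text-to-code querying idea, the sketch below assumes an OpenAI-compatible chat API and a generic Pandas DataFrame; the model name, prompt wording, and the `answer_with_pandas` helper are illustrative assumptions rather than the authors’ exact pipeline.

```python
# Minimal sketch of the "text-to-Pandas" querying step; model name, prompt wording,
# and the eval-based execution are illustrative assumptions, not the paper's pipeline.
import pandas as pd
from openai import OpenAI

client = OpenAI()

def answer_with_pandas(question: str, df: pd.DataFrame):
    """Ask an LLM for a Pandas expression answering `question` over `df`, then run it."""
    prompt = (
        f"You are given a Pandas DataFrame named `df` with columns: {list(df.columns)}.\n"
        f"Write a single Python expression that answers: {question}\n"
        "Return only the expression, with no explanation."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the study evaluates both local and API-based LLMs
        messages=[{"role": "user", "content": prompt}],
    )
    code = response.choices[0].message.content.strip()
    # eval() keeps the sketch short; production code should sandbox or at least
    # validate generated code before executing it.
    return eval(code, {"df": df, "pd": pd})
```

Logging the generated expression alongside the question and the computed answer is what makes automated, exact-match evaluation against synthetic question and answer pairs straightforward.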

This innovative approach bypasses the limitations of relying solely on existing annotated data, which is often scarce in the medical domain, and allows for comprehensive assessment of LLM performance across diverse clinical scenarios. Experiments were conducted on a curated subset of the MIMIC-III database, encompassing four structured tables and one clinical note type, utilising both locally hosted and API-based LLMs to ensure broad applicability and scalability. The evaluation process combined exact-match metrics, semantic similarity assessments, and, crucially, human judgment to validate the clinical relevance and accuracy of the LLM-generated responses. The study establishes that LLMs can effectively interact with complex structured clinical datasets, performing precise queries and analytics with a high degree of accuracy.
Findings demonstrate the ability of these models to translate natural language questions into executable code, enabling efficient data retrieval and analysis, a capability vital for clinical decision support and research. Furthermore, the research shows that, when coupled with RAG, LLMs significantly improve the accuracy of information extraction from unstructured clinical notes, mitigating the risk of “hallucinations” and ensuring the reliability of extracted insights. This is particularly important in healthcare, where factual correctness is paramount for patient safety and research integrity. This work opens new avenues for leveraging LLMs to unlock the full potential of EHR data, facilitating more informed clinical decision-making and accelerating medical research.
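
To make the automated part of that evaluation concrete, the snippet below sketches how exact-match and embedding-based semantic-similarity checks might be combined, assuming the sentence-transformers library is available; the embedding model and helper functions are assumptions rather than the paper’s reported configuration, and the human-judgment component naturally sits outside any such script.

```python
# Hedged sketch of automated answer scoring: exact match plus semantic similarity.
# The embedding model choice here is an assumption, not the paper's documented setup.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def exact_match(predicted: str, reference: str) -> bool:
    """Case- and whitespace-insensitive string equality."""
    return predicted.strip().lower() == reference.strip().lower()

def semantic_similarity(predicted: str, reference: str) -> float:
    """Cosine similarity between sentence embeddings of the two answers."""
    emb = embedder.encode([predicted, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

# An exact-match miss can still be semantically correct, which is why both scores
# (plus human review) are reported together.
print(exact_match("45 years", "45"))
print(semantic_similarity("The patient was intubated and sedated.",
                          "Patient intubated; sedated at the time."))
```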

The team’s evaluation framework, which generates synthetic test cases, is a generalisable method applicable to any healthcare use case, providing a robust means of assessing LLM performance and ensuring clinical validity. By demonstrating the feasibility of both precise querying and accurate information extraction, this research illustrates the readiness of current models for real-world healthcare applications, while also highlighting the practical complexities of deployment and the need for ongoing refinement.

The study pioneered a method for creating domain-specific evaluation datasets by synthetically generating questions with known answers for both code generation and information extraction tasks, mirroring the approach of Ding et al. (2025) with NICE guidelines. Researchers focused on four structured tables (Patients, Prescriptions, Diagnoses, and D_ICD_Diagnoses) alongside the NoteEvents table for unstructured data, creating a focused dataset for prototype development. To manage computational constraints, the analysis was restricted to 101 randomly selected patients from the Patients table, yielding 274,022 records with 23 features after merging the tables on the SUBJECT_ID and ICD9_CODE join keys. To address date anonymisation within MIMIC-III, the team synthetically generated a DOB_Demo field using the Faker and datetime libraries, producing realistic dates of birth for analytical purposes.
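
A minimal sketch of this dataset assembly, assuming the standard MIMIC-III CSV exports; the file names, random seed, and the Faker-based date generation are illustrative and may differ from the authors’ exact preprocessing.

```python
# Sketch of the curated-subset construction: sample 101 patients, merge four tables,
# and add a synthetic DOB_Demo column. File names follow MIMIC-III conventions but
# are assumptions here, as is the age range passed to Faker.
import pandas as pd
from faker import Faker

fake = Faker()

patients = pd.read_csv("PATIENTS.csv")
prescriptions = pd.read_csv("PRESCRIPTIONS.csv")
diagnoses = pd.read_csv("DIAGNOSES_ICD.csv")
d_icd = pd.read_csv("D_ICD_DIAGNOSES.csv")

# Restrict to a random sample of 101 patients to keep the prototype tractable.
sample = patients.sample(n=101, random_state=42)

# The joins are one-to-many, which is why 101 patients expand to ~274,022 rows.
merged = (
    sample
    .merge(prescriptions, on="SUBJECT_ID", how="left")
    .merge(diagnoses, on="SUBJECT_ID", how="left")
    .merge(d_icd, on="ICD9_CODE", how="left")
)

# MIMIC-III shifts dates for anonymisation, so generate a realistic synthetic
# date of birth purely for analytical demonstrations.
merged["DOB_Demo"] = [
    fake.date_of_birth(minimum_age=18, maximum_age=90) for _ in range(len(merged))
]

print(merged.shape)  # roughly (274022, 23) in the paper's curated subset
```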

For unstructured data, a single clinical note was selected from the NoteEvents table, chosen for its completeness and representative content, and converted into a generic text file for evaluation. For the structured dataset, the scientists designed 30 prompts of varying complexity, covering data preprocessing, aggregation tasks, and operational requirements, for example comparing a simple “What is the median age?” prompt with a more complex “What is the median age of female subjects?” prompt. For the unstructured dataset, a GPT-5 model segmented the selected clinical note into 50 semantically coherent chunks of 400 tokens each with a 50-token overlap, and the pipeline used the MiniLM-L6-v2 embedding model with a FAISS vector index for efficient retrieval. This setup enables reliable retrieval and summarisation of clinically relevant information, while also illustrating the practical complexity of deploying such technologies in clinical settings. Results demonstrate the potential of LLMs to support precise querying and accurate information extraction in clinical workflows, exceeding traditional baselines in off-the-shelf evidence detection tasks such as paediatric depression detection and postpartum haemorrhage identification.
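
The retrieval side of that RAG pipeline could look roughly like the sketch below, assuming sentence-transformers and faiss-cpu are installed; the study’s chunk boundaries were produced by an LLM, so the simple overlapping word split here only approximates that step, and `clinical_note.txt` is a hypothetical file name for the exported note.

```python
# Hedged sketch of chunking, embedding, and FAISS retrieval over a single clinical note.
# The LLM-based semantic chunking used in the study is approximated by a word-level split.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split the note into overlapping chunks of roughly `size` tokens (here: words)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

note = open("clinical_note.txt").read()  # hypothetical export of the selected NoteEvents record
chunks = chunk_text(note)

# Embed the chunks and build an index; normalised vectors make inner product == cosine.
embeddings = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

# Retrieve the chunks most relevant to a clinical question before prompting the LLM.
query = "Why was the pre-surgical physical exam not obtained?"
q_emb = embedder.encode([query], normalize_embeddings=True)
_, ids = index.search(np.asarray(q_emb, dtype="float32"), 3)
context = [chunks[i] for i in ids[0]]  # grounding passages handed to the generator
```

Grounding the generator in retrieved passages rather than in parametric memory alone is what underpins the hallucination mitigation described earlier.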

Because the table joins are one-to-many, the 101-patient sample expands to 274,022 records with 23 features, and the synthetic DOB_Demo field generated with the Faker and datetime libraries keeps age-based analyses meaningful despite MIMIC-III’s date anonymisation. Tests confirm the effectiveness of the developed evaluation framework, which automatically generates question and answer pairs tailored to the characteristics of each dataset and task. For structured data, the 30 prompts range from simple median-age calculations to conditional filtering and aggregation, comprehensively exercising the pipeline’s capabilities.

A complex prompt, such as “What is the median age of female subjects?”, required both conditional filtering and aggregation, pushing the LLMs to demonstrate advanced analytical skills. Measurements confirm that larger models generally achieve superior results in information extraction tasks, while even smaller models can assist in these processes, offering flexibility in deployment scenarios. The unstructured data component involved segmenting a clinical note into 50 semantically coherent chunks, each serving as the basis for targeted question generation; for example, the question “Why was the pre-surgical physical exam not obtained?” was directly derived from the sentence: “Physical exam prior to surgery was not obtained since the patient was intubated and sedated.” This approach ensured questions were grounded in the source text, facilitating accurate evaluation and highlighting the potential for reliable retrieval and summarisation of clinically relevant information.
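
For reference, the target Pandas code for the two prompt tiers discussed above might look like the following; a tiny toy DataFrame stands in for the merged MIMIC-III subset, and deriving age from the synthetic DOB_Demo field is an illustrative assumption.

```python
# Illustrative "expected answers" for a simple and a complex structured-data prompt.
# Column names follow MIMIC-III conventions; the age derivation is an assumption.
import pandas as pd

merged = pd.DataFrame({
    "SUBJECT_ID": [1, 2, 3, 4],
    "GENDER": ["F", "M", "F", "M"],
    "DOB_Demo": ["1950-03-14", "1972-07-02", "1988-11-23", "1965-01-30"],
})

# Derive an AGE column from the synthetic DOB_Demo field.
today = pd.Timestamp.today()
merged["AGE"] = (today - pd.to_datetime(merged["DOB_Demo"])).dt.days // 365

# Simple prompt: "What is the median age?"  -> a single aggregation.
median_age = merged["AGE"].median()

# Complex prompt: "What is the median age of female subjects?"
# -> conditional filtering followed by aggregation.
median_age_female = merged.loc[merged["GENDER"] == "F", "AGE"].median()

print(median_age, median_age_female)
```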

👉 More information
🗞 Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science
🧠 ArXiv: https://arxiv.org/abs/2601.20674

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
