DeepSeek-R1-Distill-Llama-70B Leads New 12-Dataset Benchmark for Causal Discovery

Researchers are increasingly exploring the potential of large language models (LLMs) to understand causal relationships, a crucial skill for reliable application in fields such as biomedicine. Sydney Anuyah, Sneha Shajee-Mohan, and Ankit-Singh Chauhan, all from Indiana University, alongside Sunandan Chakraborty et al., have rigorously benchmarked 13 open-source LLMs on their ability to perform pairwise causal discovery (PCD) from text. Their work, utilising a novel benchmark comprising 12 diverse datasets, assesses both causal detection and extraction (identifying whether a causal link exists and pinpointing the specific cause and effect) and reveals significant deficiencies in current models. Despite testing various prompting techniques, including Chain-of-Thought, the best-performing models achieved scores below 50%, highlighting a critical gap in LLM reasoning capabilities and underscoring the need for improved methods to handle complex, real-world causal inference tasks.

Their benchmark employed 12 diverse datasets to evaluate two core skills: Causal Detection, identifying the presence of a causal link within text, and Causal Extraction, pinpointing the exact cause and effect phrases. The study meticulously tested various prompting methods, ranging from simple zero-shot instructions to more sophisticated techniques like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL), to determine the most effective approach for eliciting causal reasoning.
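To make these prompting strategies concrete, the sketch below shows what zero-shot, Chain-of-Thought, and few-shot templates for the two tasks might look like. The wording, labels, and worked exemplar here are illustrative assumptions rather than the benchmark's actual prompts, which are published in the authors' repository.

```python
# Illustrative prompt templates for pairwise causal discovery (PCD).
# These are assumptions for exposition, not the benchmark's actual prompts.

ZERO_SHOT_DETECTION = (
    "Does the following text explicitly state a causal relationship? "
    "Answer 'Yes' or 'No'.\n\nText: {text}\nAnswer:"
)

COT_DETECTION = (
    "Does the following text explicitly state a causal relationship? "
    "Reason step by step about whether one event is described as bringing "
    "about another, then give a final answer of 'Yes' or 'No'.\n\nText: {text}"
)

FEW_SHOT_EXTRACTION = (
    "Extract the cause and effect phrases from the text.\n\n"
    "Text: Prolonged drought caused widespread crop failure.\n"
    "Cause: Prolonged drought\n"
    "Effect: widespread crop failure\n\n"
    "Text: {text}\n"
    "Cause:"
)

def build_prompt(template: str, text: str) -> str:
    """Fill a template with the passage the model should analyse."""
    return template.format(text=text)
```

In a setup of this kind, the same passage is run through each template, so that differences in scores can be attributed to the prompting strategy rather than to the underlying data.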

The results reveal significant deficiencies in the current generation of LLMs, highlighting a major hurdle in their reliable application to complex domains. The top-performing model for causal detection, DeepSeek-R1-Distill-Llama-70B, only achieved a mean score of 49.57%, indicating a limited ability to accurately identify causal relationships. Similarly, Qwen2.5-Coder-32B-Instruct, the best model for causal extraction, reached just 47.12%, demonstrating challenges in precisely isolating cause-and-effect phrases. Experiments showed models excelled at identifying simple, explicit causal relationships expressed in single sentences, but their performance dramatically decreased when confronted with more realistic and challenging scenarios.
Specifically, performance plummeted when dealing with implicit relationships, causal links spanning multiple sentences, or texts containing multiple causal pairs. To facilitate reproducible research, the team constructed a unified evaluation framework built upon a dataset validated with high inter-annotator agreement (κ ≥ 0.758). They have made all data, code, and prompts publicly available, fostering further investigation and innovation in this crucial area of artificial intelligence. This work establishes a robust methodology for evaluating LLMs’ ability to discern causation, a vital step towards ensuring their safe and effective integration into critical applications like healthcare and scientific discovery.

The research establishes a clear need for improved LLM architectures and training strategies to enhance their causal reasoning capabilities. The study unveils that while LLMs demonstrate promise in processing unstructured data, their ability to move beyond pattern matching to genuine causal inference remains limited. This is particularly concerning in fields like biomedicine, where misinterpreting correlation as causation could have severe consequences for clinical decision-making. By providing a comprehensive benchmark and publicly accessible resources, the team aims to spur further research and development of LLMs capable of reliable and trustworthy causal discovery. The work opens avenues for creating AI systems that can not only analyze data but also understand the underlying mechanisms driving observed phenomena, ultimately leading to more informed and effective interventions.

LLM Causal Discovery via Textual Benchmarking

Scientists investigated the capacity of thirteen open-source large language models (LLMs) to perform pairwise causal discovery (PCD) from textual data, a critical capability for safe deployment in high-stakes domains such as biomedicine. The research team engineered a comprehensive benchmark using twelve diverse datasets to evaluate two core competencies: Causal Detection, which assesses whether a causal relationship is present in a text, and Causal Extraction, which requires identifying the precise cause and effect spans. To rigorously assess performance, experiments employed multiple prompting strategies, ranging from zero-shot instructions to more advanced approaches such as Chain-of-Thought (CoT) reasoning and Few-shot In-Context Learning (FICL). The study introduced a unified evaluation framework built on a carefully curated dataset validated through high inter-annotator agreement (κ ≥ 0.758), ensuring the reliability and robustness of the benchmark.
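The reported agreement of κ ≥ 0.758 is a chance-corrected statistic. Assuming it denotes Cohen's kappa computed over two annotators assigning binary causal/non-causal labels (the paper may aggregate across annotators or datasets differently), it follows the formula sketched below.

```python
def cohens_kappa(annotator_a: list[bool], annotator_b: list[bool]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance given each
    annotator's own labelling rate. Sketch for the two-annotator, binary case."""
    n = len(annotator_a)
    p_o = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n
    rate_a = sum(annotator_a) / n
    rate_b = sum(annotator_b) / n
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators always give the same label
    return (p_o - p_e) / (1 - p_e)
```

Agreement in this range is conventionally read as substantial, which is why the annotations can serve as a reliable gold standard for scoring the models.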

Researchers meticulously annotated textual instances, defining causality strictly as a factual, explicit, and unambiguous relationship stated within the source text, thereby excluding inferred or speculative links. Each LLM was evaluated on both Causal Detection and Causal Extraction tasks, with performance metrics recorded to quantify their ability to recognize and isolate cause–effect relationships. Results showed that DeepSeek-R1-Distill-Llama-70B achieved the highest mean causal detection score (Cdetect) at 49.57%, while Qwen2.5-Coder-32B-Instruct led causal extraction (Cextract) with 47.12%. These results enable a fine-grained comparison of model strengths and weaknesses across varying levels of causal complexity.
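The paper's exact metric definitions are not restated here, but a framework of this kind typically scores detection as binary classification and extraction by comparing predicted spans against the annotated gold spans. The sketch below is a minimal illustration under those assumptions; the data container, plain accuracy, and token-level span F1 are illustrative choices rather than the authors' actual metrics.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    has_causal_link: bool        # gold label for Causal Detection
    cause: str = ""              # gold spans for Causal Extraction
    effect: str = ""

def detection_accuracy(predictions: list[bool], gold: list[Example]) -> float:
    """Fraction of passages whose predicted causal/non-causal label matches the annotation."""
    return sum(p == g.has_causal_link for p, g in zip(predictions, gold)) / len(gold)

def span_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between a predicted span and a gold span (one common scoring choice)."""
    pred_tokens, ref_tokens = predicted.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```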

Experimental findings revealed that LLMs perform relatively well when identifying simple, explicit causal relations expressed within a single sentence. However, performance deteriorated substantially when models were confronted with more challenging scenarios. Implicit causal relationships, causal links spanning multiple sentences, and passages containing multiple interacting causal pairs consistently proved difficult for all evaluated models. Notably, even the top-performing detection model achieved a mean score below 50%, underscoring a significant gap in reliably identifying causal relationships from natural language text. Similarly, extraction performance remained below 50%, highlighting the difficulty LLMs face in precisely isolating cause and effect spans, even when causality is explicitly stated.

The study further demonstrated that discerning causation from correlation remains a major limitation for current LLMs, particularly in linguistically complex or context-dependent settings. Performance declined sharply in texts requiring cross-sentence reasoning or disambiguation of overlapping causal structures, revealing fundamental challenges in causal language understanding. To promote transparency and reproducibility, the researchers made all datasets, code, and prompting templates publicly available at https://github.com/sydneyanuyah/CausalDiscovery, encouraging further research and benchmarking in this area.

Overall, this work provides a detailed evaluation of thirteen open-source LLMs across twelve experimental configurations, offering a robust and standardized benchmark for assessing causal reasoning in text. By jointly examining causal detection and extraction, the study delivers a comprehensive assessment of LLM causal understanding, establishing a critical foundation for future model development. The findings clearly demonstrate that despite advances in language modeling, reliable causal reasoning remains an open challenge, particularly for applications in sensitive domains such as biomedicine, policy analysis, and scientific discovery.

LLMs struggle with causal relationship extraction, often mistaking correlation for causation

Researchers have assessed the ability of large language models (LLMs) to identify and extract causal relationships from text, a crucial skill for deployment in fields like biomedicine. A new benchmark, utilising 12 diverse datasets, was created to evaluate LLMs on pairwise causal discovery (PCD), focusing on both detecting the presence of causal links and pinpointing the exact cause and effect phrases. Testing 13 open-source LLMs with varying prompting techniques, including zero-shot, Chain-of-Thought, and Few-shot In-Context Learning, revealed significant limitations in current models. The best-performing model for causal detection achieved a mean score of only 49.57%, while the top model for causal extraction reached 47.12%. Performance was strongest when dealing with simple, explicit relationships contained within single sentences, but drastically decreased when faced with implicit relationships, multi-sentence links, or texts containing multiple causal pairs.

👉 More information
🗞 Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts
🧠 ArXiv: https://arxiv.org/abs/2601.15479

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Hvd Advances Text-Video Retrieval by Mimicking Human Vision with Key Frame Selection
January 27, 2026

Cefgc Achieves 3-Round Federated Graph Classification with Generative Diffusion Models
January 27, 2026

Predicting Healthcare Flows: Four-Year Mobility Data Improves Hospital Visitation Analysis
January 27, 2026