Scientists are increasingly focused on understanding the inner workings of large language models, but current interpretability research frequently suffers from limited generalisability and overstated causal claims. Shruti Joshi of Mila, Quebec AI Institute and Université de Montréal, Aaron Mueller of Boston University, and David Klindt of Cold Spring Harbor Laboratory, working with Wieland Brendel, Patrik Reizinger, and Dhanya Sridhar and collaborators at the Max Planck Institute for Intelligent Systems, the ELLIS Institute Tübingen, the University of Tübingen, and Mila, address this critical issue by advocating a rigorous application of causal inference principles. Their work clarifies the evidence required to support interpretability claims, outlining how observations establish associations and interventions establish effects, yet both fall short of proving counterfactual relationships without controlled supervision. By operationalising a causal hierarchy through causal representation learning, the researchers present a diagnostic framework designed to keep interpretability methods and evaluations aligned with the available evidence, ultimately promoting more robust and generalisable findings in the field.
Imagine trying to understand why a complex machine malfunctions; observing broken parts isn’t enough. Similarly, analysing artificial intelligence needs more than spotting patterns in its responses. Establishing clear cause and effect within these systems is essential if we want to truly understand and reliably improve them. Scientists investigating large language models (LLMs) are confronting persistent challenges in translating initial research successes into dependable, widely applicable results.
Despite advances in understanding how these models function internally, many findings struggle to generalise beyond specific test conditions, and interpretations of causal relationships often extend beyond what the available evidence supports. This work addresses a fundamental issue: the need for a clearer connection between interpretability research and the ability to make verifiable claims about a model’s behaviour.
Rather than merely observing correlations between model components and outputs, researchers are now focused on establishing causal links: identifying how specific internal mechanisms genuinely influence the model’s responses. Pinpointing causation within a complex neural network is not straightforward. Observations of associations between behaviour and internal components are a starting point, but they are insufficient to establish a causal relationship.
Interventions, such as selectively disabling parts of the model or altering its internal activations, offer stronger evidence, demonstrating how changes affect measurable outcomes across a range of inputs. However, determining what would have happened under different, unobserved conditions remains a significant hurdle. A new framework proposes to ground interpretability claims within the established principles of causal inference, drawing upon Pearl’s causal hierarchy to clarify what constitutes valid evidence.
This hierarchy distinguishes between observing relationships, demonstrating how interventions change behaviour, and making counterfactual predictions about what the model’s output would have been under alternative scenarios. By operationalising this hierarchy through causal representation learning, researchers aim to specify which aspects of a model’s internal state are recoverable and under what assumptions.
At the core of this approach is the idea that interpretability research should explicitly define the ‘estimand’, the precise quantity being targeted, and the ‘intervention class’, the specific manipulations used to investigate causal effects. Once these are clearly defined, practitioners can better select appropriate methods, evaluate results rigorously, and predict when findings will reliably extend to new situations. This diagnostic framework offers a pathway towards more dependable and generalisable insights into the inner workings of LLMs, moving beyond local successes to achieve lasting progress in the field.
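As a concrete illustration of what declaring these two things might look like in practice, the sketch below records an estimand and an intervention class alongside a claim. The field names and example values are hypothetical, not taken from the paper; the point is only that both are stated explicitly before results are reported.

```python
from dataclasses import dataclass


@dataclass
class InterpretabilityClaim:
    """Illustrative record of a claim, its estimand, and the evidence behind it."""
    estimand: str                   # the precise quantity being targeted
    intervention_class: list[str]   # the manipulations the claim is licensed over
    evidence_rung: str              # "associational", "interventional", or "counterfactual"
    prompt_distribution: str        # the inputs over which the effect is measured


claim = InterpretabilityClaim(
    estimand="average change in log-probability of the correct completion token",
    intervention_class=["zero-ablation of a single attention head",
                        "activation patching at one residual-stream layer"],
    evidence_rung="interventional",
    prompt_distribution="a held-out set of subject-verb agreement prompts (hypothetical)",
)
```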
Establishing causal links between internal states and language model outputs
A central tenet of this work involves applying principles from causal inference to interpretability research on large language models. Specifically, the research establishes a framework grounded in Pearl’s causal hierarchy, beginning with observational studies to identify associations between model behaviour and internal activations. These observations form the initial stage, documenting correlations without implying any directional influence.
Building on this, intervention experiments were conducted, such as ablations that deactivate model components or activation patching that modifies internal representations, to assess how alterations to these internal components affect specific behavioural metrics, measured as changes in token probabilities across a range of prompts. The study highlights a critical limitation of many interpretability approaches: the difficulty of making verifiable counterfactual claims.
Counterfactuals attempt to determine what a model’s output would have been under a different, unobserved intervention, a question that remains largely unanswered without controlled supervision. To address this, researchers focused on causal representation learning (CRL), a technique designed to specify which variables can be reliably recovered from model activations and, crucially, the underlying assumptions required for this recovery.
The methodology details a diagnostic framework intended to guide practitioners in selecting appropriate methods and evaluations. This framework aims to align interpretability claims with the supporting evidence, ensuring that findings are more likely to generalise beyond the specific experimental setup. Rather than merely demonstrating correlations, the approach prioritises establishing causal relationships within the model, moving beyond purely associational accounts of model behaviour. At the core of this lies a focus on estimands, the precise quantities a method targets, and on intervention classes, the sets of manipulations over which those effects are estimated.
Establishing causal links between internal activations and model behaviour
Viewed through the lens of causal reasoning, a clear framework for interpretability research emerges. Initial analyses focus on associations between model behaviour and internal activations, revealing correlations but stopping short of establishing causal links. Interventional studies, in which activations are directly manipulated through ablation or patching, demonstrate how these alterations affect behavioural metrics, specifically the average change in token probabilities across a prompt set.
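To make this concrete, here is a minimal sketch of measuring how one such intervention shifts a target token’s probability across a prompt set. It uses PyTorch forward hooks; the model interface, the `model.layers` attribute, and the assumption that each block returns a plain activation tensor are stand-ins for illustration, not the paper’s actual setup.

```python
import torch


def intervention_effect(model, tokenizer, prompts, layer_idx, patch_fn, target_token_id):
    """Average change in the target token's log-probability when `patch_fn` is
    applied to one layer's output activations, across a set of prompts.

    Illustrative assumptions: `model(**inputs)` returns an object with `.logits`,
    the transformer blocks live in `model.layers`, and each block returns a plain
    tensor (real architectures may return tuples instead).
    """
    deltas = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")

        with torch.no_grad():
            clean = model(**inputs).logits[0, -1].log_softmax(dim=-1)

        def hook(module, args, output):
            # Returning a value from a forward hook replaces the layer's output.
            return patch_fn(output)

        handle = model.layers[layer_idx].register_forward_hook(hook)
        try:
            with torch.no_grad():
                patched = model(**inputs).logits[0, -1].log_softmax(dim=-1)
        finally:
            handle.remove()

        deltas.append((patched - clean)[target_token_id].item())

    return sum(deltas) / len(deltas)


# One example member of an intervention class: zero-ablation of the layer's output.
zero_ablate = lambda activations: torch.zeros_like(activations)
```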
These interventions provide evidence for claims about how model components influence outputs. Yet counterfactual reasoning, determining what the output would be given the same prompt under a different, unobserved intervention, remains a challenge: without controlled supervision, such counterfactuals are unverifiable. The work details how causal representation learning (CRL) operationalises Pearl’s causal hierarchy, defining which variables are recoverable from activations and the assumptions needed for that recovery.
Specifically, the research highlights that identifying estimands, the quantities answering a question of interest, requires specifying an estimator and acknowledging an equivalence class of indistinguishable hypotheses. At the core of this work lies the concept of identifiability, where an estimand is considered identifiable up to an equivalence class if a provably accurate estimator exists.
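The notion of identifiability “up to an equivalence class” can be illustrated with a toy example (not from the paper): if activations are a linear mixture of latent features, then any invertible re-mixing of those features, paired with a suitably adjusted decoder, explains the observed activations equally well, so the data alone cannot single out one hypothesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model: activations are a linear mixture of latent features.
n_latents, n_dims, n_samples = 3, 8, 1000
Z = rng.normal(size=(n_samples, n_latents))    # ground-truth latent features
W = rng.normal(size=(n_latents, n_dims))       # ground-truth mixing (decoder)
A = Z @ W                                      # observed activations

# Alternative hypothesis: latents re-mixed by an invertible matrix M, with the
# decoder adjusted accordingly, produce exactly the same activations.
M = rng.normal(size=(n_latents, n_latents))    # invertible with probability one
Z_alt = Z @ M
W_alt = np.linalg.inv(M) @ W

print(np.allclose(A, Z_alt @ W_alt))           # True: the hypotheses are indistinguishable
# Without further assumptions (sparsity, interventions, supervision, ...), the
# latents are identifiable only up to this equivalence class of transformations.
```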
The study demonstrates that decodability, an associational claim, should not be used to justify control, an interventional effect. Still, the framework allows for a precise understanding of when proxy-based success might generalise, or conversely, fail to do so. By employing Pearl’s causal ladder, the research distinguishes between associational, interventional, and counterfactual questions, each requiring a different level of evidence.
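A toy illustration of why this matters (again purely illustrative, not the paper’s experiment): a concept can be almost perfectly decodable from an activation dimension that the model never actually reads, so high probe accuracy, an associational (L1) result, says nothing about whether intervening on that dimension changes behaviour, an interventional (L2) result.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Two activation dimensions: dim 0 causally drives the output,
# dim 1 merely correlates with the concept but is ignored downstream.
concept = rng.integers(0, 2, size=n)
acts = np.stack([concept + 0.1 * rng.normal(size=n),        # causal dimension
                 concept + 0.1 * rng.normal(size=n)], axis=1)  # correlated-only dimension

readout = np.array([1.0, 0.0])      # the "model" only reads dimension 0
output = acts @ readout

# L1 (association): the concept is almost perfectly decodable from dimension 1 alone.
decoding_acc = np.mean((acts[:, 1] > 0.5) == concept)

# L2 (intervention): ablating dimension 1 leaves the output unchanged,
# so decodability did not imply control.
ablated = acts.copy()
ablated[:, 1] = 0.0
effect_of_ablation = np.abs(ablated @ readout - output).mean()

print(f"decoding accuracy from dim 1: {decoding_acc:.2f}")                   # ~0.99
print(f"mean output change after ablating dim 1: {effect_of_ablation:.2f}")  # 0.00
```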
Moving up the causal ladder, from associations to interventions to counterfactuals, demands progressively stronger evidence: L1 evidence supports associational claims, L2 evidence supports interventional claims, and L3 evidence is needed for counterfactuals. Layerwise activations are defined recursively, with a(l) denoting the activations at layer l and h(l) denoting those activations expressed in a potentially learned basis. A feature is defined as a subspace of this representation, but whether it is interpretable remains an empirical question.
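Read literally, the notation can be sketched as follows, with a toy affine-plus-ReLU layer, a random basis, and an arbitrary two-dimensional subspace standing in for anything learned; none of these specific choices come from the paper.

```python
import numpy as np


def layer(a, W, b):
    """One illustrative layer: a(l) = f(l)(a(l-1)); here a simple affine map plus ReLU."""
    return np.maximum(a @ W + b, 0.0)


rng = np.random.default_rng(2)
d = 16
a = rng.normal(size=d)                      # a(0): input activations

# Recursively defined layerwise activations a(1), ..., a(L).
for l in range(3):
    W, b = rng.normal(size=(d, d)) / np.sqrt(d), rng.normal(size=d)
    a = layer(a, W, b)                      # a(l)

B = rng.normal(size=(d, d))                 # a (potentially learned) basis
h = B @ a                                   # h(l): the same activations in that basis

# A "feature" is a subspace of this representation: here, the span of two basis
# directions, read off by projecting h(l) onto it. Whether it is interpretable
# is an empirical question the projection alone cannot answer.
U = np.eye(d)[:, :2]                        # orthonormal basis of the subspace
feature_value = U.T @ h
```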
Distinguishing correlation from causation in large language model interpretation
Scientists attempting to understand the ‘black box’ of large language models have made progress in identifying connections between a model’s inner workings and its outputs, but a persistent problem has been the overstatement of what these connections actually mean. For years, researchers have struggled to move beyond observing that something happens inside a neural network to understanding why it happens, and what that implies about the model’s underlying knowledge or reasoning processes.
This work offers not a new technical fix but a framework for honest reporting, a vital step towards building genuinely interpretable artificial intelligence. The temptation to leap to causal conclusions from correlational data remains strong. Many studies demonstrate associations between specific internal activations and observed behaviours, but establishing causation requires more than showing a change in output after manipulating those activations.
Instead, a clear hierarchy of evidence is needed, mirroring Pearl’s work on causal inference, to justify claims about what a model ‘knows’ or ‘does’. Observations can reveal patterns, interventions like ablation studies can show how changes affect behaviour, but asserting what a model would do under different, unobserved conditions demands a level of control rarely achieved.
The research proposes a diagnostic checklist, a practical tool for researchers to self-assess the strength of their claims. By explicitly stating the ‘rung’ of evidence supporting their interpretations, whether they’ve demonstrated association, intervention, or (more demanding) counterfactual reasoning, scientists can avoid overreaching. Even limited claims, carefully qualified and supported by multiple lines of evidence, can be valuable, particularly in constrained settings.
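One way such a self-assessment might be rendered in practice (an illustrative sketch, not the authors’ published checklist) is a short set of questions each claim has to answer, together with a mapping from the available evidence to the highest rung of claim it can support.

```python
# An illustrative self-assessment, not the authors' published checklist.
diagnostic_questions = [
    "What is the estimand: which precise quantity does the claim target?",
    "Which rung is being claimed: association, intervention, or counterfactual?",
    "Which intervention class was actually used, and over which prompt distribution?",
    "Does the evidence match the rung (e.g. is decodability being used to claim control)?",
    "Up to which equivalence class is the estimand identifiable under the stated assumptions?",
    "Under what conditions is the finding expected to generalise beyond this setup?",
]


def rung_supported(has_observational, has_interventional, has_counterfactual_supervision):
    """Map the available evidence to the highest rung of claim it can support."""
    if has_counterfactual_supervision:
        return "counterfactual (L3)"
    if has_interventional:
        return "interventional (L2)"
    if has_observational:
        return "associational (L1)"
    return "no claim supported"
```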
Widespread adoption of this framework may see a shift away from sensationalised findings towards more incremental, but reliable, progress. Beyond this specific checklist, a broader change in culture is needed. The pressure to publish exciting results has encouraged speculation over careful analysis. Rather than seeking to ‘reverse engineer’ entire algorithms, future work should focus on identifying specific, testable mechanisms, and acknowledging the limitations of any particular approach.
👉 More information
🗞 Causality is Key for Interpretability Claims to Generalise
🧠 arXiv: https://arxiv.org/abs/2602.16698
