Researchers are increasingly concerned that large language models (LLMs) exhibit hidden biases despite generating seemingly logical reasoning chains. Iván Arcuschin from the University of Buenos Aires, David Chanin from University College London, independent researcher Adrià Garriga-Alonso, and Oana-Maria Camburu from Imperial College London present a novel, automated pipeline to detect these ‘unverbalized’ biases within LLM decision-making processes. Their work moves beyond reliance on predefined bias categories and hand-crafted datasets, instead utilising LLM autoraters to identify task-specific biases through statistically significant performance variations. This approach successfully uncovered previously unknown biases, such as those relating to Spanish fluency and writing formality, across six LLMs tested on hiring, loan approval, and university admissions tasks, while also validating existing understandings of biases concerning gender, race, and religion. Consequently, this research offers a scalable and practical method for continuous, automatic bias discovery in LLMs, representing a significant step towards more transparent and equitable artificial intelligence systems.
These biases, termed ‘unverbalized biases’, influence model decisions without being explicitly mentioned in the reasoning the LLM provides. Existing methods for evaluating bias typically rely on predefined categories and manually created datasets, proving both limiting and resource-intensive.
This new work circumvents these limitations by employing LLM ‘autoraters’ to generate potential bias concepts directly from task datasets. The pipeline then rigorously tests these concepts by creating variations in input data, specifically positive and negative examples, and applying robust statistical analysis to identify significant performance differences.
A key indicator of an unverbalized bias is a statistically significant change in model output that is not accompanied by a corresponding mention of the concept within the model’s chain-of-thought reasoning. This automated approach offers a scalable path towards identifying task-specific biases, moving beyond manual hypothesis generation and predefined categories.
The research demonstrates a practical method for uncovering hidden preferences within LLMs, enhancing the reliability and transparency of these increasingly powerful systems. The system’s efficiency is further enhanced through input clustering, staged sampling, and statistical early stopping techniques, achieving approximately one-third savings in computational cost compared to exhaustive evaluation methods.
By applying the pipeline to existing bias studies, researchers confirmed its adaptability across different dimensions and contexts, providing additional insights into how LLMs articulate their reasoning. This work represents a significant step towards understanding and mitigating the subtle biases that can affect the decision-making processes of large language models.
Concept hypothesis generation and statistical testing via controlled input variation
A multi-stage pipeline systematically tests concept hypotheses to detect unverbalized biases in large language models. Initially, inputs from a task dataset are embedded using a text embedding model and then grouped via k-means clustering to create semantically similar input sets. Representative inputs are sampled from each cluster, and a separate large language model is prompted to generate concept hypotheses potentially influencing the target model’s behaviour, ensuring the hypothesis-generating model remains blind to the target model’s responses.
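The clustering-and-sampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the Lloyd's-algorithm k-means, the function names, and the even per-cluster sampling budget are all assumptions made for the sketch.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means (Lloyd's algorithm) over embedding vectors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def sample_representatives(inputs, labels, budget=30, seed=0):
    """Sample a fixed budget of inputs, spread evenly across clusters,
    to feed the hypothesis-generating LLM."""
    rng = np.random.default_rng(seed)
    k = int(labels.max()) + 1
    per_cluster = max(1, budget // k)
    picks = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        take = min(per_cluster, len(members))
        picks.extend(rng.choice(members, size=take, replace=False))
    return [inputs[i] for i in picks[:budget]]
```

Sampling evenly across clusters (rather than uniformly from the dataset) is what keeps the 30-input hypothesis set semantically diverse.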
This initial phase uses only 30 inputs for hypothesis generation, while subsequent statistical testing is performed on 766 to 2,493 inputs per concept, establishing a clear separation between hypothesis generation and evaluation. Concept hypotheses are then tested through controlled input variations, creating positive and negative variants of each input to promote or diminish the concept.
The model’s binary decision is recorded for both the original input and its variants, and discordant pairs, instances where the model’s decision changes between the positive and negative variants, are identified. Statistical significance is assessed using McNemar’s test, which compares the counts of positive and negative discordant pairs, |D+| and |D−|; under the null hypothesis of no effect these counts are expected to be balanced, so a sufficiently small p-value indicates the concept influences decisions. To determine whether a concept remains unverbalized, a verbalization rate is calculated on the identified discordant pairs, assessing whether the concept is cited as justification in the model’s reasoning.
A concept is flagged as an unverbalized bias if the verbalization rate falls below a predefined threshold τ, indicating the model’s decision is influenced by the concept without acknowledging it in its reasoning. The pipeline incorporates statistical early stopping techniques, including O’Brien-Fleming alpha spending and futility analysis, to manage computational costs and achieve approximately one-third savings over exhaustive evaluation.
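The decision rule described above can be sketched in a few lines. The exact binomial form of McNemar's test used here is an assumption (the paper may use the chi-square approximation), as is the significance level of 0.05; the 30% verbalization threshold τ is the one reported later in this article.

```python
from math import comb

def mcnemar_exact_p(n_pos, n_neg):
    """Exact (binomial) McNemar test on discordant pair counts.

    Under the null that the concept has no effect, decision flips in each
    direction are equally likely, so the smaller count follows
    Binomial(n, 0.5). Returns the two-sided p-value."""
    n = n_pos + n_neg
    if n == 0:
        return 1.0
    k = min(n_pos, n_neg)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

def flag_unverbalized(n_pos, n_neg, n_verbalized, alpha=0.05, tau=0.30):
    """Flag a concept as an unverbalized bias: its effect on decisions is
    statistically significant, yet it is cited in fewer than a fraction
    tau of the discordant pairs' reasoning traces."""
    significant = mcnemar_exact_p(n_pos, n_neg) < alpha
    verbalization_rate = n_verbalized / max(1, n_pos + n_neg)
    return significant and verbalization_rate < tau
```

For example, a concept that flips 20 decisions one way and 2 the other is highly significant; whether it is flagged then depends only on how often the model mentions it.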
Uncovering Hidden Influences on Large Language Model Outputs
Researchers developed a fully automated pipeline for detecting task-specific unverbalized biases in large language models (LLMs). The technique automatically discovers previously unknown biases, such as Spanish fluency, English proficiency, and writing formality, alongside validating biases identified in prior work including gender, race, religion, and ethnicity.
This work introduces a method for identifying biases that are not explicitly stated in the LLM’s reasoning process. The pipeline operates by generating candidate bias concepts using LLM autoraters, then testing these concepts on progressively larger input samples. Statistical techniques, including McNemar’s test, are applied to identify concepts yielding statistically significant performance differences without being cited in the LLM’s chain-of-thought reasoning.
A concept is flagged as an unverbalized bias if it causes statistically significant changes in model decisions while remaining unmentioned in the reasoning. Input clustering and concept generation begin with embedding the inputs and applying k-means clustering to group similar examples. From each cluster, a small number of representative inputs (30 in total) are sampled to prompt an LLM to hypothesize concepts influencing model behaviour.
Statistical testing is then performed on 766 to 2,493 inputs per concept, clearly separating hypothesis generation from inference. The system generates paired input variations, promoting or diminishing a concept, and uses an LLM judge to filter out test cases introducing confounds beyond the target concept.
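The paired-variation step might look like the following sketch. The prompt wording, function names, and the pluggable `judge` callable are illustrative assumptions, not the paper's actual prompts.

```python
def make_variation_prompts(original_input, concept):
    """Build paired rewrite instructions: one variant promoting the
    concept, one diminishing it, leaving everything else unchanged."""
    def prompt(direction):
        return (f"Rewrite the following text so that it {direction} the "
                f"concept '{concept}', changing nothing else:\n\n"
                f"{original_input}")
    return prompt("strongly exhibits"), prompt("clearly lacks")

def filter_confounds(pairs, judge):
    """Keep only variant pairs an LLM judge confirms differ solely in the
    target concept. `judge` is any callable on (positive, negative)
    variants returning True when no confound is introduced."""
    return [pair for pair in pairs if judge(*pair)]
```

Keeping the judge separate from the variation generator mirrors the pipeline's separation of concerns: one model proposes edits, another audits them.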
Before testing, baseline responses are collected, and concepts already verbalized at a rate above the threshold τ are filtered out as a cost-saving measure. Variation responses are then collected, and a further verbalization filter is applied to discordant pairs, cases where the model’s decision flipped between variations, dropping concepts cited as justification.
Statistical testing using McNemar’s test determines whether a concept influences the model’s behaviour, with a significance level of α and a conditional power threshold of γ used for early stopping. This pipeline identifies what are termed ‘unverbalized biases’, factors systematically affecting outputs but absent from the chain-of-thought justifications provided by the LLM.
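The early-stopping side of the pipeline can be illustrated with the Lan–DeMets approximation to O'Brien-Fleming alpha spending, a standard choice for this family of boundaries; whether the paper uses exactly this spending function is an assumption, and the futility (conditional power) check is omitted from the sketch.

```python
from statistics import NormalDist

_N = NormalDist()

def obf_alpha_spent(t, alpha=0.05):
    """O'Brien-Fleming-type alpha-spending function (Lan-DeMets form):
    the cumulative type-I error allowed once a fraction t of the planned
    inputs has been evaluated. Very little alpha is spent early, so only
    overwhelming effects trigger an early stop."""
    z = _N.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _N.cdf(z / t ** 0.5))

def should_stop_early(p_value, t, alpha=0.05):
    """Stop testing a concept at an interim look if its current p-value
    beats the alpha spent so far at information fraction t."""
    return p_value < obf_alpha_spent(t, alpha)
```

At t = 1 the full alpha is spent, recovering the ordinary fixed-sample test, which is what makes interim looks at smaller t statistically safe.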
The approach employs LLM-based ‘autoraters’ to propose potential bias concepts, then statistically tests these concepts by generating variations in input data and observing resulting performance differences. This automated method offers a scalable solution for discovering task-specific biases, moving beyond reliance on predefined categories and manually created datasets.
The authors acknowledge that the detected biases are descriptive, systematic shifts in decision-making, rather than inherently normative judgements of unfairness, requiring further contextual audit to determine appropriateness. The pipeline also relies on a threshold of 30% for considering a concept ‘unverbalized’, meaning that if a bias is mentioned in more than 30% of relevant reasoning traces, it is not flagged as hidden. Future research could focus on refining this threshold and exploring methods to address the identified biases, potentially through techniques like bias mitigation training or algorithmic adjustments to promote fairer outcomes.
👉 More information
🗞 Biases in the Blind Spot: Detecting What LLMs Fail to Mention
🧠 ArXiv: https://arxiv.org/abs/2602.10117
