AI LLMs Resist Manipulation with Epistemic Training

Researchers are tackling the pervasive problem of cognitive bias in large language models (LLMs) used for automated judgment. Qian Wang from the National University of Singapore, together with Xuandong Zhao, Zirui Zhang, et al., demonstrates a novel approach to mitigating these biases, which commonly cause LLMs to alter their reasoning based on misleading cues within prompts. The work is significant because current methods, such as prompting or supervised fine-tuning, lack generalizability: they fail to address the underlying optimization objective that allows bias to influence outcomes. The team introduces Epistemic Independence Training (EIT), a reinforcement learning framework designed to make bias cues non-predictive of reward, thereby fostering robust and transferable epistemic independence in LLMs.

Despite advances in reasoning capabilities, these models remain susceptible to spurious cues such as consensus claims or authority appeals, altering their judgments based on irrelevant information.

Why Existing Bias Mitigation Methods Fall Short

Existing mitigation strategies, including prompting and supervised fine-tuning, have proven ineffective as they address surface behaviours without altering the underlying optimization objective that makes these bias cues predictive. This work introduces EIT, grounded in the principle that to achieve independence, bias cues must be made non-predictive of reward.
The framework operationalizes this through a balanced conflict strategy, ensuring bias signals are equally likely to support both correct and incorrect answers. A specifically designed reward system penalizes models for following bias towards incorrect answers, but does not reward agreement with bias when it aligns with the truth.

Experiments conducted on the Qwen3-4B model demonstrate that EIT significantly improves both accuracy and robustness when faced with adversarial biases. Notably, models trained solely on bandwagon bias exhibit generalization to unseen bias types, including authority and distraction, suggesting that EIT fosters transferable epistemic independence rather than bias-specific heuristics.

Performance gains include a 13.2% improvement in adversarial-bias accuracy, increasing from 70.1% to 83.3% on Qwen3-4B, alongside a 16.4% increase in robustness. Further analysis reveals that EIT induces substantive reasoning, characterized by explicit domain engagement and logical verification, rather than superficial mimicry of independent behavior. Code and data supporting this research are publicly available.
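These two headline metrics can be made concrete with a small scoring helper. Note that "robustness" is read here as answer consistency between clean and biased versions of the same prompt; that interpretation, along with the function name and inputs, is an assumption for illustration rather than the paper's exact definition.

```python
def eval_judge(clean_preds, biased_preds, gold):
    """Score an LLM judge on biased prompts.

    Returns (adversarial accuracy, robustness). Robustness is computed as
    the fraction of items whose answer is unchanged between the clean and
    the biased version of the prompt: one plausible reading of the metric,
    not necessarily the paper's exact definition.
    """
    n = len(gold)
    acc_biased = sum(p == g for p, g in zip(biased_preds, gold)) / n
    robustness = sum(c == b for c, b in zip(clean_preds, biased_preds)) / n
    return acc_biased, robustness
```

Under this reading, a model can be robust without being accurate (consistently wrong), which is why the paper reports both numbers.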

Reinforcement learning mitigates prompt-based cognitive bias in large language models by making misleading cues non-predictive of reward

Operationalizing Bias Resistance

This research investigates cognitive biases in large language models acting as automated judges. It operationalizes the core principle of epistemic independence through a balanced conflict strategy: injected bias signals support the correct answer in 50% of cases and an incorrect answer in the remaining 50%.

This statistical orthogonality between bias and truth forces the model to move beyond shortcut exploitation and engage in intrinsic reasoning. The team employed reinforcement learning to optimize for this outcome, contrasting with supervised fine-tuning which often encourages surface-level imitation of independence.
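One way to realize this balanced conflict strategy is to decide, per training example, whether the injected cue will endorse the gold answer or a random wrong option, with equal probability. The function and field names below are illustrative, not taken from the paper's code:

```python
import random

def assign_bias_target(example, rng):
    """Choose which option the injected bias cue will endorse.

    With probability 0.5 the cue backs the gold answer, otherwise a random
    wrong option, so across the training set the cue carries no information
    about the truth (the balanced conflict strategy).
    """
    if rng.random() < 0.5:
        target, aligned = example["answer"], True
    else:
        wrong = [o for o in example["options"] if o != example["answer"]]
        target, aligned = rng.choice(wrong), False
    return {**example, "bias_target": target, "bias_aligned": aligned}
```

Because the cue agrees with the truth exactly half the time, always following it (or always contradicting it) earns no more reward than chance, which is precisely what makes the cue non-predictive.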

Testing Generalization Across Multiple Cognitive Biases

Bias injection was performed on the MMLU-Pro dataset: models were trained only on bandwagon bias, which simulates social consensus, and then evaluated for generalization to unseen bias types. Specifically, the study tested authority bias, distraction bias, and position bias, categorizing authority and distraction as semantic biases and position as a structural cue.
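A bandwagon cue of the kind described could be injected into an MMLU-Pro style multiple-choice prompt along these lines; the cue wording and the prompt layout are placeholders, not the paper's actual templates:

```python
def inject_bandwagon(question, options, bias_target):
    """Append a bandwagon (social consensus) cue to a multiple-choice prompt.

    The "90% of respondents" phrasing and the overall layout are
    illustrative placeholders, not the templates used in the paper.
    """
    letters = "ABCDEFGHIJ"  # MMLU-Pro questions have up to ten options
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append(f"Note: 90% of respondents chose option "
                 f"{letters[options.index(bias_target)]}.")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)
```

Authority ("an expert recommends option X") or distraction cues would slot into the same template with different cue sentences.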

The EIT algorithm utilizes a hierarchical reward function comprising structural constraint, factual accuracy, and an independence incentive, each component designed to prevent specific failure modes. Structural constraint ensures parsable Chain-of-Thought reasoning paths, receiving a reward of α if the response adheres to this format.

Factual accuracy is rewarded with a positive value only for correct answers, preventing random contrarianism. The independence incentive, a context-dependent reward, penalizes following the bias when it contradicts the truth and grants a bonus for answering correctly even when the bias cue is wrong. The payoff is measurable: on the Distraction Bias test set, accuracy under wrong-bias conditions increased from 45.9% to 85.0%, and robustness improved from 40.6% to 79.7%.
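The three reward tiers described above might be sketched as follows; the coefficient values (α and the bonus and penalty magnitudes) are hypothetical, as the paper's exact constants are not reproduced here:

```python
def eit_reward(parsable, predicted, gold, bias_target,
               alpha=0.1, base=1.0, bonus=0.5, penalty=-0.5):
    """Hierarchical EIT-style reward with hypothetical coefficients.

    Tiers:
      1. structural constraint: small reward alpha for a parsable CoT response
      2. factual accuracy: positive reward only for the gold answer
      3. independence incentive: a bonus for answering correctly when the
         bias cue pointed at a wrong option, a penalty for following a wrong
         cue, and no extra reward for agreeing with a cue that happens to
         be correct
    """
    r = alpha if parsable else 0.0
    if predicted == gold:
        r += base
        if bias_target != gold:  # resisted a misleading cue
            r += bonus
    elif predicted == bias_target:  # followed the cue into an error
        r += penalty
    return r
```

The asymmetry is deliberate: resisting a wrong cue is rewarded, but agreeing with a correct cue earns only the base accuracy reward, so the cue itself never pays.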

For the in-domain bandwagon bias, accuracy with wrong bias improved from 70.1% to 83.3%, accompanied by a rise in robustness from 63.6% to 80.0%. The study also reveals that Qwen3-1.7B converges earlier, at step 70, while Qwen3-4B converges around step 110, suggesting smaller models reach capacity sooner during training.

Validation set results for Qwen3-1.7B show that EIT achieves an accuracy of 0.771 on clean prompts, 0.817 on Bandwagon bias with correct bias, and 0.748 on Bandwagon bias with wrong bias. Qwen3-4B trained with EIT achieves an accuracy of 0.808 on clean prompts, 0.841 on Bandwagon bias with correct bias, and 0.731 on Bandwagon bias with wrong bias.

Test set results further demonstrate EIT’s effectiveness: Qwen3-4B with EIT achieves 0.844 accuracy on clean prompts, 0.919 on Bandwagon bias with correct bias, and 0.833 on Bandwagon bias with wrong bias. Notably, EIT transfers to other semantic biases, but not to the structural position bias, where Qwen3-4B’s test-set accuracy actually slipped from 0.410 to 0.364. The training method alters the model’s learning process by making misleading cues, such as claims of consensus or appeals to authority, irrelevant to achieving a reward.

The approach employs a balanced conflict strategy, presenting scenarios where bias signals can support both correct and incorrect answers, alongside a reward system that discourages following biases without rewarding agreement with them. Qualitative analysis reveals that this training cultivates genuine reasoning, with models engaging relevant knowledge, performing explicit verification through computation, and overriding conflicting bias signals with justification, unlike supervised fine-tuning which often produces only performative independence.

The authors acknowledge that current models still exhibit limitations, notably the risk of superficial independence, where models mimic the language of independence without engaging in actual reasoning. Future research could scale this training to larger models and more complex bias scenarios, and investigate methods to further strengthen the connection between demonstrated independence and robust cognitive processes. These findings represent a step towards more reliable, unbiased automated judging systems, with implications for applications requiring objective evaluation and decision-making.

👉 More information
🗞 Making Bias Non-Predictive: Training Robust LLM Judges via Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2602.01528
Muhammad Rohail T.
