Researchers are increasingly recognising the potential of large language models to encode abstract concepts within their learned features. Aaditya Vikram Prasad, Connor Watts, and Jack Merullo of Goodfire AI, together with Dhruvil Gala, Owen Lewis, and Thomas McGrath, demonstrate a novel application of these features as a scalable source of supervision for open-ended tasks. Their work addresses the critical problem of hallucination in language models by introducing RLFR, a reinforcement learning pipeline that utilises feature probing to identify and correct uncertain claims. This approach not only reduces hallucination rates substantially, cutting them by 58% on Gemma-3-12B-IT, but also offers a pathway towards more interpretable and controllable artificial intelligence systems, representing a paradigm shift in how we leverage model understanding for improved learning.
Leveraging internal factuality representations to mitigate language model hallucinations
Researchers have unlocked a new method for reducing inaccuracies in large language models by leveraging internal features that represent concepts like factuality. This work introduces RLFR, or Reinforcement Learning from Feature Rewards, a pipeline that repurposes these internal model features as a scalable reward system for open-ended tasks.
Traditionally, such features have been used for monitoring or steering model behaviour at test time, but this study demonstrates their potential as direct supervision signals during training. The core innovation lies in transforming a model's internal "beliefs", gauged through a probing framework, into a dense, cost-effective reward for reinforcement learning.
This pipeline specifically targets the persistent problem of hallucinations in language models, teaching them to identify and correct potentially false statements. By identifying candidate hallucinated claims, the system trains the model to intervene and refine its responses when uncertainty about factual accuracy is detected.
Furthermore, the use of feature-based rewards enables efficient, scalable test-time computation, guiding the model towards more reliable outputs. Operationalising this process on the Gemma-3-12B-IT model resulted in a policy 58% less prone to hallucination than the original model, all while maintaining performance on established benchmarks.
The research introduces a novel paradigm, grounding supervision in the language of model features rather than relying on external verification. A key component is a decomposed probing protocol that monitors for hallucinations and rewards the model for subsequent retractions and corrections. This approach proves approximately 90 times cheaper per rewarded intervention than a ground-truth supervision source, offering a significant computational advantage. By effectively harnessing internal representations, this work paves the way for more reliable and trustworthy language models capable of tackling complex, open-ended tasks.
Reinforcement learning from internal model features for hallucination reduction
A decomposed probing protocol utilising model features underpinned the research into reducing model hallucinations. This pipeline first monitored for potential hallucinations by analysing internal model features, then rewarded retractions and corrections contingent on addressing those identified hallucinations.
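The two-stage protocol described above can be sketched in code. This is a minimal illustrative sketch, not the paper's implementation: the `Claim` structure, the `probe_score` field (standing in for a learned feature probe's factuality estimate), the threshold value, and the reward magnitudes are all assumptions chosen for clarity.

```python
# Hypothetical sketch of a decomposed probing reward: first flag claims
# the probe considers likely hallucinated, then reward only retractions
# that address a flagged claim.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    probe_score: float   # probe's factuality estimate in [0, 1] (assumed)
    retracted: bool      # did the model later retract/correct this claim?

def decomposed_reward(claims, uncertainty_threshold=0.5):
    """Pay out for corrections of flagged claims; penalise flagged
    claims left standing. Threshold and magnitudes are illustrative."""
    reward = 0.0
    for claim in claims:
        flagged = claim.probe_score < uncertainty_threshold
        if flagged and claim.retracted:
            reward += 1.0   # corrected a likely hallucination
        elif flagged:
            reward -= 1.0   # left a likely hallucination standing
    return reward

claims = [
    Claim("The Eiffel Tower is in Paris.", 0.95, retracted=False),
    Claim("It was completed in 1921.", 0.20, retracted=True),
]
print(decomposed_reward(claims))  # 1.0
```

Because the reward is contingent on the retraction addressing a specific flagged claim, the policy cannot game the signal by issuing blanket disclaimers.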
Specifically, the study implemented a reinforcement learning (RL) pipeline, termed RLFR, to leverage these features as reward functions. The work centred on Gemma-3-12B-IT, instantiating the approach to create a policy demonstrably less prone to hallucination. Experiments revealed the resulting policy reduced hallucinatory outputs by 58% compared to the original model, while maintaining performance on established benchmarks.
The feature-derived rewards proved an efficient alternative to external evaluators, costing approximately 90 times less per rewarded intervention than ground-truth supervision. Furthermore, the research extended beyond enabling RL by utilising feature-based rewards to facilitate scalable test-time computation.
Standard techniques, such as Best-of-N sampling, were employed to improve the trained policy’s performance. This involved leveraging the reward features to guide the selection of the most reliable completions from a set of generated outputs. The study highlights that features encode abstract concepts, such as factuality and intent, traditionally used for monitoring or steering, but proposes their use as scalable supervision for open-ended tasks. This framework represents a novel paradigm in interpretability research, positioning features as oversight signals for intentionally designing models with desirable capabilities.
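The Best-of-N step can be sketched as follows. This is an assumed, simplified illustration: `feature_reward` here is a mocked placeholder for the probe-based scorer that would, in the real pipeline, read internal feature activations.

```python
# Hypothetical Best-of-N selection guided by a feature-derived reward.

def feature_reward(completion: str) -> float:
    """Placeholder scorer. In the actual pipeline this would be computed
    from internal feature readouts; here it is mocked with a lookup."""
    mock_scores = {
        "Paris is the capital of France.": 0.9,
        "Lyon is the capital of France.": 0.2,
        "France has no capital.": 0.1,
    }
    return mock_scores.get(completion, 0.0)

def best_of_n(completions):
    """Return the sampled completion with the highest reward score."""
    return max(completions, key=feature_reward)

samples = [
    "Lyon is the capital of France.",
    "Paris is the capital of France.",
    "France has no capital.",
]
print(best_of_n(samples))  # Paris is the capital of France.
```

Because scoring only requires a forward pass plus a cheap probe readout, ranking N candidates this way scales far more economically than querying an external judge N times.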
Reinforcement Learning from Internal Features Significantly Reduces Hallucinations in Large Language Models
A 58% reduction in hallucination rates was achieved through a novel reinforcement learning pipeline. This work introduces RLFR, or Reinforcement Learning from Feature Rewards, which leverages internal model features as reward functions for open-ended tasks. A new probing framework identifies candidate hallucinated claims, enabling the model to intervene and correct completions when factual uncertainty is detected.
The pipeline, operationalised on Gemma-3-12B-IT, demonstrably decreased hallucinatory responses while maintaining performance on established benchmarks. This study grounds supervision in the language of model features, presenting a paradigm shift in utilising interpretability for learning complex behaviours.
The research focuses on mitigating hallucinations, a persistent challenge in large language models, by reinforcing factuality through reward signals derived from internal feature readouts. These feature readouts are calibrated to reflect the model’s confidence in the validity of claims, providing a dense and inexpensive supervision signal.
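A feature readout of this kind is often realised as a linear probe on hidden activations. The sketch below is a toy illustration under that assumption: the vectors, weights, and sigmoid calibration are invented for the example and are not the paper's actual probe.

```python
# Toy linear probe: dot a hidden-state vector with learned probe weights,
# then squash to [0, 1] so the readout reads as a confidence that the
# claim is factual. All numbers here are illustrative.
import math

def linear_probe(activation, weights, bias):
    logit = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid calibration

activation = [0.8, -0.3, 1.2]   # toy hidden state for one claim
weights = [1.0, 0.5, -0.2]      # toy learned probe direction
score = linear_probe(activation, weights, bias=0.0)
print(round(score, 3))  # 0.601
```

A scalar like this can be computed for every claim in a completion at negligible cost, which is what makes the supervision signal dense as well as inexpensive.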
The resulting policy exhibits a 58% lower propensity to hallucinate compared to the original model, signifying a substantial improvement in factual accuracy. Furthermore, the pipeline facilitates scalable test-time computation, again guided by the reward features extracted from the model. The probing framework effectively measures a model’s belief about concepts relevant to downstream tasks, such as the factual correctness of a statement.
This allows for the creation of a reward signal that directly addresses open-ended behaviours without relying on costly external verification. By repurposing these features as dense supervision, the research sidesteps the limitations of using large language models as judges, which can be slow and poorly calibrated.
The work introduces a novel application of model features, moving beyond traditional uses for test-time monitoring and steering. This approach enables reinforcement learning on behaviours that are difficult or impossible to verify directly, opening up possibilities for training models to exhibit more desirable and complex characteristics. The pipeline’s success with Gemma-3-12B-IT demonstrates the potential for broader application across various open-ended tasks and model architectures.
Mitigating Hallucinations via Reinforcement Learning from Learned Reward Features
Large language models demonstrate an ability to learn features representing abstract concepts such as factual accuracy and intention. These features are typically employed for monitoring performance or guiding model behaviour during use. Recent work presents an alternative application, utilising these features as a form of scalable supervision for more general tasks.
Specifically, the research addresses hallucination, the generation of factually incorrect statements, treating its reduction as a desirable yet challenging behaviour to instil in these models. A reinforcement learning pipeline, termed RLFR, was developed to leverage these learned features as reward functions.
This pipeline incorporates a novel method for identifying potentially hallucinated claims, enabling the model to intervene and correct its outputs when uncertainty regarding factual correctness is detected. Furthermore, the system allows for efficient computation during use, again guided by the reward features derived from the model’s internal representations.
When applied to the Gemma-3-12B-IT model, this process resulted in a policy that exhibited a 58% reduction in hallucination rates, without compromising performance on established benchmark tests. This work introduces a new approach to utilising interpretability within language models to facilitate learning for open-ended tasks.
The key finding is that features encoding abstract concepts within a language model can be repurposed as rewards to train the model to reduce factual errors. This demonstrates a pathway towards improving the reliability of generated text without requiring extensive human annotation or task-specific training data.
The authors acknowledge that the current pipeline relies on a probing framework to identify hallucinated claims, which may not be perfect and could introduce its own biases. Future research directions include exploring more robust methods for identifying and correcting inaccuracies, as well as extending this approach to other open-ended tasks beyond hallucination reduction.
👉 More information
🗞 Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
🧠 ArXiv: https://arxiv.org/abs/2602.10067
