Recent advances in large language models increasingly depend on deliberate, step-by-step reasoning, with strategies like ‘think twice’ demonstrating the benefits of a second review. Most reward models, however, still condense complex qualities into a single assessment, leading to diluted focus and superficial analysis. Yizhu Jiao from the University of Illinois Urbana-Champaign, Jiaqi Zeng and Julien Veron Vialard from NVIDIA, along with colleagues, introduce branch-and-rethink (BR-RM), a new approach that applies the ‘think twice’ principle to reward modeling. This two-stage system first identifies the evaluation areas that matter most for a given response, such as factual accuracy and safety, then performs a focused re-evaluation that scrutinizes only the most relevant information. By shifting from all-encompassing scoring to targeted reasoning, BR-RM reduces analytical diffusion and improves the detection of subtle errors while remaining practical for large-scale applications. Experiments demonstrate state-of-the-art performance on challenging reward benchmarks.
Adaptive Branching Improves Coding Reward Models
This research presents a significant advance in reward modeling for artificial intelligence, addressing a key limitation of traditional models, which compress many distinct quality criteria into a single, coarse score. Scientists developed a novel framework, branch-and-rethink (BR-RM), designed to improve the accuracy and focus of reward signals used in large language model training. The model mimics a deliberate, two-pass reasoning process, first narrowing attention to critical dimensions and then re-examining the response with targeted scrutiny. In the initial phase, adaptive branching, the model selects a small set of instance-relevant dimensions, such as factual accuracy and safety, and generates a concise sketch highlighting potential weaknesses.
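To make the first turn concrete, the sketch below shows one plausible way to structure an adaptive-branching call: the judge is asked to pick at most a few dimensions and to emit a short weakness sketch in a tagged format that can be parsed reliably. The dimension list, prompt wording, and tag names are illustrative assumptions, not the authors’ released implementation.

```python
import re

# Candidate evaluation dimensions; the taxonomy actually used by BR-RM is
# defined in the paper, so this list is an illustrative placeholder.
DIMENSIONS = ["factual accuracy", "logical reasoning", "code correctness",
              "instruction following", "safety", "clarity"]

def branch_prompt(question: str, response: str) -> str:
    """Turn 1 (adaptive branching): ask the judge to pick a small set of
    instance-critical dimensions and sketch hypotheses about weaknesses."""
    return (
        "You are a reward model judging a candidate response.\n"
        f"Question:\n{question}\n\nResponse:\n{response}\n\n"
        f"Select at most 3 dimensions from: {', '.join(DIMENSIONS)}.\n"
        "Reply exactly in the form:\n"
        "<branch>dimension 1; dimension 2</branch>\n"
        "<sketch>one short weakness hypothesis per selected dimension</sketch>"
    )

def parse_branch(turn1_output: str) -> tuple[list[str], str]:
    """Extract the flagged dimensions and the weakness sketch from turn 1."""
    dims = re.search(r"<branch>(.*?)</branch>", turn1_output, re.S)
    sketch = re.search(r"<sketch>(.*?)</sketch>", turn1_output, re.S)
    if not (dims and sketch):
        raise ValueError("turn-1 output violates the expected format")
    return [d.strip() for d in dims.group(1).split(";")], sketch.group(1).strip()
```

Turn 1’s output is deliberately short; its only job is to decide where the second turn should spend its scrutiny.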
This focused analysis replaces broad assessment with targeted investigation, concentrating analytical resources where the risks are highest. The second turn, branch-conditioned rethinking, uses these findings to re-read the response through the lens of the flagged dimensions, verifying facts and checking reasoning. To demonstrate the model’s effectiveness, the researchers used a C++ coding task requiring a function that returns, in sorted order, the integers from an input vector that contain no even digits. The model successfully distinguished between two functionally correct responses, ranking the more complete and usable one higher because it included a runnable example.
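The paper’s illustration is a C++ task; the analogue below is written in Python to stay consistent with the other sketches in this article, and the reported preference (the candidate that ships a runnable example wins) is the paper’s judgment, not a property of this code.

```python
def no_even_digits(xs: list[int]) -> list[int]:
    """Return, in ascending order, the input integers none of whose digits are even."""
    return sorted(x for x in xs if all(int(d) % 2 == 1 for d in str(abs(x))))

# Candidate A submits only the function above (functionally correct).
# Candidate B submits the same function plus the runnable usage example below;
# BR-RM reportedly ranked B higher for completeness and usability.
if __name__ == "__main__":
    print(no_even_digits([15, 33, 1422, 1]))  # -> [1, 15, 33]
```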
This demonstrates the model’s ability to identify subtle quality differences often missed by conventional reward systems. Extensive experiments confirm that BR-RM achieves state-of-the-art performance across three challenging reward modeling benchmarks, consistently outperforming strong baselines in diverse domains such as reasoning, general knowledge, safety, and alignment with human preferences. Analysis reveals that the model adaptively concentrates its generative analysis on a few critical dimensions for each task, unlike previous models that spread attention broadly. The team trained the model using a reinforcement learning approach, employing strict format checks to ensure clean supervision and compatibility with standard infrastructure.
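The training detail mentioned above, reinforcement learning with strict format checks, can be pictured as a small reward gate: a rollout that violates the expected two-turn tag structure earns nothing, so only cleanly formatted judgments contribute learning signal. The tag schema and the zero-out rule are assumptions for illustration; the paper’s exact reward shaping and RL algorithm may differ.

```python
# Expected skeleton of a two-turn judgment; the tag names are illustrative.
REQUIRED_TAGS = ["<branch>", "</branch>", "<sketch>", "</sketch>",
                 "<verdict>", "</verdict>"]

def passes_format(judge_output: str) -> bool:
    """Strict format gate: every required tag must appear, in order."""
    pos = -1
    for tag in REQUIRED_TAGS:
        pos = judge_output.find(tag, pos + 1)
        if pos == -1:
            return False
    return True

def shaped_reward(judge_output: str, task_reward: float) -> float:
    """Malformed rollouts earn zero regardless of their verdict, keeping the
    supervision clean and the outputs parseable by standard infrastructure."""
    return task_reward if passes_format(judge_output) else 0.0
```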
The strong performance of the model suggests that combining advanced language models with sophisticated reward modeling techniques is a promising direction for future research. The researchers will soon release the code and models developed during this study, enabling further exploration and refinement of this innovative approach. This work represents a step towards building AI that not only works but also excels.
Branch and Rethink Improves Reward Model Focus
Scientists developed a novel reward modeling framework, branch-and-rethink (BR-RM), designed to improve the accuracy and focus of reward signals used in large language model training. This work addresses the problem of “judgment diffusion,” where existing reward models spread their attention too thinly across multiple evaluation criteria, leading to shallow analysis and missed errors. The BR-RM framework operates in two turns, mirroring the benefits of a deliberate second pass in reasoning tasks. In the first turn, the model performs adaptive branching, selecting a small set of instance-critical dimensions, such as factuality and safety, and sketching concise hypotheses about potential weaknesses in a given response.
This focused analysis replaces broad assessment with targeted investigation. The second turn then executes branch-conditioned rethinking, where the model re-reads the response specifically through the lens of the flagged dimensions, verifying facts, checking reasoning, and examining potential bugs. This targeted re-evaluation demonstrably reduces judgment diffusion and minimizes bias driven by stylistic elements. Experiments demonstrate that this two-turn approach achieves state-of-the-art performance on three challenging reward modeling benchmarks. Analysis reveals that BR-RM concentrates its generative analysis on critical dimensions, unlike recent reasoning reward models that spread attention broadly. The team trained the model using a reinforcement learning approach, employing strict format checks to ensure clean supervision and compatibility with standard infrastructure. The result is a focused, second-look reasoning process that improves sensitivity to subtle errors while remaining practical and scalable for large language model training pipelines.
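One way to picture the second turn: the flagged dimensions and the turn-1 sketch are fed back as context, and the judge is instructed to re-read the response only against those dimensions before committing to a verdict. As with the earlier sketches, the prompt wording and the 1-10 scoring convention are assumptions, not the paper’s exact template.

```python
def rethink_prompt(question: str, response: str,
                   dimensions: list[str], sketch: str) -> str:
    """Turn 2 (branch-conditioned rethinking): re-read the response strictly
    through the lens of the dimensions flagged in turn 1."""
    return (
        f"Flagged dimensions: {', '.join(dimensions)}\n"
        f"Turn-1 weakness sketch: {sketch}\n\n"
        "Re-read the response below ONLY with respect to the flagged dimensions:\n"
        "verify stated facts, re-check each reasoning step, and look for bugs.\n"
        f"Question:\n{question}\n\nResponse:\n{response}\n\n"
        "End with <verdict>an overall score from 1 to 10</verdict>."
    )
```

Conditioning the second pass on turn 1’s own hypotheses is what keeps the scrutiny narrow: the judge is not re-grading everything, only the places it already flagged as suspect.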
Two-Stage Judgment Improves Preference Reliability
This work introduces a novel reward model that reframes judgment as a two-stage process, mirroring the benefits of a “think twice” approach used in solving complex tasks. The model first identifies a focused set of critical hypotheses relevant to an output, then re-evaluates the response specifically through the lens of those hypotheses. The initial phase involves the model selecting a small set of relevant cognitive dimensions and generating a concise sketch highlighting potential weaknesses. This focused analysis replaces broad assessment with targeted investigation, concentrating analytical resources where risks are highest.
The second turn then re-reads the response, specifically through the lens of the flagged dimensions, verifying facts and checking reasoning. This targeted re-evaluation demonstrably reduces judgment diffusion, a common problem where reward signals become diluted across multiple evaluation criteria, and minimizes bias driven by stylistic elements. Consequently, the model exhibits improved sensitivity to subtle factual or logical errors, leading to more reliable preference judgments. Experimental results confirm the effectiveness of this approach, demonstrating state-of-the-art performance on challenging reward benchmarks. The authors acknowledge that the model’s performance relies on in-domain data for complex reasoning. Future research directions include integrating external verification tools, such as retrieval systems and code execution.
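For preference data, the two-turn judgment can be reduced to a pairwise choice by scoring each candidate and keeping the higher verdict, which is one simple way the improved error sensitivity would translate into more reliable preferences. The `two_turn_judge` callable and the verdict-parsing pattern below are placeholders for whatever serving stack hosts the judge, not the authors’ pipeline.

```python
import re
from typing import Callable

def extract_score(judge_output: str) -> float:
    """Pull the numeric score out of the <verdict> tag of a turn-2 judgment."""
    match = re.search(r"<verdict>\D*?([0-9]+(?:\.[0-9]+)?)", judge_output)
    if match is None:
        raise ValueError("no parsable verdict in judge output")
    return float(match.group(1))

def prefer(question: str, response_a: str, response_b: str,
           two_turn_judge: Callable[[str, str], str]) -> str:
    """Return 'A' or 'B' according to which response the two-turn judge scores
    higher; two_turn_judge(question, response) is assumed to run both turns."""
    score_a = extract_score(two_turn_judge(question, response_a))
    score_b = extract_score(two_turn_judge(question, response_b))
    return "A" if score_a >= score_b else "B"
```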
👉 More information
🗞 Think Twice: Branch-and-Rethink Reasoning Reward Model
🧠 arXiv: https://arxiv.org/abs/2510.23596
