Self-consistency Sampling Enhances Outcome-reward-based Reinforcement Learning of Multimodal LLMs, Correcting Unfaithful Trajectories

Outcome-reward based reinforcement learning is becoming increasingly important for improving the reasoning abilities of multimodal large language models, but accurately evaluating the quality of the intermediate reasoning steps remains a key challenge. Jiahao Wang, Weiye Xu, and Aijun Yang from Xi’an Jiaotong University, together with Wengang Zhou from the University of Science and Technology of China and colleagues, address this problem with a new method called Self-Consistency Sampling. Their approach tackles the issue of models receiving equal credit for correct and flawed reasoning by assessing the reliability of a model’s thought process through visual perturbations and repeated trajectory analysis. The team demonstrates that integrating this method into existing reinforcement learning frameworks boosts accuracy on multiple multimodal benchmarks by up to 7.7 percentage points, with gains across different model architectures, offering a broadly applicable solution for refining the reasoning capabilities of these powerful systems.

Consistency Reward Improves Reasoning in Language Models

This research introduces Self-Consistency Sampling (SCS), a consistency-based reward designed to enhance the reasoning capabilities of large language models and multimodal large language models. The team observed that traditional reinforcement learning methods, which focus solely on answer accuracy, often fail to distinguish genuine reasoning from lucky guesses. SCS encourages models to produce consistent outputs across multiple attempts, leading to more reliable and trustworthy systems. The method generates several reasoning paths for a given question and rewards the model for agreement among those paths.
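To make the idea concrete, here is a minimal sketch of how an agreement-based reward of this kind could be computed. The function name, the use of exact string-matched answers, and the specific scoring rule are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of an agreement-based consistency reward.
# `final_answer` is the answer from the original trajectory; `resampled_answers`
# are answers extracted from additional reasoning paths for the same question.
from collections import Counter

def consistency_reward(resampled_answers: list[str], final_answer: str) -> float:
    """Return the fraction of resampled trajectories whose answer agrees
    with the original trajectory's answer."""
    if not resampled_answers:
        return 0.0
    counts = Counter(resampled_answers)
    return counts[final_answer] / len(resampled_answers)

# Example: three of four resampled paths reproduce the original answer "42".
print(consistency_reward(["42", "42", "17", "42"], "42"))  # 0.75
```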

Experiments demonstrate that SCS consistently improves performance across algorithms, including RLOO, GRPO, and REINFORCE++, by between 0.6 and 7.7 percentage points. The best results come from carefully balancing the truncation ratio against the number of resampled trajectories, with gains observed across different model sizes and architectures. While SCS increases training time by roughly 38%, the performance gains justify the added computational cost. Ablation studies confirm that both the truncation-resampling and visual-perturbation components of SCS are effective, with the full combination delivering the largest gains. These results establish SCS as a valuable technique for improving the trustworthiness and accuracy of multimodal large language models on complex reasoning tasks.

Consistent Reasoning Improves Multimodal Model Reliability

The research team has developed Self-Consistency Sampling (SCS), a method to improve the reliability of multimodal large language models (MLLMs) when solving complex reasoning problems. These models, increasingly used for tasks requiring both visual and textual understanding, often arrive at correct answers through flawed reasoning processes, hindering their trustworthiness. SCS addresses this by introducing small visual perturbations to input images and repeatedly truncating and resampling initial reasoning trajectories. The method then assesses the consistency among these trajectories, down-weighting unreliable reasoning paths during model training.
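The sketch below illustrates how these pieces might fit together to down-weight an outcome reward. The helpers `generate_continuation` and `perturb_image` are hypothetical stand-ins for the model's decoding step and the visual perturbation, and the multiplicative weighting is an assumption made here for illustration rather than the authors' exact formulation.

```python
# Hedged sketch: truncate a trajectory, resample continuations under a
# perturbed image, and scale the outcome reward by the agreement rate.
# `model.generate_continuation` and `perturb_image` are hypothetical helpers.

def perturb_image(image):
    # Placeholder: a real implementation would apply a small visual
    # perturbation such as added noise or a mild crop.
    return image

def scs_weighted_reward(model, image, question, trajectory: str, answer: str,
                        outcome_reward: float,
                        truncation_ratio: float = 0.5,
                        num_resamples: int = 4) -> float:
    """Down-weight the outcome reward by how often continuations resampled
    from a truncated prefix reproduce the original answer."""
    prefix = trajectory[: int(len(trajectory) * truncation_ratio)]
    agreements = 0
    for _ in range(num_resamples):
        # Continue the truncated reasoning under a slightly perturbed image.
        resampled = model.generate_continuation(perturb_image(image), question, prefix)
        agreements += int(resampled == answer)
    consistency = agreements / num_resamples
    # Trajectories whose continuations disagree receive a smaller reward.
    return outcome_reward * consistency
```

In a training loop, a weighted reward like this would simply replace the raw outcome reward fed to the underlying policy-gradient update.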

Experiments demonstrate that incorporating SCS into existing reinforcement learning algorithms, specifically RLOO, GRPO, and REINFORCE++, yields significant accuracy improvements across six multimodal benchmarks. Quantitative analysis revealed a 15.2% reduction in instances of unfaithful reasoning when using Qwen2.5-VL-7B-Instruct with SCS, compared to the baseline model. The authors acknowledge that the benefits of SCS diminish if the truncation ratio is either too small or too large, or if the number of resampled trajectories exceeds an optimal point. Despite these limitations, the research presents a significant advance toward ensuring that multimodal models not only provide correct answers but also demonstrate sound reasoning processes.

👉 More information
🗞 Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
🧠 ArXiv: https://arxiv.org/abs/2511.10648

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
