AI Learns from Its Mistakes with New ‘Positive-Negative Pairing’ Technique

Researchers are tackling the challenge of reliably training large language models to reason on tasks with deterministic outcomes, a key area for advancement in artificial intelligence. Xin Sheng, Jiaxin Li, and Yujuan Pang from Beijing University of Posts and Telecommunications, alongside Ran Peng from Sichuan Agricultural University and Yong Ma, demonstrate a novel approach to reinforcement learning with verifiable rewards (RLVR) that moves beyond simple variance-based prompt selection. Their work highlights the importance of identifying prompts that not only succeed but also reveal critical failure points, enabling more stable and effective optimisation. By introducing positive-negative pairing and weighted group-normalised advantages, the team amplifies learning signals from both successful and unsuccessful attempts, achieving significant performance gains with the Qwen2.5-Math-7B and Qwen2.5-Math-7B-Instruct models on mathematical reasoning benchmarks, and rivalling results obtained with substantially larger prompt pools.

Bidirectional Prompting via Positive and Negative Pairing for Enhanced Reinforcement Learning improves sample efficiency and policy generalization

Scientists have developed a novel approach to reinforcement learning with verifiable rewards (RLVR) that significantly enhances the mathematical reasoning capabilities of large language models. This work addresses a critical challenge in RLVR: efficient prompt selection, particularly when working with limited data.
Researchers demonstrate that pairing a single ‘hard-but-solvable’ prompt with an ‘easy-but-brittle’ prompt consistently outperforms methods relying on traditional variance-based selection heuristics. The core innovation lies in amplifying the informative signals from rare events (successes on difficult prompts and failures on simpler ones) to create a bidirectional teaching mechanism.

This study introduces ‘positive-negative pairing’, a technique where prompts are selected to provide both a reliable positive anchor and explicit negative learning signals. A hard-but-solvable prompt, exhibiting a low but non-zero success rate, generates sharp positive guidance when solved. Simultaneously, an easy-but-brittle prompt, with a high but imperfect success rate, delivers strong negative penalties upon failure.

This pairing concentrates learning on the most informative instances, improving sample efficiency and preventing suppression of exploration. The researchers further refined this approach with Weighted GRPO, a method that reweights binary outcomes at the pair level and uses group-normalized advantages to amplify these rare events.
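
To make that intuition concrete, the Python sketch below shows how plain group normalisation already turns a rare success (or rare failure) into an outsized advantage; the eight-rollout groups and the reward patterns are illustrative choices, not values from the paper.

import numpy as np

def group_normalised_advantages(rewards, eps=1e-8):
    # GRPO-style normalisation: centre and scale binary rewards within one prompt's rollouts.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

hard_rewards = [1, 0, 0, 0, 0, 0, 0, 0]  # hard-but-solvable: one success in eight rollouts
easy_rewards = [1, 1, 1, 1, 1, 1, 1, 0]  # easy-but-brittle: one failure in eight rollouts

print(group_normalised_advantages(hard_rewards))  # rare success gets an advantage of about +2.65
print(group_normalised_advantages(easy_rewards))  # rare failure gets an advantage of about -2.65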

Evaluated on the Qwen2.5-Math-7B model, the paired approach outperformed variance-based baselines, and similar improvements were observed on the Qwen2.5-Math-7B-Instruct model, demonstrating the robustness and generalizability of this new technique. This research suggests that, with careful prompt selection, achieving significant advancements in LLM mathematical reasoning requires surprisingly little data.

Hard and easy prompt pairing with weighted group relative policy optimisation shows promising results

A positive-negative pairing strategy underpinned the reinforcement learning methodology employed in this work. At each update step, the research team sampled one hard-but-solvable prompt and one easy-but-brittle prompt, carefully characterizing these based on empirical success rates obtained through multiple rollouts.

The hard prompt was defined by a low success rate, yet still solvable, ensuring rare successes would provide strong positive guidance for the language model. Conversely, the easy prompt exhibited a high success rate but was prone to occasional failures, generating strong negative penalties from these rare instances.
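
A minimal sketch of how this selection step could be implemented is shown below; the rollout count, the “low but non-zero” and “high but imperfect” success-rate bands, and the generate/verify interfaces are hypothetical choices for illustration, since the paper’s exact thresholds are not given in this summary.

import random

def empirical_success_rate(prompt, generate, verify, n_rollouts=16):
    # Estimate a prompt's success rate from n_rollouts sampled completions,
    # each scored by a verifiable binary reward (1 = correct final answer, 0 = incorrect).
    wins = sum(verify(prompt, generate(prompt)) for _ in range(n_rollouts))
    return wins / n_rollouts

def select_pair(prompts, generate, verify,
                hard_band=(0.05, 0.30),   # hypothetical "low but non-zero" band
                easy_band=(0.70, 0.95)):  # hypothetical "high but imperfect" band
    # Pick one hard-but-solvable and one easy-but-brittle prompt for the update step.
    rates = {p: empirical_success_rate(p, generate, verify) for p in prompts}
    hard = [p for p, r in rates.items() if hard_band[0] <= r <= hard_band[1]]
    easy = [p for p, r in rates.items() if easy_band[0] <= r <= easy_band[1]]
    if not hard or not easy:
        raise ValueError("No prompt falls inside the hard/easy bands; widen the bands.")
    return random.choice(hard), random.choice(easy)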

This paired approach was implemented using Weighted GRPO, a modified optimization algorithm that reweights binary outcomes at the pair level. Rather than treating every success or failure equally, the algorithm adjusts their weights to amplify the impact of rare events: group-normalized advantages boost rare successes into sharp positive reinforcement signals and convert rare failures into strong negative penalties.
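
The exact weighting formula is not spelled out in this summary, so the sketch below shows one plausible way to combine pair-level reweighting with group-normalised advantages; the inverse-rarity weights are an assumption made for illustration, not the paper’s published scheme.

import numpy as np

def weighted_pair_advantages(hard_rewards, easy_rewards, eps=1e-8):
    # Sketch of a Weighted-GRPO-style advantage for one (hard, easy) prompt pair.
    # Rewards are binary (1 = verified success, 0 = failure). Each prompt's rollouts
    # are first group-normalised, then reweighted so that rare events (successes on
    # the hard prompt, failures on the easy prompt) dominate the update.
    def normalise(rewards):
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + eps), r.mean()

    adv_hard, p_hard = normalise(hard_rewards)  # p_hard: empirical success rate on the hard prompt
    adv_easy, p_easy = normalise(easy_rewards)  # p_easy: empirical success rate on the easy prompt

    # Illustrative inverse-rarity weights: up-weight rare successes on the hard prompt
    # and rare failures on the easy prompt.
    w_hard = np.where(np.asarray(hard_rewards) == 1, 1.0 / (p_hard + eps), 1.0)
    w_easy = np.where(np.asarray(easy_rewards) == 0, 1.0 / (1.0 - p_easy + eps), 1.0)

    return w_hard * adv_hard, w_easy * adv_easy

In a full training loop, weighted advantages of this kind would take the place of the standard GRPO advantages inside the clipped policy-gradient objective.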

This bidirectional signal aimed to provide more informative learning feedback, enhancing sample efficiency without unduly restricting exploration. Performance was evaluated on Qwen2.5-Math-7B, where a single paired minibatch per update consistently outperformed a GRPO baseline utilizing variance-based prompt selection heuristics.

Paired prompt optimisation surpasses variance-based selection on mathematical reasoning benchmarks using only two training prompts

Similar gains were observed on Qwen2.5-Math-7B-Instruct, demonstrating the robustness of the approach. The work demonstrates that employing an easy-but-brittle prompt alongside a hard-but-solvable prompt yields a more stable optimization direction than variance-based selection. Across AIME 2025 and AMC23, a clear gap emerged, particularly at moderate k values, with representative improvements such as a rise from 16.8 to 22.2 on AIME 2025 at k=8 and from 94.0 to 97.0 on AMC23 at k=64.

The improvement on MATH500 was smaller but remained generally consistent across all tested values of k. For Qwen2.5-Math-7B, the method recovered 41.6 at k=64 using three orders of magnitude fewer training prompts than the GRPO+DSR-sub baseline. On AMC23, WGRPO+{p1209, p12} surpassed GRPO+DSR-sub at larger k, achieving a score of 97.0 versus 95.6 at k=64, and on MATH500 it became best or near-best for k ≥ 4, reaching 89.9 versus 87.9 at k=8 and 92.4 versus 90.0 at k=16.
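
The k-indexed scores above are consistent with a pass@k-style evaluation; assuming that reading (our inference, not something stated explicitly in this summary), the standard unbiased pass@k estimator can be computed as follows.

from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k: probability that at least one of k samples drawn without
    # replacement from n generated samples (c of them correct) is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(64, 20, 8), 3))  # e.g. 64 samples per problem, 20 correct, evaluated at k=8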

Replacing WGRPO with GRPO while maintaining the same two training prompts resulted in a consistent drop in performance, as demonstrated on AIME 2025, where scores fell from 22.2 to 16.3 at k=8 and from 29.0 to 24.1 at k=16. Similarly, on MATH500, the gap was notable, with scores dropping from 89.9 to 86.5 at k=8 and from 92.4 to 90.6 at k=16, indicating the necessity of reshaping the training signal to prioritize rare but meaningful outcomes.

Positive-negative pairing enhances reinforcement learning for mathematical problem solving by improving exploration and credit assignment

A new approach to prompt selection within reinforcement learning with verifiable rewards demonstrably improves performance on reasoning tasks with deterministic outcomes. This work introduces positive-negative pairing, a method for selecting training prompts that combines a challenging yet solvable prompt with an easy but fragile one.

This pairing strategy, coupled with a weighted group-normalized policy gradient method, amplifies rare successes and penalises rare failures, creating a more stable and informative learning signal. The research demonstrates consistent gains in mathematical reasoning benchmarks, including AIME 2025 and AMC23, using only two fixed training prompts.

This represents a significant improvement in sample efficiency compared to existing reinforcement learning methods that require large prompt pools. By focusing on rare events and bidirectional feedback, the technique reduces sensitivity to sampling noise and stabilises the optimisation process. The authors acknowledge that their method was tested in a low-data regime and may not generalise to scenarios with abundant training data.

Future research could explore the application of this prompt selection strategy to other reinforcement learning tasks and investigate its effectiveness with different model architectures. The findings suggest that carefully structuring the training signal is particularly crucial when data is limited, offering a pathway towards more efficient and robust training of large language models for mathematical reasoning and potentially other complex domains.

👉 More information
🗞 Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
🧠 ArXiv: https://arxiv.org/abs/2602.03452

Rohail T.

As a quantum scientist exploring the frontiers of physics and technology, I focus on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
