Large language models (LLMs) still struggle to solve complex mathematical problems consistently, and researchers are continually working to strengthen their reasoning. Ali Hatamizadeh, Shrimai Prabhumoye, and Igor Gitman, alongside Ximing Lu, Seungju Han, and Wei Ping, present Iterative Group Relative Policy Optimization (iGRPO), a novel approach that significantly improves performance through dynamic self-conditioning. Their two-stage reinforcement learning framework has the model iteratively generate and refine its own solutions, learning from its strongest prior attempts. iGRPO consistently outperforms existing methods such as GRPO on diverse benchmarks and achieves state-of-the-art results with OpenReasoning-Nemotron-7B trained on AceReason-Math, highlighting the potential of self-feedback mechanisms to advance verifiable mathematical reasoning and build more reliable LLMs.
This work addresses the limitations of current models in consistently solving complex problems by introducing a self-feedback mechanism inspired by human problem-solving strategies. iGRPO operates in two stages: the model first generates multiple draft solutions and selects the most promising one, then uses this “best draft” to condition its subsequent refinement attempts. The core innovation is this dynamic self-conditioning, which lets the model learn from its own outputs and improves accuracy and reliability. Experiments demonstrate that iGRPO consistently outperforms standard GRPO across various base models, including Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled, validating its effectiveness on diverse reasoning benchmarks.
Notably, applying iGRPO to the OpenReasoning-Nemotron-7B model, trained on the AceReason-Math dataset, achieves a new state-of-the-art accuracy of 85.62% on the AIME24 benchmark. This represents a substantial improvement in performance and establishes a new standard for verifiable mathematical reasoning.
Furthermore, the model attained 79.64% on the AIME25 benchmark, demonstrating consistent gains across challenging datasets. Ablation studies confirm that the refinement wrapper generalizes beyond GRPO variants and benefits from a generative judge, altering learning dynamics by delaying entropy collapse.
These results highlight the potential of iterative, self-feedback-based reinforcement learning for unlocking more robust and accurate reasoning in large language models. Mechanically, iGRPO computes group-relative rewards over the sampled drafts and identifies the highest-scoring one, which serves as a “first-draft” output. This best draft is appended to the original prompt, providing the model with explicit self-feedback. The model then generates refined responses conditioned on this augmented context, effectively training the policy to surpass its previous best attempt. Because iGRPO continues to rely on the same group-based reward signals, it preserves the computational efficiency of GRPO and introduces minimal overhead.
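The two-stage loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `sample_group`, `igrpo_step`, and the prompt template are hypothetical stand-ins, and `policy` and `reward_fn` are assumed to be caller-supplied generation and verifiable-reward functions.

```python
def sample_group(policy, prompt, group_size):
    # Sample a group of candidate solutions from the current policy.
    return [policy(prompt) for _ in range(group_size)]

def igrpo_step(policy, prompt, reward_fn, group_size=8):
    # Stage 1: generate drafts and score each with the verifiable reward.
    drafts = sample_group(policy, prompt, group_size)
    rewards = [reward_fn(prompt, d) for d in drafts]
    best_draft = drafts[rewards.index(max(rewards))]

    # Stage 2: append the best draft to the prompt as self-feedback and
    # sample refinements; a GRPO-style update then trains the policy to
    # surpass this previous best attempt.
    augmented = f"{prompt}\n\nBest previous attempt:\n{best_draft}\n\nRefined solution:"
    refinements = sample_group(policy, augmented, group_size)
    return best_draft, refinements
```

The refinements would then be scored with the same reward function and used for the group-based policy update, exactly as in standard GRPO but conditioned on the augmented context.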
Experiments were conducted using DeepSeek-R1 Distilled and OpenMath-Nemotron base models, trained on the Mathematics Aptitude Test of Heuristics (MATH) dataset, with performance assessed on AIME24, AIME25, MATH500, AMC23, GSM8K, and Minerva Math. Applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math yields the state-of-the-art 85.62% accuracy on AIME24, along with 79.64% on AIME25. At each training step, iGRPO samples multiple drafts, selects the highest-reward option, and appends it to the original prompt, enabling a GRPO-style update focused on draft-conditioned refinements.
Through this iterative process, the model is trained to surpass its initial strongest attempt, fostering continuous improvement in reasoning capabilities. Evaluations across base models, including Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled, consistently show iGRPO outperforming standard GRPO under equivalent training conditions.
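iGRPO inherits GRPO's group-based reward signal, in which each sampled response is judged relative to its siblings from the same prompt. A minimal sketch of that group-relative normalization (illustrative only; the exact normalization details in the paper may differ):

```python
import statistics

def group_relative_advantages(rewards):
    # GRPO-style advantage: standardize each reward against its own group,
    # so a response is credited only for beating its sibling samples.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # A group with uniform rewards carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Under iGRPO, the same computation is applied to the draft-conditioned refinements, so the policy gradient pushes the model toward responses that outscore its own best prior attempt.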
Ablation studies add detail to these findings. The refinement wrapper is broadly applicable, functioning effectively with various group-based policy optimization methods beyond the GRPO family, and it benefits from using a generative judge to score drafts. The analyses also show that iGRPO alters learning dynamics by delaying entropy collapse, sustaining exploration during training and mitigating premature convergence, which contributes to more stable and reliable performance.
The authors acknowledge that while iGRPO delays entropy collapse, final entropy levels remain comparable to those of standard methods, suggesting the gains are primarily attributable to enhanced mid-training exploration. Future research could explore the application of iGRPO to other reasoning tasks and investigate the potential for further improvements through modifications to the reward structure or refinement process.
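Entropy collapse is typically diagnosed by tracking the mean token-level entropy of the policy's output distributions during training: a slide toward zero means the model has stopped exploring. A hedged sketch of that diagnostic (generic Shannon entropy, not code from the paper):

```python
import math

def token_entropy(probs):
    # Shannon entropy (in nats) of a single next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_policy_entropy(token_distributions):
    # Average entropy over a trajectory's token distributions; a steady
    # decline toward zero during training signals entropy collapse.
    return sum(token_entropy(d) for d in token_distributions) / len(token_distributions)
```

On this view, the authors' observation means iGRPO keeps this curve elevated for longer mid-training, even though it ends near the same final value as standard methods.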
These findings highlight the potential of iterative self-feedback for advancing verifiable mathematical reasoning in large language models. The reported accuracies represent a significant step toward more reliable, consistent performance on challenging mathematical benchmarks, paving the way for more trustworthy and capable AI systems.
👉 More information
🗞 iGRPO: Self-Feedback-Driven LLM Reasoning
🧠 ArXiv: https://arxiv.org/abs/2602.09000
