Scaling supervised fine-tuning (SFT) data, particularly increasing the number of prompts, improves reasoning performance in large language models. Subsequent reinforcement learning (RL) further enhances these models, with optimal results achieved when maintaining a temperature-adjusted entropy of approximately 0.3 during training. The resulting AceReason-Nemotron-1.1 7B model surpasses existing Qwen2.5-7B-based models on complex mathematical and coding benchmarks.
The pursuit of artificial intelligence capable of robust mathematical and computational reasoning continues to drive innovation in machine learning. Recent research focuses on optimising the training methodologies for large language models, specifically exploring the combined benefits of supervised fine-tuning (SFT) and reinforcement learning (RL). This approach aims to create models that not only generate plausible text but also reliably solve complex problems. Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping detail their work in “AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy”, presenting a new model, AceReason-Nemotron-1.1, which demonstrates improved performance on challenging benchmarks when compared to its predecessor and other models based on the Qwen2.5-7B architecture. Their findings highlight the importance of both a strong initial supervised learning phase and careful calibration of exploration-exploitation balance during reinforcement learning.
Recent investigations reveal a systematic approach to augmenting the reasoning capabilities of large language models (LLMs), employing a sequential methodology of supervised fine-tuning (SFT) and reinforcement learning (RL). This research demonstrates that a combined strategy consistently outperforms either technique applied in isolation. Supervised fine-tuning, a process where the LLM learns from labelled examples, establishes a foundational understanding, while reinforcement learning subsequently refines this knowledge through a reward system, encouraging desired behaviours.
Crucially, the scaling of data during the SFT phase appears to be a key determinant of performance. Results indicate that increasing the number of unique prompts used during supervised learning yields more substantial gains than increasing the number of responses generated per prompt, as sketched below. This suggests that diversity in the training data, exposing the model to a wider range of reasoning challenges, is more beneficial than simply reinforcing existing patterns.
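To make the two scaling axes concrete, here is a minimal Python sketch of assembling an SFT set either by widening the pool of unique prompts or by deepening the number of responses per prompt. The names `prompt_pool`, `gen_fn`, and the example dataset sizes are illustrative assumptions, not details taken from the paper.

```python
import random

def build_sft_set(prompt_pool, gen_fn, num_prompts, responses_per_prompt, seed=0):
    """Assemble an SFT dataset of (prompt, response) pairs.

    `prompt_pool` is a list of candidate prompts; `gen_fn` is a placeholder
    for whatever teacher model produces a reasoning trace for a prompt.
    """
    rng = random.Random(seed)
    prompts = rng.sample(prompt_pool, num_prompts)
    return [(p, gen_fn(p)) for p in prompts for _ in range(responses_per_prompt)]

# Two ways to reach the same dataset size; the reported finding is that the
# first axis (more unique prompts) helps more than the second.
# wide = build_sft_set(pool, teacher, num_prompts=64_000, responses_per_prompt=1)
# deep = build_sft_set(pool, teacher, num_prompts=8_000,  responses_per_prompt=8)
```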
Reinforcement learning functions effectively as a corrective mechanism, particularly adept at addressing deficiencies present in the initial SFT model. Even models exhibiting weaker performance following supervised fine-tuning can achieve high levels of accuracy with effective RL training, diminishing the initial performance gap between weaker and stronger SFT baselines. This suggests that RL is not merely amplifying existing strengths, but actively rectifying weaknesses.
Optimising hyperparameters during reinforcement learning is also significant. Rather than prescribing a fixed sampling temperature, the research identifies an optimal operating point at a temperature-adjusted entropy of approximately 0.3, balancing exploration (discovering novel solutions) against exploitation (refining existing knowledge). A lower temperature encourages the model to stick to established patterns, while a higher temperature introduces more randomness, potentially leading to innovative but less reliable outputs.
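As a rough illustration, the sketch below uses one plausible reading of "temperature-adjusted entropy" (the mean token entropy of the temperature-scaled softmax distribution) and picks the sampling temperature whose entropy lands closest to the 0.3 target. The function names and the temperature grid are assumptions for illustration, not the authors' implementation.

```python
import torch

def temperature_adjusted_entropy(logits: torch.Tensor, temperature: float) -> float:
    """Mean per-token entropy of the softmax distribution after scaling logits
    by 1/temperature. `logits` has shape (num_tokens, vocab_size)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy.mean().item()

def calibrate_temperature(logits, target=0.3, grid=(0.6, 0.8, 1.0, 1.2, 1.4)):
    """Pick the sampling temperature whose adjusted entropy is closest to `target`."""
    return min(grid, key=lambda t: abs(temperature_adjusted_entropy(logits, t) - target))
```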
Notably, the acquired reasoning skills demonstrate a degree of transferability. Performance gains observed on mathematical problem-solving benchmarks, such as AIME24 and AIME25, extend to coding challenges, specifically the LiveCodeBench V5 & V6 suites. This suggests the development of more general problem-solving abilities, rather than skills narrowly tailored to a specific domain. The evaluation metric, Pass@k, assesses the probability of solving a mathematical problem within k attempts, while problem-level solving rates provide a direct measure of success on coding benchmarks.
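For reference, pass@k is conventionally reported with the standard unbiased estimator over n sampled generations of which c are correct. The short sketch below implements that estimator; the paper's exact sample counts and averaging procedure are not specified here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts drawn
    (without replacement) from n generations is among the c correct ones."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 generations of which 4 are correct, evaluated at k = 8.
# pass_at_k(16, 4, 8) ≈ 0.96
```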
The resultant model, AceReason-Nemotron-1.1 7B, constructed upon the Qwen2.5-7B architecture, surpasses its predecessor, AceReason-Nemotron-1.0, and establishes a new performance benchmark amongst comparable models. This advancement underscores the efficacy of the combined SFT and RL approach, offering a promising pathway towards more robust and versatile LLMs.
👉 More information
🗞 AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
🧠 DOI: https://doi.org/10.48550/arXiv.2506.13284
