Efficient Reinforcement Learning Scales Reasoning in Large Language Models

Truncated Proximal Policy Optimisation (T-PPO) enhances the training efficiency of large language models used for complex reasoning tasks. T-PPO utilises Extended Generalised Advantage Estimation to learn from incomplete responses and independently optimises policy and value models, achieving up to a 2.5x speed increase on the AIME 2024 benchmark.

The pursuit of enhanced reasoning capabilities in large language models (LLMs) increasingly focuses on techniques that allow these models to generate extended chains-of-thought, mimicking human problem-solving processes. A key element in developing such models is reinforcement learning, specifically algorithms like Proximal Policy Optimization (PPO), which enables learning through iterative refinement. However, the computational demands of PPO escalate significantly with longer generated sequences, creating a bottleneck in training efficiency. Researchers at ByteDance Seed address this challenge with a novel approach detailed in their article, ‘Truncated Proximal Policy Optimization’. Their work introduces modifications to PPO, including Extended Generalized Advantage Estimation (EGAE), a method for evaluating incomplete responses during training, and a computationally optimised mechanism for independent policy and value model optimisation, ultimately accelerating the training of reasoning LLMs. The full author list is available in the article’s contributions section.

Large language models increasingly benefit from reinforcement learning, and Truncated Proximal Policy Optimisation (T-PPO) represents an advance in the techniques used to train them, demonstrably improving both training efficiency and performance across a range of complex reasoning tasks. T-PPO addresses limitations inherent in traditional Proximal Policy Optimisation (PPO), particularly the computational burden of processing lengthy generated sequences, through innovations in both algorithmic design and system architecture. PPO is a reinforcement learning algorithm used to train agents to make decisions within an environment so as to maximise a reward.
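To make the PPO objective concrete, the snippet below sketches the standard clipped surrogate loss in plain NumPy. It is a generic, minimal illustration of the algorithm named above, not the implementation used in the research, and the per-token log-probabilities and advantages are made-up numbers.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard clipped PPO surrogate, returned as a loss to minimise."""
    ratio = np.exp(logp_new - logp_old)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))   # pessimistic (clipped) bound

# Toy usage with invented per-token values.
logp_old   = np.array([-1.2, -0.7, -2.3])
logp_new   = np.array([-1.0, -0.9, -2.0])
advantages = np.array([ 0.5, -0.3,  1.1])
print(ppo_clip_loss(logp_new, logp_old, advantages))
```

The clipping keeps the updated policy close to the one that generated the data, which is what makes on-policy updates of this kind stable.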

T-PPO's efficiency gains centre on truncated batch training, in which extended interaction sequences are broken into smaller, manageable segments, significantly reducing memory requirements and accelerating training. This allows the system to handle considerably longer contexts than conventional methods, a crucial capability for tasks demanding multi-step reasoning and calculation. Complementing this is the Extended Generalised Advantage Estimation (EGAE) algorithm, which estimates advantages from incomplete responses, preserving the integrity of policy learning despite the truncation. Advantage estimation is a technique used in reinforcement learning to determine how much better a particular action is compared to the average action in a given state.
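The core idea behind estimating advantages for unfinished responses can be pictured as ordinary Generalised Advantage Estimation that bootstraps from the value of the last generated token instead of assuming the episode has ended. The sketch below is a simplified illustration of that idea, not the paper's exact EGAE formulation; the function name and the numbers are invented.

```python
import numpy as np

def truncated_gae(rewards, values, bootstrap_value, gamma=1.0, lam=0.95):
    """GAE over a partial rollout: the cut-off state bootstraps from its
    value estimate rather than being treated as terminal."""
    T = len(rewards)
    values = np.append(np.asarray(values, dtype=float), bootstrap_value)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running                  # discounted sum of deltas
        advantages[t] = running
    return advantages

# Toy usage: a response cut off after four generated tokens, with no reward yet.
print(truncated_gae(rewards=[0, 0, 0, 0],
                    values=[0.10, 0.20, 0.25, 0.30],
                    bootstrap_value=0.35))
```

Because the estimate only needs a value prediction at the truncation point, the policy can be updated before the full chain-of-thought has finished generating.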

Training requires a dataset of reasoning tasks, and the research outlines the characteristics of a suitable mathematical-reasoning dataset implied by its methodology, one that emphasises advanced reasoning beyond basic arithmetic. The dataset needs problems spanning arithmetic, algebra, geometry, and potentially calculus, with particular emphasis on word problems that must be translated into mathematical equations. Each input is a problem statement; the desired output is the solution together with a detailed, step-by-step derivation, so that the reasoning process itself can be learned and assessed.
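As an illustration of what such input-output pairs could look like, here is a hypothetical record layout; the field names and the example problem are invented for this sketch and are not taken from the paper.

```python
# Hypothetical record layout for a maths-reasoning dataset (field names invented).
example_record = {
    "problem": "A train travels 120 km in 1.5 hours. What is its average speed in km/h?",
    "solution_steps": [
        "Average speed equals distance divided by time.",
        "120 / 1.5 = 80",
    ],
    "final_answer": "80 km/h",
}
print(example_record["final_answer"])
```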

Creating such a dataset presents several challenges: generating diverse, genuinely difficult problems requires careful curation, and producing accurate, detailed step-by-step solutions is time-consuming and demands expertise. Ambiguity in the natural language of word problems must be avoided through clear, precise wording, so that the model receives unambiguous input. Evaluation also goes beyond checking the final answer: the validity of each step in the solution process needs to be assessed, promoting robust and reliable reasoning.
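One plausible way to check intermediate steps rather than only the final answer is to verify equation-like steps symbolically. The sketch below does this with SymPy (mentioned in the next paragraph); it is a simplified illustration of the idea, not the evaluation pipeline used in the research.

```python
from sympy import simplify, sympify

def step_is_valid(step: str) -> bool:
    """Return True if an 'lhs = rhs' step holds symbolically; purely textual
    steps without an '=' are skipped rather than checked in this sketch."""
    if "=" not in step:
        return True
    lhs, rhs = step.split("=", 1)
    try:
        return simplify(sympify(lhs) - sympify(rhs)) == 0
    except Exception:
        return False   # unparseable arithmetic counts as invalid here

solution_steps = ["120 / 1.5 = 80", "80 * 2 = 160", "160 - 15 = 140"]  # last step is wrong
print([step_is_valid(s) for s in solution_steps])   # [True, True, False]
```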

To facilitate dataset creation, the research suggests leveraging existing resources such as MathQA, GSM8K, SVAMP, ARES, and the SymPy symbolic mathematics library as a foundation for building a comprehensive dataset. SymPy is a Python library for symbolic mathematics that allows mathematical expressions to be manipulated programmatically. The proposed methodology also employs a computationally optimised mechanism that optimises the policy and value models independently, reducing redundant computation and accelerating training. This optimisation relies on selectively filtering prompts and truncated tokens, further improving efficiency and enabling faster iteration and experimentation.
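The selective filtering of prompt and truncated tokens can be pictured as a per-token mask applied when averaging the loss. The snippet below is a hedged illustration of that bookkeeping with assumed token flags; exactly which tokens feed the policy update versus the value update is a design choice of the method and is not reproduced here.

```python
import numpy as np

def filtered_mean_loss(per_token_loss: np.ndarray, keep: np.ndarray) -> float:
    """Average a per-token loss over only the tokens selected for this update."""
    return float(per_token_loss[keep].mean())

# Toy example: two prompt tokens followed by four generated tokens, the last
# two of which fall beyond the truncation window of the current update.
per_token_loss = np.array([0.9, 0.4, 0.2, 0.7, 0.5, 0.3])
is_prompt      = np.array([True, True, False, False, False, False])
past_window    = np.array([False, False, False, False, True, True])

keep = ~is_prompt & ~past_window        # generated tokens inside the window
print(filtered_mean_loss(per_token_loss, keep))
```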

Experimental results consistently demonstrate T-PPO's superior performance when benchmarked against existing reasoning systems, including DeepSeek-R1, QwQ-32B, and Kimi-K1.5. On the challenging GSM8K dataset, a benchmark of grade-school mathematical word problems, T-PPO achieves state-of-the-art results, surpassing previous models. This success highlights the importance of long-context handling for tasks demanding multi-step reasoning and calculation, and validates the design choices in T-PPO's architecture.

Beyond mathematical reasoning, T-PPO exhibits strong performance on code generation (HumanEval) and commonsense reasoning (Big-Bench Hard), indicating its versatility across diverse cognitive domains. Its scalability is further evidenced by its ability to train larger models effectively and to tackle increasingly complex tasks, positioning it as a valuable tool for advancing the capabilities of language-based artificial intelligence.

T-PPO's open-source availability on GitHub fosters collaboration and accelerates research within the field, enabling the community to build on its innovations. By providing an accessible and efficient platform for reinforcement learning, it allows researchers and developers to explore new frontiers in language-model reasoning and to build more capable, intelligent systems. The design prioritises problem diversity, detailed solution derivation, and careful attention to potential biases, creating a valuable resource for training and evaluating LLMs capable of advanced reasoning.

👉 More information
🗞 Truncated Proximal Policy Optimization
🧠 DOI: https://doi.org/10.48550/arXiv.2506.15050
