Scaling up deep reinforcement learning algorithms becomes increasingly challenging as training demands grow, and researchers Preston Fu, Oleh Rybkin, and Zhiyuan Zhou of the University of California, Berkeley, together with colleagues, now address this issue. Their work investigates how best to allocate computational resources, specifically network capacity and the frequency of updates, to maximise learning efficiency. The team demonstrates that simply increasing network size does not always yield better results and identifies a phenomenon called ‘TD-overfitting’, in which excessively large batches can hinder accuracy, particularly for smaller networks. The research provides practical guidelines for optimising compute usage in deep reinforcement learning, mirroring advances in supervised learning but tailored to the demands of temporal-difference learning, and offers a grounded starting point for scaling these algorithms effectively.
Performance improvements per unit of compute have been extensively studied for language modelling, but reinforcement learning (RL) has received comparatively less attention in this regard. This paper investigates compute scaling for online, value-based deep RL, focusing on two primary axes for compute allocation: model capacity and the update-to-data (UTD) ratio.
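To make the two allocation axes concrete, the following sketch shows one simplified way of accounting for training compute as a function of model size, batch size, and the UTD ratio. The cost model (roughly six FLOPs per parameter per sample for a forward and backward pass) is a common rule of thumb rather than anything reported in the paper, and the function and variable names are purely illustrative.

```python
# Simplified compute accounting for online, value-based deep RL.
# Assumption: one gradient step costs roughly 6 * (parameter count) FLOPs
# per sample in the batch; real costs depend on the architecture.

def training_flops(n_params: float, batch_size: int,
                   utd_ratio: float, env_steps: int) -> float:
    """Approximate total training FLOPs.

    utd_ratio is the number of gradient updates per environment step collected.
    """
    flops_per_update = 6.0 * n_params * batch_size
    n_updates = utd_ratio * env_steps
    return flops_per_update * n_updates

# Two ways to spend the same budget over one million environment steps:
# a small network updated often, or a larger network updated less often.
small_net_high_utd = training_flops(n_params=3e5, batch_size=256,
                                    utd_ratio=8, env_steps=1_000_000)
large_net_low_utd = training_flops(n_params=2.4e6, batch_size=256,
                                   utd_ratio=1, env_steps=1_000_000)
print(f"{small_net_high_utd:.2e} FLOPs vs {large_net_low_utd:.2e} FLOPs")
```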
Budget Allocation for Reinforcement Learning Agents
Researchers have investigated how best to allocate computational resources when training reinforcement learning agents. The study focuses on balancing model size against the update-to-data (UTD) ratio, the number of gradient updates performed per environment step collected, to maximise learning efficiency under a fixed computational budget, a problem that has received far less attention in reinforcement learning than in supervised learning. The work establishes a framework for scaling training effectively, mirroring approaches used in supervised learning while adapting them to the unique characteristics of reinforcement learning. The findings reveal a coupled relationship between model size, batch size (the number of experiences processed in each update), and the UTD ratio.
A key discovery is the identification of “TD-overfitting”, a regime in which increasing the batch size actually reduces performance for smaller models: larger batches push a low-capacity network to fit its training experiences too closely rather than generalise to new situations. The effect lessens as model size grows, so larger batch sizes can be used effectively at scale. Specifically, smaller models trained with larger batches overfit, with errors plateauing or even increasing, whereas larger models can leverage larger batch sizes to accelerate learning without significant performance degradation.
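The interaction is easiest to see in the structure of a value-based training loop, where batch size and the UTD ratio enter at different points. The sketch below is a minimal illustration with a linear Q-function and randomly generated transitions standing in for a real network and environment; it is not the authors' implementation.

```python
import numpy as np

# Minimal value-based TD loop showing where batch size and the UTD ratio
# enter training. A linear Q-function and random transitions stand in for
# a real agent and environment (illustrative only).

rng = np.random.default_rng(0)
obs_dim, n_actions = 8, 4
W = np.zeros((obs_dim, n_actions))      # Q(s, a) = s @ W, the "model"
gamma, lr = 0.99, 1e-2
batch_size, utd_ratio = 64, 4           # the two knobs discussed above

replay = []                              # (s, a, r, s') transitions
for env_step in range(2_000):
    s = rng.normal(size=obs_dim)
    a = rng.integers(n_actions)
    r = rng.normal()
    s_next = rng.normal(size=obs_dim)
    replay.append((s, a, r, s_next))

    # UTD ratio: number of gradient updates per environment step collected.
    for _ in range(utd_ratio):
        idx = rng.integers(len(replay), size=batch_size)
        S, A, R, S2 = map(np.array, zip(*[replay[i] for i in idx]))
        td_target = R + gamma * (S2 @ W).max(axis=1)             # bootstrapped target
        td_error = td_target - (S @ W)[np.arange(batch_size), A]
        # Semi-gradient TD update averaged over the sampled batch.
        for j in range(batch_size):
            W[:, A[j]] += lr * td_error[j] * S[j] / batch_size
```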
This suggests that the optimal batch size is not fixed but depends on the capacity of the model, a crucial insight for efficient training. The team validated these findings on a range of challenging control tasks and benchmark environments and found that the relationship between model size and optimal batch size held consistently across tasks, supporting the generalisability of the guidance. The results indicate that practitioners can substantially improve training efficiency by tuning model size and batch size jointly, taking the characteristics of their reinforcement learning problem into account, as sketched below. This work provides a grounded starting point for compute-optimal scaling in deep reinforcement learning, offering practical guidelines for allocating resources and maximising learning performance.
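Because the best batch size shifts with capacity, one practical consequence is to tune batch size jointly with model size rather than fixing it once. The scaffold below illustrates such a joint sweep; `train_and_evaluate` is a hypothetical placeholder for an actual training run, not a function from the paper or any specific library.

```python
# Joint sweep of model width and batch size: the best batch size is
# selected per model size instead of being fixed globally.

def train_and_evaluate(width: int, batch_size: int) -> float:
    """Return a performance score for one (width, batch size) setting."""
    return 0.0  # hypothetical placeholder: replace with a real training run

widths = [256, 512, 1024, 2048]
batch_sizes = [64, 128, 256, 512]

best_batch_for_width = {}
for w in widths:
    scores = {b: train_and_evaluate(w, b) for b in batch_sizes}
    best_batch_for_width[w] = max(scores, key=scores.get)
print(best_batch_for_width)
```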
Scaling Laws and TD-Overfitting in Reinforcement Learning
This research establishes scaling laws for value-based deep reinforcement learning, offering a pathway to optimise computational resources during training. The study investigates how to best allocate compute between model size and the update-to-data ratio, revealing a nuanced interplay between these factors and batch size. Importantly, the team identified a phenomenon called TD-overfitting, where increasing batch size can harm accuracy for smaller models, but this effect diminishes as models scale, allowing for effective use of larger batches. The findings provide practical guidelines for choosing batch size, update-to-data ratio, and model size to maximise sample efficiency and achieve compute-optimal scaling, mirroring advances seen in supervised learning but adapted to the specific challenges of reinforcement learning.
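Stated compactly, the allocation problem these guidelines address can be written as a constrained optimisation over model size, batch size, and the UTD ratio. The formalisation below is an illustrative summary consistent with the description above, not a law reported by the authors; here $N$ is the parameter count, $B$ the batch size, $\sigma$ the UTD ratio, and $S$ the number of environment steps.

```latex
% Illustrative formalisation; not the paper's fitted scaling law.
\begin{align*}
  \text{Total training compute:}\quad
    & C \;\propto\; N \cdot B \cdot \sigma \cdot S \\
  \text{Compute-optimal allocation:}\quad
    & (N^{*}, B^{*}, \sigma^{*}) \;=\;
      \operatorname*{arg\,max}_{N,\, B,\, \sigma}\;
      \mathrm{Performance}(N, B, \sigma \mid S)
      \quad \text{s.t.}\quad N \cdot B \cdot \sigma \cdot S \le C_{\max} \\
  \text{TD-overfitting observation:}\quad
    & B^{*}(N) \text{ grows with } N \text{, i.e.\ small models favour small batches.}
\end{align*}
```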
While the research focused on challenging simulated robotic tasks, the authors acknowledge limitations in the number of hyperparameters explored due to computational constraints. Future work will expand the investigation to include additional hyperparameters like learning rate and extend the analysis to large-scale domains such as visual and language processing using even larger models. This work represents a step towards training reinforcement learning methods at a scale comparable to other modern machine learning approaches.
👉 More information
🗞 Compute-Optimal Scaling for Value-Based Deep RL
🧠 ArXiv: https://arxiv.org/abs/2508.14881
