Researchers continue to grapple with the problem of efficient exploration in reinforcement learning, especially in environments with sparse rewards. Akshay Mete, Shahid Aamir Sheikh, and Tzu-Hsiang Lin, all from the Department of Electrical and Computer Engineering at Texas A&M University, together with Dileep Kalathil and P. R. Kumar, present a framework called Optimistic World Models (OWMs) that addresses this challenge. Their work introduces a scalable method for optimistic exploration that brings principles from reward-biased maximum likelihood estimation (RBMLE) into reinforcement learning. Unlike conventional upper confidence bound approaches, OWMs embed optimism directly into the model itself, biasing predicted outcomes towards more rewarding scenarios; when implemented within advanced world model architectures such as DreamerV3 and STORM, this yields substantial gains in sample efficiency and cumulative return.
The core innovation lies in a fully gradient-based loss function that requires neither uncertainty estimates nor constrained optimisation, streamlining the training process.
This approach is designed to be readily integrated with existing world model frameworks, preserving scalability while demanding only minimal adjustments to standard training procedures.
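To make the idea concrete, the following is a minimal, hypothetical PyTorch-style sketch of how such a reward-biased term could be added to a world-model training objective. The interface names (world_model.loss, predict_next_latent, predict_reward) and the exact form of the optimism term are illustrative assumptions rather than the authors' implementation; the point is that everything stays differentiable, with no uncertainty estimates or constrained optimisation.

```python
import torch


def optimistic_world_model_loss(world_model, batch, alpha=0.01):
    """Hypothetical training objective: the usual world-model loss plus a
    reward-biased (RBMLE-style) term, optimised end to end by gradient descent."""
    # Standard world-model terms (reconstruction, reward, dynamics/KL), as in
    # Dreamer-style training; the details depend on the backbone.
    wm_loss, posterior = world_model.loss(batch)  # assumed API

    # Optimism term: nudge the learned transition model towards next states
    # that the model itself predicts to be rewarding.
    next_latent_dist = world_model.predict_next_latent(posterior, batch["action"])
    next_latent = next_latent_dist.rsample()       # reparameterised sample keeps gradients
    predicted_reward = world_model.predict_reward(next_latent)

    # Maximising predicted reward under the model = subtracting it from the loss,
    # weighted by the optimism coefficient alpha.
    optimism_loss = -predicted_reward.mean()
    return wm_loss + alpha * optimism_loss
```

A single optimiser step on this combined loss trains exactly the same networks as the baseline; only the extra, fully differentiable term changes, which is what keeps the method scalable.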
These gains in sample efficiency and return are showcased across benchmarks including Private Eye, Enduro, and Montezuma’s Revenge, as illustrated in Figure 1, highlighting the potential for more effective learning in complex environments. The team also draws a parallel between current world model frameworks and the certainty equivalence principle from adaptive control theory, identifying the closed-loop identification problem as a fundamental reason why robust exploration strategies are needed. The research directly implements RBMLE within a deep model-based reinforcement learning framework, making the technique applicable to large-scale problems that were previously out of its reach.
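For readers unfamiliar with RBMLE: in its classical adaptive-control form, the estimator tilts the maximum-likelihood criterion towards models that promise higher optimal reward, schematically (with a slowly growing bias weight α_t > 0):

```latex
% Schematic form of the classical reward-biased maximum likelihood estimate:
% the log-likelihood of the observed history h_t is biased by the optimal
% return J^*(\theta) attainable under the candidate model \theta.
\hat{\theta}_t \in \arg\max_{\theta} \Big[ \log p_{\theta}(h_t) + \alpha_t \, J^{*}(\theta) \Big]
```

OWMs can be read as a gradient-based, deep-learning analogue of this bias, with the optimism weight in the dynamics loss playing the role of α_t.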
This involved augmenting standard world model training with an optimistic dynamics loss, which biases predicted transitions towards outcomes yielding higher rewards. Specifically, the study instantiated OWMs within two established world model architectures, creating Optimistic DreamerV3 and Optimistic STORM.
These models were trained using identical neural network architectures to their baseline counterparts, ensuring a fair comparison focused solely on the impact of the optimistic dynamics loss. The optimistic loss gently modifies transition probabilities, encouraging the world model to generate more favourable imagined trajectories during planning.
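Because only the loss changes, the same optimism term can in principle wrap any backbone that exposes a dynamics loss and a reward predictor. The interface below, reusing the loss function sketched earlier, is a hypothetical illustration of this "minimal adjustment" property and not the authors' code.

```python
import torch
from typing import Protocol


class WorldModel(Protocol):
    """Assumed minimal interface shared by Dreamer-style and STORM-style backbones."""
    def loss(self, batch: dict) -> tuple[torch.Tensor, torch.Tensor]: ...
    def predict_next_latent(self, posterior: torch.Tensor,
                            action: torch.Tensor) -> torch.distributions.Distribution: ...
    def predict_reward(self, latent: torch.Tensor) -> torch.Tensor: ...


def train_step(world_model: WorldModel, optimizer: torch.optim.Optimizer,
               batch: dict, alpha: float) -> float:
    """One gradient step on the combined baseline-plus-optimism objective."""
    loss = optimistic_world_model_loss(world_model, batch, alpha)  # defined in the sketch above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```

Swapping DreamerV3 for STORM (or another backbone) then only requires satisfying this interface; the optimisation loop and the extra term stay untouched, mirroring the fair-comparison setup described above.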
This approach avoids the computational complexities associated with upper confidence bound (UCB)-style exploration, such as non-convex constraints and the need for explicit uncertainty estimates. Performance was evaluated across several benchmarks, including Private Eye, Enduro, Montezuma’s Revenge, and Cartpole Swingup Sparse.
Experiments tracked cumulative return over 20 to 40 million steps, demonstrating significant improvements with the OWM variants over their baselines on the Atari100K benchmark and, in particular, in sparse-reward environments, where the study reports substantial gains in sample efficiency and overall performance compared to standard world models; the headline numbers are summarised below.
Enhanced performance in sparse-reward environments using optimistic reinforcement learning
Optimistic DreamerV3 achieves a mean human-normalized score of 152.68% on the Atari100K benchmark, representing a substantial improvement over the 97.45% attained by DreamerV3. On sparse-reward environments within Atari100K, Optimistic DreamerV3 demonstrates gains of up to 268% compared to DreamerV3 across various games.
Specifically, performance increases of 1735% were observed on Private Eye, alongside gains of 125% on Frostbite and 45% on Krull. Optimistic STORM (O-STORM) also performs well, achieving a mean human-normalized score of 80.68% compared with STORM’s 75.90%. Notably, O-STORM achieved a positive score on Freeway, a game on which STORM, DreamerV3, and Optimistic DreamerV3 (O-DreamerV3) all failed to score above zero.
In the DeepMind Control suite, Optimistic DreamerV3 improved performance on sparse-reward environments such as Cartpole Swingup Sparse and Acrobot Swingup Sparse. Experiments on the DMC Proprio benchmark reveal that O-DreamerV3 achieves a 312% gain over DreamerV3 on Acrobot Swingup Sparse. Furthermore, on the DMC Vision benchmark, O-DreamerV3 demonstrates a 172% improvement on Acrobot Swingup Sparse.
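For context on how the aggregate percentages above are computed: human-normalized score is conventionally defined per game relative to random-play and human reference scores and then averaged across games. The snippet below uses that standard definition with made-up per-game values purely to show the arithmetic; it does not reproduce the paper's data.

```python
def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Standard per-game human-normalized score, as a percentage."""
    return 100.0 * (agent - random) / (human - random)


# Hypothetical raw scores per game: (agent, random, human) -- illustrative only.
games = {
    "GameA": (1200.0, 100.0, 900.0),
    "GameB": (300.0, 20.0, 700.0),
}
mean_hns = sum(human_normalized_score(*v) for v in games.values()) / len(games)


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of one aggregate score over another, in percent."""
    return 100.0 * (new - old) / old


# The reported 152.68% vs. 97.45% means differ by about 55 normalized-score
# points, i.e. roughly a 57% relative improvement.
gain = relative_gain(152.68, 97.45)
```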
Ablation studies on Cartpole Swingup Sparse indicate that the optimism parameter α requires careful tuning, with performance declining at a value of 0.1. The research also shows that incorporating an entropy loss is beneficial: returns improve compared with a model trained without it. This approach integrates optimistic principles from adaptive control, specifically reward-biased maximum likelihood estimation, directly into the model through an optimistic dynamics loss.
By encouraging imagined transitions towards more rewarding outcomes, the framework facilitates more efficient learning without requiring explicit uncertainty estimation or complex optimisation procedures. The framework’s adaptability allows for integration with various world model designs, offering a versatile solution for improving reinforcement learning performance.
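Given the ablation's message that the optimism weight needs care, one plausible (purely illustrative) way to organise such a study is a small sweep over candidate values of α, with the entropy regulariser kept as a separate coefficient. The names and default values below are assumptions; only the observation that α = 0.1 hurt returns on Cartpole Swingup Sparse comes from the reported ablation.

```python
from dataclasses import dataclass


@dataclass
class OptimismConfig:
    """Illustrative hyperparameters for the optimistic dynamics loss."""
    alpha: float = 0.01          # optimism weight; the ablation reports degraded returns at 0.1
    entropy_coef: float = 0.01   # entropy regulariser, found beneficial in the ablation
    anneal_alpha: bool = False   # a schedule or meta-controller for alpha is flagged as future work


# A simple grid of candidate optimism weights for an ablation-style sweep.
sweep = [OptimismConfig(alpha=a) for a in (0.001, 0.003, 0.01, 0.03, 0.1)]
```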
Acknowledging limitations, the authors note that further analysis is needed to establish the convergence properties of this gradient-based approach to reward-biased maximum likelihood estimation. Future research may also focus on refining the design of the exploration parameter, potentially through the use of meta-controllers, to further enhance performance.
👉 More information
🗞 Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2602.10044
