Reinforcement Learning Advances: Benchmark Reveals Memory Rewriting Crucial for Partial Observability

Scientists are increasingly recognising that effective reinforcement learning demands more than simply remembering past experiences. Oleg Shchendrigin, Egor Cherepanov, and Alexey K. Kovalev from MIRIAI, alongside Aleksandr I. Panov, demonstrate this with new research exposing a critical flaw in current memory-augmented RL agents: a surprising inability to rewrite memories effectively, despite excelling at retention. Their work introduces a novel benchmark specifically designed to test continual memory updating under realistic, partially observable conditions, revealing that while recurrent models exhibit surprising robustness, modern structured and transformer-based memories often struggle beyond basic recall tasks. This finding is significant because it highlights a fundamental limitation in how RL agents learn and adapt, paving the way for more flexible and intelligent systems capable of balancing stable knowledge with the crucial ability to forget and relearn.

Continual Learning Needs Adaptive Memory Updating

Scientists have demonstrated a critical gap in the ability of current reinforcement learning agents to adapt to changing environments, revealing that memory retention alone is insufficient for effective decision-making. The research introduces new benchmarks designed to specifically test an agent’s capacity for continual memory updating under conditions of partial observability. This work highlights the necessity of balancing stable memory retention with the adaptive overwriting of outdated information, a capability often overlooked in existing benchmarks and agent architectures. To address this challenge, the team developed a novel benchmark comprising the Endless T-Maze and Color-Cubes environments, which isolate the ability to perform continual, selective memory updates beyond simple cue retention.
Endless T-Maze presents sequential corridors where each new cue immediately invalidates the previous one, demanding active memory overwriting, while Color-Cubes, available in Trivial, Medium, and Extreme variants, features stochastically teleporting coloured cubes, forcing agents to constantly update their internal map and disregard stale information. Through these tasks, researchers systematically evaluated three distinct families of memory-augmented RL agents: recurrent policies, transformer-based architectures, and structured external memories, providing a detailed characterisation of their strengths and limitations.

Experiments reveal a surprising result: classic recurrent models, despite their relative simplicity, exhibit greater flexibility and robustness in memory rewriting tasks than modern structured memories, which succeed only under limited conditions, and transformer-based agents, which frequently fail beyond basic retention scenarios. This finding exposes a fundamental limitation of current approaches, suggesting that the architectural design of memory mechanisms significantly impacts their ability to adapt to dynamic environments.
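To make the rewriting demand concrete, here is a minimal, illustrative sketch of the Endless T-Maze dynamics in plain Python. This is not the authors' implementation: the class name, observation encoding, and reward values are all assumptions made for illustration. What it shows is the essential mechanic, namely that each new corridor issues a fresh cue that invalidates the old one, so an agent that merely retains its first cue will fail at every later junction.

```python
import random

class EndlessTMaze:
    """Minimal sketch of the Endless T-Maze idea: a chain of corridors in
    which each junction's correct turn is dictated by the most recently
    shown cue, so earlier cues must be actively overwritten in memory.
    Observation encoding and rewards here are illustrative, not the
    paper's specification.
    """

    def __init__(self, corridor_length=3, num_corridors=5, seed=None):
        self.corridor_length = corridor_length
        self.num_corridors = num_corridors
        self.rng = random.Random(seed)

    def reset(self):
        self.corridor, self.pos = 0, 0
        self.cue = self.rng.choice([0, 1])      # 0 = left, 1 = right
        return self._obs(show_cue=True)         # cue visible only at the entrance

    def step(self, action):
        """Actions: 0 = turn left, 1 = turn right (ignored mid-corridor,
        where the agent simply advances one step in this sketch)."""
        if self.pos < self.corridor_length:     # still walking the corridor
            self.pos += 1
            return self._obs(show_cue=False), 0.0, False
        correct = (action == self.cue)          # junction: recall the CURRENT cue
        reward = 1.0 if correct else -1.0
        self.corridor += 1
        if correct and self.corridor < self.num_corridors:
            self.pos = 0
            self.cue = self.rng.choice([0, 1])  # fresh cue invalidates the old one
            return self._obs(show_cue=True), reward, False
        return self._obs(show_cue=False), reward, True

    def _obs(self, show_cue):
        # (cue, or -1 when hidden; flag for standing at a junction)
        return (self.cue if show_cue else -1, int(self.pos == self.corridor_length))
```

Because the cue is hidden for the entire length of each corridor, the task is genuinely partially observable, and because a new cue arrives every corridor, pure retention is never enough.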

The study establishes that explicit, adaptive forgetting mechanisms, such as learnable forget gates, are more effective at facilitating successful memory rewriting than cached-state or rigidly structured memories. This research not only highlights an overlooked challenge in reinforcement learning but also provides valuable insights for designing future RL agents capable of explicit and trainable forgetting. The introduced benchmarks offer a standardised method for evaluating memory mechanisms in partially observable tasks, paving the way for more robust and adaptable artificial intelligence systems. Across the experiments, recurrent models, despite their simplicity, demonstrated greater flexibility and robustness in memory rewriting than modern structured memories and transformer-based agents, with the Endless T-Maze serving as the primary probe of an agent's ability to update its memory in increasingly complex scenarios.
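To see why a learnable forget gate matters, consider the standard LSTM cell update, sketched below in PyTorch. This is the textbook cell, not code from the paper; the point is that the forget gate f is a trained, input-dependent signal that can actively erase old state in place, which is exactly the operation the rewriting tasks demand.

```python
import torch
import torch.nn as nn

class MinimalLSTMCell(nn.Module):
    """A textbook LSTM step, spelled out to highlight the learnable
    forget gate; this is the standard cell, not the authors' code."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # One linear map yields all four gate pre-activations at once.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, h, c):
        z = self.gates(torch.cat([x, h], dim=-1))
        i, f, g, o = z.chunk(4, dim=-1)
        f = torch.sigmoid(f)           # forget gate: 0 = erase, 1 = keep
        i = torch.sigmoid(i)           # input gate: how much new content to write
        c = f * c + i * torch.tanh(g)  # old memory is overwritten, not appended to
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```

A transformer's cached key-value memory, by contrast, never modifies past entries; it can only attend over them, which is one plausible reading of why rewriting-heavy tasks favour gated recurrence.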

Data shows that PPO-LSTM, SHM, and FFM agents achieved a perfect success rate of 1.00 ± 0.00 in the simplest T-Maze task (n = 1), which requires only initial cue writing and retention. GTrXL and MLP agents, however, struggled, with success rates around 50%, roughly chance level for a binary left-or-right choice, indicating an inability to solve even this task reliably. In the Trivial Color-Cubes case, PPO-LSTM achieved 0.52 ± 0.10, while FFM, GTrXL, and SHM demonstrated perfect success, highlighting varying retention capabilities across architectures. These results confirm that the memory mechanisms within PPO-LSTM, SHM, and FFM can retain the necessary information within the experimental parameters.

Further analysis focused on the agents' ability to cope with memory rewriting, using Endless T-Maze tasks with corridor lengths exceeding 1. The PPO-LSTM agent achieved complete success in most Endless T-Maze tasks, while FFM succeeded only in the fixed sampling mode, where it recorded a perfect 1.00 ± 0.00. GTrXL and MLP consistently failed to produce meaningful results once memory rewriting was required. Measurements confirm that PPO-LSTM is robust enough to succeed in both predictable (Fixed) and stochastic (Uniform) settings, whereas the rewriting capabilities of FFM and SHM are limited to predictable scenarios.
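One way to picture the two settings, assuming "Fixed" holds the corridor length constant while "Uniform" redraws it for each corridor (an illustrative reading of the evaluation modes, not the paper's exact protocol), is a small sampling helper like the following, reusable with the EndlessTMaze sketch above:

```python
import random

def corridor_length(mode, fixed_len=3, max_len=6, rng=random):
    """'fixed': the same corridor length every time, so cue timing is
    predictable; 'uniform': a fresh length per corridor, so the agent
    cannot anticipate when the next memory overwrite will be needed."""
    return fixed_len if mode == "fixed" else rng.randint(1, max_len)

# e.g. EndlessTMaze(corridor_length=corridor_length("uniform"), num_corridors=5)
```

Under this reading, an agent whose rewriting only works on a predictable schedule would pass Fixed but fail Uniform, matching the FFM and SHM pattern reported above.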

Scientists recorded the intermediate progress of SHM and GTrXL agents, revealing that both could partially navigate several corridors before failing, indicating a limited short-term memory rewriting capacity. Tests show that the baselines' ability to handle memory rewriting is highly dependent on environmental predictability, with PPO-LSTM demonstrating superior generalisation capabilities. The work delivers insights for designing future RL agents with explicit and trainable forgetting mechanisms, addressing a previously overlooked challenge in artificial intelligence.

LSTM Flexibility Beats Complex Memory Approaches in Many Settings

Scientists have identified a critical gap in reinforcement learning: the ability of agents to not only retain information but also to adaptively rewrite memories as environments change. Researchers introduced a new benchmark designed to specifically test continual memory updating under conditions of partial observability, where agents must rely on stored memories rather than immediate sensory input. Their experiments compared recurrent neural networks, transformer-based models, and structured memory mechanisms, revealing that simpler recurrent models, specifically LSTMs, exhibit greater flexibility and robustness in memory rewriting compared to more complex structured memories and transformer agents. This work demonstrates that current reinforcement learning approaches often struggle with balancing stable memory retention and adaptive updating, a limitation highlighted by the failure of most baseline models on the new benchmark.

The findings underscore the importance of developing memory mechanisms capable of efficiently managing both the storage and overwriting of information, particularly in dynamic environments requiring continual learning. Ablation studies focusing on LSTM architecture revealed that gating mechanisms play a crucial role in successful memory rewriting, with RNN and GRU models achieving limited success even in simpler scenarios. The authors acknowledge that the benchmark focuses on specific partial observability settings and may not fully capture the complexities of all real-world environments. Future research should explore more diverse and challenging scenarios to further evaluate memory rewriting capabilities. Additionally, the team suggests investigating explicit and trainable forgetting mechanisms to enhance the adaptability of reinforcement learning agents, potentially leading to more robust and intelligent systems.
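A minimal way to picture that ablation, assuming standard PyTorch recurrent modules stand in for the paper's cell variants (the helper name and layer sizes are placeholders), is to swap the memory core of an otherwise identical policy, as sketched below:

```python
import torch.nn as nn

def make_memory_core(kind: str, input_size: int = 32, hidden_size: int = 64):
    """Illustrative ablation harness (hypothetical helper, sizes are
    placeholders): swap the recurrent core of a policy to probe which
    gating structure supports memory rewriting."""
    cores = {
        "rnn": nn.RNN,    # no gating: old state is never explicitly erased
        "gru": nn.GRU,    # a single update gate couples keeping and writing
        "lstm": nn.LSTM,  # a dedicated forget gate allows explicit erasure
    }
    return cores[kind](input_size, hidden_size, batch_first=True)
```

Training the same policy around each core on the same tasks isolates the contribution of the gating structure itself, which is the comparison the reported ablation draws between RNN, GRU, and LSTM variants.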

👉 More information
🗞 Memory Retention Is Not Enough to Master Memory Tasks in Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2601.15086

Rohail T.

As a quantum scientist exploring the frontiers of physics and technology, I focus on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
