TextAtari establishes a benchmark for evaluating agents on extended sequential decision-making using textual descriptions of Atari game states. Evaluations of three large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) reveal substantial performance gaps relative to human players in tasks requiring planning over tens of thousands of steps, exposing limitations in long-term reasoning and state management.
The capacity of artificial intelligence systems to plan and execute strategies over extended timescales remains a significant challenge. Researchers are now evaluating progress in this area with a new benchmark, TextAtari, which translates the visual complexity of classic Atari games into textual descriptions and demands that agents reason and act over sequences of up to 100,000 steps. This approach allows large language models’ capabilities in sequential decision-making to be assessed without the need for visual processing. The work, detailed in a paper titled ‘TextAtari: 100K Frames Game Playing with Language Agents’, is a collaboration between Wenhao Li, Wenwu Li, and Di Wu from Tongji University; Chuyun Shen, Zixiao Huang, and Xiangfeng Wang from East China Normal University; Junjie Sheng from Huawei Cloud, Huawei Technologies Co., Ltd.; Yun Hua from Shanghai Jiao Tong University; Wei Yin from Bank of Communications; Hongyuan Zha from The Chinese University of Hong Kong, Shenzhen; and Bo Jin from Tongji University.
TextAtari Benchmark Evaluates Long-Term Planning in AI Agents
Researchers have introduced TextAtari, a new benchmark designed to rigorously assess artificial intelligence agents in extended, sequential decision-making tasks. The benchmark transforms the visual information of classic Atari games into textual descriptions, creating a challenging environment that integrates sequential problem-solving with natural language processing (NLP).
Comprising nearly 100 distinct tasks, each varying in complexity, available actions, and planning horizon, TextAtari establishes a standardised platform for evaluating the performance of language-based agents in complex scenarios. This addresses a critical need for benchmarks that move beyond short-term decision-making and accurately measure an agent’s ability to reason and plan over extended periods.
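As a rough illustration of how one such task specification might be organised, the sketch below uses hypothetical field names and values; it is not TextAtari's actual schema.

```python
# Hypothetical sketch of a task specification; field names and values are
# illustrative assumptions, not TextAtari's actual schema.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    game: str                # underlying Atari game, e.g. "Pong"
    scenario: str            # "Basic", "Obscured", "Manual Augmentation", or "Reference-based"
    action_space: list[str]  # legal actions exposed to the agent
    horizon: int             # maximum number of decision steps

example_task = TaskSpec(
    game="Pong",
    scenario="Basic",
    action_space=["NOOP", "UP", "DOWN"],
    horizon=100_000,
)
print(example_task)
```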
The researchers employed an unsupervised learning framework, AtariARI, to render the Atari games as text. This allows evaluation using large language models (LLMs) – sophisticated AI models trained on vast amounts of text data – and facilitates a deeper understanding of their capabilities in complex environments. The approach enables a direct assessment of how well LLMs can interpret textual information and translate it into effective long-term strategies.
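As a minimal sketch (not the authors' implementation), labelled state variables of the kind an AtariARI-style annotator exposes could be rendered as text roughly as follows; the variable names are assumptions.

```python
# Minimal sketch: turning labelled state variables into a textual observation
# for a language agent. The variable names below are illustrative assumptions.

def state_to_text(labels: dict[str, int]) -> str:
    """Render a dictionary of annotated state values as a plain-English description."""
    parts = [f"{name.replace('_', ' ')} is {value}" for name, value in sorted(labels.items())]
    return "Current game state: " + "; ".join(parts) + "."

# Example with hypothetical Pong-like annotations:
example = {"player_y": 110, "enemy_y": 85, "ball_x": 60, "ball_y": 92, "player_score": 3}
print(state_to_text(example))
# Current game state: ball x is 60; ball y is 92; enemy y is 85; player score is 3; player y is 110.
```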
A systematic evaluation was conducted using three open-source LLMs – Qwen2.5-7B, Gemma-7B, and Llama3.1-8B – utilising three distinct agent frameworks: zero-shot prompting (where the model receives only the task description), few-shot chain-of-thought reasoning (where the model is provided with a few examples of reasoning steps), and reflection reasoning (where the model reflects on its previous actions to improve future performance). Four specific scenarios – Basic, Obscured, Manual Augmentation, and Reference-based – were used to probe the impact of semantic understanding, instruction following, and the incorporation of expert demonstrations on agent decision-making.
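To make the three framings concrete, the sketch below shows how the prompts might differ; the wording and helper names are assumptions rather than the benchmark's actual templates.

```python
# Illustrative prompt builders for the three agent frameworks described above;
# the wording is an assumption, not the benchmark's actual templates.

def zero_shot_prompt(task: str, state_text: str, actions: list[str]) -> str:
    # The model sees only the task description and the current textual state.
    return (f"{task}\n{state_text}\n"
            f"Choose one action from {actions}. Answer with the action name only.")

def few_shot_cot_prompt(task: str, state_text: str, actions: list[str],
                        examples: list[str]) -> str:
    # Worked reasoning examples are prepended, then the model is asked to reason step by step.
    shots = "\n\n".join(examples)
    return (f"{shots}\n\n{task}\n{state_text}\n"
            f"Think step by step, then choose one action from {actions}.")

def reflection_prompt(task: str, state_text: str, actions: list[str],
                      reflections: list[str]) -> str:
    # The agent's own critiques of earlier episodes are fed back before it acts.
    memory = "\n".join(f"- {r}" for r in reflections) or "- (no reflections yet)"
    return (f"{task}\nLessons from previous attempts:\n{memory}\n"
            f"{state_text}\nChoose one action from {actions}.")
```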
Results demonstrate a substantial performance gap between current agents and human players in tasks requiring extensive planning. The study identifies key challenges in sequential reasoning, in maintaining accurate state tracking over many time steps, and in sustaining strategic planning across tens of thousands of steps. While the models exhibit some capacity for short-term decision-making, they fail to maintain coherent strategies over extended horizons, and state tracking degrades further in partially observable settings, underscoring the need for methods that preserve a consistent understanding of the environment over time.
TextAtari provides standardised evaluation protocols and baseline implementations, facilitating further research at the intersection of LLMs and long-term planning. The benchmark gives researchers a common basis for comparing approaches and tracking progress over time.
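As a rough illustration of what such a protocol might look like in practice, the loop below evaluates an agent over a long horizon; `env` and `query_llm` are hypothetical stand-ins, not the benchmark's actual API.

```python
# Hedged sketch of a long-horizon evaluation loop; `env` and `query_llm` are
# hypothetical stand-ins for a text-based environment and an LLM call.

def parse_action(reply: str, actions: list[str]) -> str:
    """Pick the first legal action named in the model's reply, defaulting to the first action."""
    for action in actions:
        if action.lower() in reply.lower():
            return action
    return actions[0]

def evaluate(env, query_llm, horizon: int = 100_000) -> float:
    total_reward = 0.0
    state_text = env.reset()  # textual description of the initial state
    for _ in range(horizon):
        prompt = (f"{env.task_description}\n{state_text}\n"
                  f"Choose one action from {env.actions}.")
        action = parse_action(query_llm(prompt), env.actions)
        state_text, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```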
Future work should focus on developing agents capable of more effective long-term credit assignment – accurately attributing rewards to actions taken many steps prior. Exploration of hierarchical reinforcement learning techniques – which decompose complex tasks into smaller, more manageable sub-goals – also presents a promising avenue for improvement. Furthermore, research into methods for enhancing the robustness of agents to noisy or ambiguous textual descriptions is crucial for real-world applicability.
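As a worked illustration of why long-term credit assignment is hard, the snippet below shows how a standard discounted return shrinks the credit assigned to an early action as the delay to the reward grows; the discount factor and numbers are illustrative only.

```python
# Worked example: with discount factor gamma, the credit assigned to an action
# for a reward received t steps later scales as gamma**t, which vanishes over
# the horizons studied here. Numbers are illustrative only.

def discounted_return(rewards: list[float], gamma: float = 0.99) -> float:
    """Discounted sum of future rewards credited to the action at step 0."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A reward of 1.0 that arrives after 9 intervening zero-reward steps is still
# worth about 0.914 to the action; the same reward after 9,999 steps is worth
# roughly 2e-44.
print(discounted_return([0.0] * 9 + [1.0]))       # ~0.914
print(discounted_return([0.0] * 9_999 + [1.0]))   # ~2e-44
```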
The benchmark thus sets a new standard for evaluating agents on long-horizon decision-making, recasting classic Atari games as text-based challenges that couple sequential decision-making with natural language understanding.
👉 More information
🗞 TextAtari: 100K Frames Game Playing with Language Agents
🧠 DOI: https://doi.org/10.48550/arXiv.2506.04098
