ComboBench Evaluates LLMs’ Ability to Translate 262 Virtual Reality Actions into Device Manipulations

The challenge of translating intentions into precise physical actions lies at the heart of playing virtual reality games, a task humans accomplish with natural ease. Shuqing Li, Jiayi Yan, Chenyu Niu, and colleagues from the Chinese University of Hong Kong, along with Yun Peng and Wenxuan Wang, investigate whether large language models can bridge this gap between semantic understanding and device manipulation. They introduce ComboBench, a new benchmark designed to rigorously evaluate how well LLMs translate high-level game actions into the specific device-manipulation sequences required to act within virtual environments, drawn from popular games such as Half-Life: Alyx and Moss: Book II. The research demonstrates that while models such as Gemini-1.5-Pro exhibit promising task decomposition skills, they still fall short of human performance, particularly in procedural reasoning and spatial awareness, revealing key areas for improvement in artificial intelligence’s ability to interact with virtual worlds.

LLMs Reason About Virtual Reality Interactions

This research details a comprehensive investigation into the capabilities and limitations of large language models (LLMs) in understanding and reasoning about interactions within virtual reality (VR) environments. The study explores how well LLMs can translate semantic actions (meaning-based instructions) into the precise physical manipulations required within a VR setting, and identifies key areas where current models fall short. Researchers conducted experiments across four diverse VR games: Vivecraft, Half-Life: Alyx, Blade and Sorcery, and Job Simulator. They defined complex action sequences, such as picking up an object, aiming at a target, and throwing it, then tasked LLMs with generating the steps needed to perform these sequences.
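To make the task concrete, here is a minimal sketch of how one such query might be framed for a model. The prompt wording, output format, and function names below are illustrative assumptions; the authors’ exact prompt is not reproduced here.

```python
# Hypothetical framing of a single benchmark query as a text prompt.
SYSTEM_PROMPT = (
    "You control a VR headset and two motion controllers. Given a high-level "
    "semantic action from a VR game, list the ordered device manipulations "
    "needed to perform it, one '<device>: <manipulation>' step per line."
)

def build_query(game: str, semantic_action: str) -> str:
    """Frame one ComboBench-style item as a natural-language task."""
    return f"Game: {game}\nSemantic action: {semantic_action}\nSteps:"

print(SYSTEM_PROMPT)
print(build_query("Half-Life: Alyx",
                  "Pick up the bottle, aim at the target, and throw it"))
```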

Performance was evaluated with three metrics: semantic action identification, sequential order preservation, and strict sequence matching. The study also investigated the impact of providing examples, assessed performance across different games, and conducted detailed error analysis to understand model limitations. The results show that LLMs struggle to sequence actions correctly, achieving low scores on sequential order preservation: while they can identify what needs to be done, they often fail to determine when, or in what order, to do it. Performance varies significantly across games, with LLMs performing better in simpler environments like Vivecraft and struggling with the complex physics of games like Half-Life: Alyx.
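The paper’s precise metric definitions are not reproduced in this summary, but common formulations matching the described intent would look roughly like the sketch below: set-overlap F1 for identification, a normalized longest common subsequence for order preservation, and exact equality for strict matching.

```python
# Hedged sketch of the three evaluation metrics; the benchmark's exact
# definitions may differ from these standard formulations.

def identification_f1(pred: list[str], gold: list[str]) -> float:
    """Semantic action identification: right manipulations, ignoring order."""
    if not pred or not gold:
        return 0.0
    tp = len(set(pred) & set(gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def order_preservation(pred: list[str], gold: list[str]) -> float:
    """Longest common subsequence of the two step lists, normalized by
    the gold length (dynamic-programming LCS)."""
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred):
        for j, g in enumerate(gold):
            dp[i + 1][j + 1] = dp[i][j] + 1 if p == g else max(dp[i][j + 1],
                                                               dp[i + 1][j])
    return dp[-1][-1] / len(gold) if gold else 0.0

def strict_match(pred: list[str], gold: list[str]) -> bool:
    """Strict sequence matching: every step correct and in the right order."""
    return pred == gold

gold = ["right grip: hold", "right controller: swing", "right grip: release"]
pred = ["right grip: hold", "right grip: release", "right controller: swing"]
print(identification_f1(pred, gold))   # 1.0  -> all steps identified
print(order_preservation(pred, gold))  # 0.67 -> order only partially kept
print(strict_match(pred, gold))        # False
```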

Providing examples improves performance, but only up to a point: beyond a few demonstrations, additional examples yield little further gain. Mixtral-8x7B showed the most consistent performance across games, while GPT-4o achieved the highest sequential order preservation in one game but performed poorly in others. These findings indicate a need for architectural innovations in LLMs to capture the temporal and causal reasoning required for complex VR interactions. Integrating visual and proprioceptive information could improve understanding, and training models on simulated VR experiences could provide the necessary experiential knowledge. Future research should prioritize methods for improving sequential reasoning in LLMs, and responsible development must consider safety, security, privacy, and equitable access. In short, while LLMs show promise in understanding what actions are needed in VR, they currently lack the ability to reliably sequence those actions in the correct order.
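As a rough illustration of the few-shot setup, the sketch below assembles a k-shot prompt capped at three demonstrations, mirroring the plateau observed above; the demonstration pool and formatting are hypothetical, not the authors’ exact protocol.

```python
# Hypothetical k-shot prompt assembly for the VR manipulation task.
def few_shot_prompt(demos: list[tuple[str, list[str]]],
                    query: str, k: int = 3) -> str:
    """Prepend up to k solved (semantic action -> step list) demonstrations."""
    blocks = []
    for action, steps in demos[:k]:
        step_text = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
        blocks.append(f"Semantic action: {action}\n{step_text}")
    blocks.append(f"Semantic action: {query}\n1.")
    return "\n\n".join(blocks)

demos = [("Open the door", ["left controller: reach handle",
                            "left controller: press grip",
                            "left controller: pull back"])]
print(few_shot_prompt(demos, "Throw the bottle at the target"))
```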

Cognitive Taxonomy for Virtual Reality Instruction Following

The research team engineered ComboBench, a benchmark designed to rigorously evaluate the capacity of large language models (LLMs) to translate high-level instructions into precise sequences of actions within virtual reality environments. To establish a robust evaluation framework, the study pioneered a six-dimensional cognitive capability taxonomy, developed through structured interviews with domain experts in cognitive science, spatial cognition, and embodied interaction. The selection of appropriate VR games for ComboBench followed a systematic process, beginning with a query of the Steam store filtered for “VR Only” titles available in English and sorted by user review scores.

To populate ComboBench with realistic interaction scenarios, eight data annotators manually identified 262 distinct semantic actions from the collected walkthroughs. These semantic actions, defined as high-level, goal-oriented tasks, were selected for their complexity and importance to game progression. Experienced VR users then played through each scenario, meticulously recording the precise sequence of device manipulations required for completion, detailing the device used and the specific actions performed at each step. This annotation process yields a rich dataset for evaluating LLM performance in translating semantic goals into physical actions.
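Based on the annotation process just described, a plausible record layout for one benchmark item might look like the sketch below; the field names are illustrative inferences, not the benchmark’s published schema.

```python
# Hypothetical data schema for one annotated ComboBench item.
from dataclasses import dataclass

@dataclass
class ManipulationStep:
    device: str   # e.g. "left controller", "right controller", "headset"
    action: str   # e.g. "press trigger", "tilt forward", "release grip"

@dataclass
class BenchmarkItem:
    game: str                              # source VR title
    semantic_action: str                   # high-level, goal-oriented task
    gold_sequence: list[ManipulationStep]  # annotated manipulations, in order

item = BenchmarkItem(
    game="Half-Life: Alyx",
    semantic_action="Pick up the bottle, aim at the target, and throw it",
    gold_sequence=[
        ManipulationStep("right controller", "point at bottle"),
        ManipulationStep("right controller", "press grip to grab"),
        ManipulationStep("right controller", "swing arm toward target"),
        ManipulationStep("right controller", "release grip mid-swing"),
    ],
)
print(len(item.gold_sequence), "steps recorded for:", item.semantic_action)
```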

LLMs Navigate Virtual Reality Game Actions

Researchers have introduced ComboBench, a new benchmark designed to evaluate how well large language models (LLMs) can translate high-level instructions into precise sequences of actions within virtual reality (VR) games. The team evaluated seven LLMs, assessing their performance across six key cognitive abilities: task decomposition, procedural reasoning, spatial reasoning, object interaction, motor action mapping, and judgment of termination conditions. Results demonstrate that all models exhibit strong capabilities in breaking down complex tasks into smaller steps, but struggle with procedural reasoning and accurately mapping abstract actions to specific VR controller commands.

Gemini-1.5-Pro achieved the most balanced performance across all cognitive areas. The study revealed that providing LLMs with a few examples, specifically up to three, significantly improved their performance, particularly in understanding procedural steps. Performance also varied considerably depending on the game, with models generally performing better in environments with consistent interaction patterns, such as Vivecraft, compared to those requiring more nuanced controller manipulations, like Half-Life: Alyx. These findings highlight specific cognitive gaps in current LLMs and provide valuable insights for improving their ability to perform simulated embodied reasoning within immersive VR environments.

👉 More information
🗞 ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
🧠 ArXiv: https://arxiv.org/abs/2510.24706

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
