The challenge of translating intentions into precise physical actions lies at the heart of playing virtual reality games, a task humans accomplish with natural ease. Shuqing Li, Jiayi Yan, Chenyu Niu, and colleagues from the Chinese University of Hong Kong, along with Yun Peng and Wenxuan Wang, now investigate whether large language models can bridge this gap between semantic understanding and device manipulation. They introduce ComboBench, a new benchmark designed to rigorously evaluate the ability of LLMs to translate high-level game actions into the specific device-manipulation sequences required within virtual environments, across popular games such as Half-Life: Alyx and Moss: Book II. This research demonstrates that while models such as Gemini-1.5-Pro exhibit promising task decomposition skills, they still fall short of human performance, particularly in procedural reasoning and spatial awareness, revealing key areas for improvement in artificial intelligence's ability to interact with virtual worlds.
LLMs Reason About Virtual Reality Interactions
This research details a comprehensive investigation into the capabilities and limitations of large language models (LLMs) in understanding and reasoning about interactions within virtual reality (VR) environments. The study explores how well LLMs can translate semantic actions (meaning-based instructions) into the precise physical manipulations required within a VR setting, and identifies key areas where current models fall short. Researchers conducted experiments across four diverse VR games: Vivecraft, Half-Life: Alyx, Blade and Sorcery, and Job Simulator. They defined complex action sequences, such as picking up an object, aiming at a target, and throwing it, then tasked LLMs with generating the steps needed to perform these sequences.
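To make the task formulation concrete, here is a minimal sketch of how such a query might be posed to an LLM. The prompt wording and function are hypothetical illustrations, not the paper's actual prompts:

```python
def build_prompt(game: str, semantic_action: str) -> str:
    """Assemble a zero-shot query asking the model to translate a
    high-level semantic action into numbered device manipulations."""
    return (
        f"You are playing the VR game {game}.\n"
        f"High-level action to perform: {semantic_action}\n"
        "List, step by step, the exact controller/headset manipulations "
        "needed to complete this action. Number each step and name the "
        "device and the input (button, trigger, motion) it requires."
    )

prompt = build_prompt(
    game="Half-Life: Alyx",
    semantic_action="Pick up the object, aim at the target, and throw it",
)
print(prompt)
```

The model's numbered output can then be parsed into a step list and compared against the ground-truth manipulation sequence.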
Performance was evaluated using metrics measuring semantic action identification, sequential order preservation, and strict sequence matching. The study also investigated the impact of providing examples and assessed performance across different games, conducting detailed error analysis to understand model limitations. The results demonstrate that LLMs struggle with sequencing actions correctly, achieving low scores on sequential order preservation. While they can identify what needs to be done, they often fail to determine when or in what order. Performance varies significantly across games, with LLMs performing better in simpler environments like Vivecraft and struggling with complex physics in games like Half-Life: Alyx.
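The paper's exact scoring formulas are not reproduced here, but the three evaluation styles described above can be approximated as follows. This is a sketch; in particular, the LCS-based order-preservation score is an assumption about how sequential order might plausibly be measured, not the paper's definition:

```python
from typing import List

def identification_score(pred: List[str], gold: List[str]) -> float:
    """Order-insensitive credit: fraction of ground-truth steps that
    appear anywhere in the prediction (semantic action identification)."""
    return len(set(pred) & set(gold)) / len(gold)

def lcs_length(a: List[str], b: List[str]) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def order_preservation_score(pred: List[str], gold: List[str]) -> float:
    """Order-sensitive credit: longest common subsequence with the
    ground truth, relative to the ground-truth length."""
    return lcs_length(pred, gold) / len(gold)

def strict_match(pred: List[str], gold: List[str]) -> bool:
    """Strict sequence matching: exact steps in the exact order."""
    return pred == gold

gold = ["grasp", "aim", "swing", "release"]
pred = ["aim", "grasp", "swing", "release"]
print(identification_score(pred, gold))      # 1.0  (every step is present)
print(order_preservation_score(pred, gold))  # 0.75 (one step out of order)
print(strict_match(pred, gold))              # False
```

The gap between the first and second scores captures exactly the failure mode reported above: a model can identify all the right steps yet still lose order-preservation credit.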
Providing examples improves performance, but the gains plateau: beyond a certain point, additional examples yield little further improvement. Mixtral-8x7B showed the most consistent performance across games, while GPT-4o achieved the highest sequential order preservation in one game but performed poorly in others. These findings indicate a need for architectural innovations in LLMs to capture the temporal and causal reasoning required for complex VR interactions. Integrating visual and proprioceptive information could improve understanding, and training models on simulated VR experiences could provide the necessary experiential knowledge. Future research should prioritize developing methods for improving sequential reasoning in LLMs, and responsible development must consider safety, security, privacy, and equitable access. This research demonstrates that while LLMs show promise in understanding what actions are needed in VR, they currently lack the ability to reliably sequence those actions in the correct order.
Cognitive Taxonomy for Virtual Reality Instruction Following
The research team engineered ComboBench, a new benchmark designed to rigorously evaluate the capacity of large language models (LLMs) to translate high-level instructions into precise sequences of actions within virtual reality environments. To establish a robust evaluation framework, the study pioneered a six-dimensional cognitive capability taxonomy, developed through structured interviews with domain experts in cognitive science, spatial cognition, and embodied interaction. The selection of appropriate VR games for ComboBench involved a systematic process beginning with a query of the Steam store, filtering for “VR Only” titles available in English and sorted by user review scores. To populate ComboBench with realistic interaction scenarios, eight data annotators manually identified 262 distinct semantic actions from walkthroughs collected for the selected games. These semantic actions, defined as high-level goal-oriented tasks, were selected for their complexity and importance to game progression. Experienced VR users then played through each scenario, meticulously recording the precise sequence of device manipulations required for completion, detailing the device used and the specific actions performed at each step. This detailed annotation process provides a rich dataset for evaluating LLM performance in translating semantic goals into physical actions.
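The resulting annotations pair each semantic action with a ground-truth manipulation sequence. A minimal sketch of what one record might look like follows; the field names and step wording are illustrative, not the benchmark's released schema:

```python
# Hypothetical ComboBench-style annotation record. The semantic action
# mirrors the pick-up/aim/throw example described in the paper summary;
# the step breakdown is an illustrative assumption.
record = {
    "game": "Job Simulator",
    "semantic_action": "Pick up the object, aim at the target, and throw it",
    "steps": [
        {"device": "right controller", "manipulation": "move controller to the object"},
        {"device": "right controller", "manipulation": "hold the grip button to grasp"},
        {"device": "headset",          "manipulation": "turn to face the target"},
        {"device": "right controller", "manipulation": "swing the arm forward"},
        {"device": "right controller", "manipulation": "release the grip button"},
    ],
}
```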
LLMs Navigate Virtual Reality Game Actions
Researchers have introduced ComboBench, a new benchmark designed to evaluate how well large language models (LLMs) can translate high-level instructions into precise sequences of actions within virtual reality (VR) games. The team evaluated seven LLMs, assessing their performance across six key cognitive abilities: task decomposition, procedural reasoning, spatial reasoning, object interaction, motor action mapping, and judgment of termination conditions. Results demonstrate that all models exhibit strong capabilities in breaking down complex tasks into smaller steps, but struggle with procedural reasoning and accurately mapping abstract actions to specific VR controller commands.
Gemini-1.5-Pro achieved the most balanced performance across all cognitive areas. The study revealed that providing LLMs with a few examples, specifically up to three, significantly improved their performance, particularly in understanding procedural steps. Performance also varied considerably depending on the game, with models generally performing better in environments with consistent interaction patterns, such as Vivecraft, compared to those requiring more nuanced controller manipulations, like Half-Life: Alyx. These findings highlight specific cognitive gaps in current LLMs and provide valuable insights for improving their ability to perform simulated embodied reasoning within immersive VR environments.
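Building on the zero-shot template sketched earlier, few-shot prompting simply prepends worked examples before the target query. A minimal sketch, reusing the hypothetical record format above and capping at the three examples where the study observed gains:

```python
from typing import Dict, List

def build_few_shot_prompt(game: str, semantic_action: str,
                          examples: List[Dict], k: int = 3) -> str:
    """Prepend up to k annotated examples before the target query.
    The formatting is an illustrative assumption, not the paper's prompt."""
    parts = []
    for ex in examples[:k]:
        steps = "\n".join(
            f"{i + 1}. [{s['device']}] {s['manipulation']}"
            for i, s in enumerate(ex["steps"])
        )
        parts.append(
            f"Example ({ex['game']}):\n"
            f"Action: {ex['semantic_action']}\n{steps}"
        )
    parts.append(
        f"Now you are playing {game}.\n"
        f"Action: {semantic_action}\n"
        "List the required device manipulations step by step."
    )
    return "\n\n".join(parts)
```

Keeping `k` small reflects the plateau reported above: the first few demonstrations teach the expected output structure, after which additional examples add little.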
👉 More information
🗞 ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
🧠 ArXiv: https://arxiv.org/abs/2510.24706
