Researchers are tackling the challenge of equipping artificial intelligence agents with robust physical reasoning capabilities in complex environments. Sean Memery and Kartic Subr of the University of Edinburgh, together with their colleagues, present a novel method for distilling meaningful patterns from detailed simulation traces. The work addresses a critical limitation of current language-model approaches, which often struggle with physics-based tasks because they lack a grounded understanding of simulation. By synthesising programs that identify coarse-grained patterns, such as rigid-body collisions or stable support, in simulation logs, the team demonstrates a significant improvement in natural-language reasoning about physical systems and enables more effective generation of reward programs from human-specified goals.
Discovering Abstracted Physics Patterns from Simulation for Improved Language Model Reasoning
Researchers have developed a new method to enhance artificial intelligence’s ability to reason about physics-based interactions, bridging the gap between natural-language instructions and complex simulations. This work addresses a critical challenge in AI: enabling agents to not only perceive and act within physical environments, but also to understand and respond to human guidance expressed in natural language.
The study introduces a technique for automatically discovering coarse-grained patterns, such as ‘rigid-body collision’ or ‘stable support’, directly from detailed simulation logs. Specifically, the team synthesises programs that operate on these simulation logs, effectively mapping raw data into a series of high-level, activated patterns.
Through testing on two physics benchmarks, the research demonstrates that this annotated representation of simulation data significantly improves a Language Model’s capacity for reasoning about physical systems. This approach moves beyond simply providing simulation traces as context, which can be computationally expensive due to the large volumes of data involved.
The innovation lies in the creation of a pattern library, populated with code that can detect these high-level patterns within a simulation. This library then annotates simulation traces, creating a matrix of pattern activations that are more easily interpreted by Language Models. Consequently, these models can generate effective reward programs from natural language goals, offering a pathway for improved planning and supervised learning within complex environments. An example application involves a game designer specifying a goal like “Get the green ball in the second bucket, by passing between the two obstacles”, with the system automatically generating an optimised reward program to achieve this.
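To make this concrete, a single entry in such a pattern library might resemble the sketch below. The frame format, object layout, and thresholds are illustrative assumptions for this article, not details taken from the paper.

```python
import math

# Hypothetical trace format: each frame maps an object id to a
# (position, velocity) pair of 2D tuples. Thresholds are illustrative.

def detect_rigid_body_collision(prev_frame, frame, distance_eps=1.0, dv_eps=0.5):
    """Flag a collision when two objects are close together and at least one
    of them changes velocity sharply between consecutive frames."""
    for a in frame:
        for b in frame:
            if a >= b:  # visit each unordered pair once
                continue
            (pa, va), (pb, vb) = frame[a], frame[b]
            dist = math.dist(pa, pb)
            dv_a = math.dist(va, prev_frame[a][1])
            dv_b = math.dist(vb, prev_frame[b][1])
            if dist < distance_eps and max(dv_a, dv_b) > dv_eps:
                return True
    return False
```

A detector like this reduces thousands of raw state values to a single boolean per frame, which is the kind of compression that makes the trace digestible for a language model.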
Automated pattern detection and temporal annotation of simulation data
At the core of this research, evolutionary programming synthesised programs that detect coarse-grained patterns in detailed simulation logs. The process began with a set of text descriptions from a domain expert defining potential high-level patterns, such as ‘rigid-body collision’ or ‘stable support’.
These descriptions then drove the automated synthesis of programs designed to identify activations of these patterns within raw simulation states. The resulting programs functioned as detectors, scanning simulation traces to pinpoint specific frames where defined patterns occurred. A crucial innovation was the creation of an annotation matrix representing pattern activations over time.
This matrix documented the presence or absence of each pattern at each simulation frame, effectively transforming the high-dimensional, fine-grained simulation data into a more manageable, interpretable format. The annotation process involved applying the detector code to the simulation trace, generating a record of pattern occurrences.
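The annotation step described above can be sketched as a small loop that applies every detector in the library to consecutive frames of a trace. The detector signature and trace format here are assumptions for illustration:

```python
# Build a patterns x frame-transitions matrix of boolean activations by
# running each detector in a pattern library over a simulation trace.
# Detectors are assumed to take (previous_frame, current_frame) and return
# a bool; this interface is a hypothetical convention, not the paper's.

def annotate_trace(trace, pattern_library):
    """trace: list of frames; pattern_library: dict of name -> detector.
    Returns a dict mapping each pattern name to its activation sequence."""
    matrix = {name: [] for name in pattern_library}
    for prev, cur in zip(trace, trace[1:]):
        for name, detector in pattern_library.items():
            matrix[name].append(detector(prev, cur))
    return matrix
```

The resulting matrix is small and symbolic, so it can be serialised directly into a language model's context where the raw trace could not.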
This annotated representation then served as input for downstream tasks, including summarisation, physics planning, question answering, and reward program synthesis. Two physics benchmarks were used to evaluate the efficacy of this method. The research demonstrated that Language Models (LMs) could more effectively reason about physical systems when presented with these annotated traces, compared to raw simulation data.
Specifically, the annotated traces enabled LMs to generate effective reward programs from natural language goals, such as “Get the green ball in the second bucket, by passing between the two obstacles”. These reward programs were then optimised to solve the specified tasks, showcasing the method’s potential within planning and supervised learning contexts.
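Under a deliberately simplified state representation, a reward program for a goal like the one above might look as follows. The coordinate layout, bucket and gap positions, and scoring terms are all hypothetical choices for this sketch:

```python
import math

# Hedged sketch of a reward program for "get the green ball in the second
# bucket, passing between the two obstacles". The trajectory is assumed to
# be a list of 2D ball positions; all coordinates are invented for the demo.

def reward(trajectory, bucket_pos=(8.0, 0.0), gap_x=4.0, gap=(2.0, 3.0)):
    """Reward proximity of the final ball position to the bucket, plus a
    bonus if the ball ever crossed the vertical gap between the obstacles."""
    final_dist = math.dist(trajectory[-1], bucket_pos)
    passed_gap = any(
        abs(x - gap_x) < 0.25 and gap[0] <= y <= gap[1]
        for x, y in trajectory
    )
    return -final_dist + (1.0 if passed_gap else 0.0)
```

A program in this form is directly optimisable: a planner can search over actions and keep the trajectory with the highest score.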
Automated pattern discovery facilitates natural language reasoning about physical simulations
Synthesised programs successfully detect coarse-grained patterns from detailed simulation logs, enabling improved natural language reasoning about physical systems. These programs map simulation logs to a series of high-level activated patterns, demonstrating a novel approach to representing complex physical interactions.
The research focuses on discovering patterns such as ‘rigid-body collision’ and ‘stable support’ directly from simulation data. This annotated representation of simulation logs facilitates more effective natural language reasoning compared to using raw simulation traces. Through two physics benchmarks, the study demonstrates the method’s ability to generate effective reward programs from goals specified in natural language.
These reward programs can then be used within planning or supervised learning contexts. The work builds upon the insight that language models perform better when reasoning about high-level events rather than low-level simulation states. Additionally, the research leverages the effectiveness of language models in generating executable code that models environment dynamics and structure.
The approach utilises FunSearch, an evolutionary program-search method, to synthesise pattern detectors. Candidate programs are scored on how well their emitted event streams correlate with meaningful differences in trajectory geometry, with a penalty that discourages redundancy. The evaluation function assesses candidate outputs and rejects invalid ones, ensuring the quality of the synthesised programs.
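Stripped of its language-model proposer, the evolutionary loop can be sketched generically as below. Candidates here are plain Python values standing in for detector programs, and the scoring function is a stand-in for the paper's trajectory-correlation objective, so every name is illustrative:

```python
import random

# Minimal evolutionary-search sketch in the spirit of FunSearch: mutate
# candidates drawn from a small elite pool, score them, reject invalid ones
# (score returns None), and keep the best. In the real system a language
# model proposes program mutations; here `mutate` is an arbitrary function.

def evolve(score, seed, mutate, generations=50, population=16, rng=None):
    rng = rng or random.Random(0)
    pool = [seed]
    for _ in range(generations):
        # Elitism: carry the current pool forward alongside fresh mutants.
        candidates = pool + [mutate(rng.choice(pool), rng) for _ in range(population)]
        scored = [(score(c), c) for c in candidates]
        scored = [(s, c) for s, c in scored if s is not None]  # reject invalid
        scored.sort(key=lambda sc: sc[0], reverse=True)
        pool = [c for _, c in scored[:4]] or pool
    return pool[0]
```

Because invalid candidates are filtered before selection, the loop never wastes pool slots on programs that crash or emit malformed event streams, which mirrors the filtering role the evaluation function plays in the paper.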
Experiments employed the Qwen3-VL 8B (Thinking) vision-language model with vLLM as the inference backend, and Qwen3-Coder 30B for the code evolution pipeline due to its superior programming capabilities. Comparisons were also conducted using different sizes of Qwen3-VL models and Nvidia’s Cosmos-2 Reasoning model, all built upon a Qwen3-VL backbone.
Automated discovery of physical patterns enhances agent reasoning and control
Researchers have developed a method to improve how artificial intelligence agents reason about physics-based environments through natural language interaction. The core innovation lies in automatically discovering and annotating coarse-grained patterns from detailed simulation logs, such as ‘rigid-body collision’ or ‘stable support’.
These patterns are then used to create a more manageable representation of complex physical events, facilitating improved performance in tasks requiring physical reasoning. This annotated representation of simulation data demonstrably enhances the ability of language models to interpret and reason about physical systems, specifically in question answering, summarisation, and the Phyre benchmark.
Furthermore, the learned pattern library enables the synthesis of executable reward programs from natural language goals, supporting both optimisation and supervised machine learning approaches to agent control. The method offers a practical interface between physics environments and language models, bridging the gap between simulation data and natural language understanding.
The current demonstrations are limited to two-dimensional rigid body interactions within the Phyre environment, acknowledging that further empirical validation is needed to assess generalizability to more complex scenarios. Additionally, the experiments focused on relatively simple tasks utilising small libraries of seed patterns; the impact of larger, denser libraries on performance, potentially due to increased context length, remains an open question. Future work may explore the application of this approach to more complex environments and tasks, and investigate the scalability of the pattern library for broader applicability.
👉 More information
🗞 Discovering High Level Patterns from Simulation Traces
🧠 ArXiv: https://arxiv.org/abs/2602.10009
