Advances in Robotic Manipulation: LaST Improves Actions with Spatio-Temporal Reasoning

Recent advances in robotic manipulation are being driven by Vision-Language-Action (VLA) models, which promise greater adaptability and generalisation. Zhuoyang Liu, Jiaming Liu, and Hao Chen of Peking University and CUHK, together with Ziyu Guo, Chengkai Hou, and Chenyang Gu, present LaST, a new framework designed to overcome limitations of existing VLA approaches. Their research addresses the trade-off between reasoning accuracy and inference speed, a critical factor for real-time robotic control, as well as the difficulty of representing complex physical interactions through language alone. LaST introduces a Latent Spatio-Temporal Chain-of-Thought, enabling efficient and nuanced reasoning by modelling visual dynamics, 3D structure, and robot movements within a compact latent space. Through extensive testing in both simulated and real-world scenarios, the team demonstrates that LaST significantly improves success rates and inference speed over current state-of-the-art VLA methods.

Robotic Reasoning with Spatio-Temporal Chains

LaST: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model introduces a novel framework designed to enhance robotic capabilities in understanding and executing complex tasks. LaST leverages a latent spatio-temporal chain-of-thought mechanism, enabling the robot to reason internally about a task's sequential steps and spatial relationships. This approach improves generalisation and robustness in robotic vision-language-action tasks, moving beyond simple instruction following.

The core innovation lies in the model's ability to decompose a high-level instruction into a series of intermediate, spatially aware reasoning steps. By representing these steps as a latent chain, LaST bridges the gap between language understanding and action planning, allowing the robot to anticipate future states and adapt to unforeseen circumstances. Extensive experiments validate the effectiveness of LaST, showing superior performance on benchmark datasets compared to existing state-of-the-art methods. The framework also incorporates a spatio-temporal representation that explicitly models the relationships between objects and their movements within the environment, so the robot understands not only what to do, but also where and when to perform each action. The model's architecture is modular and scalable, facilitating integration into a variety of robotic platforms and applications and positioning LaST to advance robotic intelligence and human-robot interaction.

Latent Spatio-Temporal Reasoning for Robotic Manipulation

Scientists have developed LaST, a novel framework that significantly enhances reasoning capabilities in robotic manipulation through a Latent Spatio-Temporal Chain-of-Thought (CoT). The research team introduced a token-efficient latent CoT space designed to model future visual dynamics, 3D structural information, and robot proprioceptive states, extending these representations over time for temporally consistent reasoning. This approach captures fine-grained physical and robotic dynamics often difficult to express verbally, offering a substantial improvement over existing Vision-Language-Action (VLA) models. The core of LaST lies in its ability to compress high-dimensional sensory inputs into a latent sequence, avoiding the computational burden of decoding pixel-level images or lengthy text.
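The token-budget argument above can be made concrete with a toy sketch. The sizes, the random projection, and the variable names below are illustrative assumptions, not details from the paper; the real model uses a learned encoder, but the point stands: predicting a short latent sequence means emitting orders of magnitude fewer values than decoding a future frame pixel by pixel.

```python
import numpy as np

# Hypothetical sizes (not from the paper): one flattened 224x224 RGB future
# frame versus a compact latent CoT budget of 8 tokens of dimension 256.
rng = np.random.default_rng(0)
obs = rng.standard_normal(224 * 224 * 3)       # 150,528 values to predict if decoding pixels
num_tokens, latent_dim = 8, 256

# A random linear map stands in for the learned encoder that compresses
# high-dimensional sensory input into the latent sequence.
W = rng.standard_normal((num_tokens * latent_dim, obs.size)) / np.sqrt(obs.size)
latent = (W @ obs).reshape(num_tokens, latent_dim)

print(obs.size, "->", latent.size)             # 150528 -> 2048, roughly 73x fewer values
```

The same reduction applies to text: a latent token replaces what might otherwise be a lengthy verbal description of contact geometry or object motion.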

Researchers organized representations into an interleaved, chronological order, preserving causal physical dependencies and enabling flexible temporal granularity through keyframe extraction. This sequence structure facilitates a holistic understanding of both the physical world and the robot's internal state, allowing the model to learn coupled dynamics across modalities. To further optimise performance, the study implemented a dual-system design using a Mixture-of-Experts approach: a slow reasoning expert performs low-frequency latent inference, while a fast action expert generates high-frequency actions conditioned on robotics-oriented latent representations.
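The interleaving scheme can be sketched in a few lines. The modality names and the stride below are placeholders, not the paper's identifiers; the sketch only shows the ordering idea: at each extracted keyframe, one latent per modality is appended in a fixed chronological order, so causal dependencies across modalities are preserved.

```python
# Toy sketch of the interleaved, chronological latent sequence
# (modality names and stride are illustrative assumptions).
def build_latent_sequence(timesteps, keyframe_stride=4):
    """Interleave per-modality latents at keyframes only, keeping
    chronological order so causal physical dependencies survive."""
    modalities = ("visual", "depth3d", "proprio")  # future dynamics, 3D structure, robot state
    sequence = []
    for t in range(0, timesteps, keyframe_stride):  # keyframe extraction
        for m in modalities:                        # fixed interleaving within a keyframe
            sequence.append((t, m))
    return sequence

seq = build_latent_sequence(timesteps=16, keyframe_stride=4)
print(seq[:6])
# [(0, 'visual'), (0, 'depth3d'), (0, 'proprio'), (4, 'visual'), (4, 'depth3d'), (4, 'proprio')]
```

Varying `keyframe_stride` is what gives the temporal granularity its flexibility: a denser stride yields finer-grained reasoning at the cost of a longer latent sequence.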

Tests demonstrated an 8% improvement in mean success rates across ten simulated manipulation tasks and a 13% increase across six real-world scenarios, compared to prior VLA methods, while simultaneously delivering substantially faster inference. The method further advances robotic control through asynchronous frequency coordination, which decouples the operating frequencies of the reasoning and acting experts: the slow expert operates at sparse keyframes, performing latent CoT reasoning, while the fast expert continuously generates actions based on the most recent latent output.
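The asynchronous coordination described above can be illustrated with a minimal control loop. This is a schematic sketch, not the released implementation: the string placeholders stand in for slow-expert latent inference and fast-expert action generation, and the point is only the scheduling pattern, where the fast expert never waits for the slow one and always conditions on the latest available latent.

```python
# Schematic of asynchronous frequency coordination (a sketch, not the
# authors' code): the slow expert refreshes the latent CoT only every
# `slow_every` steps; the fast expert emits an action at every step.
def control_loop(steps, slow_every=4):
    latent, trace = None, []
    for t in range(steps):
        if t % slow_every == 0:           # low-frequency latent CoT reasoning
            latent = f"latent@{t}"        # stands in for slow-expert inference
        action = f"action@{t}<-{latent}"  # high-frequency acting expert, conditioned on latent
        trace.append(action)
    return trace

trace = control_loop(steps=8, slow_every=4)   # 1:4 slow-to-fast ratio
print(trace[3], trace[4])
# action@3<-latent@0 action@4<-latent@4
```

Between keyframes the fast expert keeps acting on a slightly stale latent (steps 1-3 reuse `latent@0`), which is precisely the trade that buys the reported inference-speed gains.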

Measurements confirm an inference speed of 15.4Hz on a single RTX 4090 GPU with a 1:4 slow-to-fast frequency ratio, aided by caching Key-Value (KV) states. Furthermore, the research demonstrates LaST's robustness in long-horizon tasks, attaining a nearly five-fold higher success rate at the final step of multi-step real-world manipulations. This is directly enabled by the method's ability to capture and propagate nuanced physical dynamics within the latent CoT, facilitating temporally consistent and accurate robotic control, and paving the way for more agile and perceptive robotic systems.

Latent Spatio-Temporal Reasoning for Robotic Control

LaST, a novel vision-language-action model, introduces a Latent Spatio-Temporal Chain-of-Thought to enhance efficient reasoning for robotic manipulation. By representing future visual dynamics, 3D structural information, and robot proprioception within a compact latent space, the framework overcomes limitations of prior methods reliant on explicit linguistic reasoning. This approach allows for temporally consistent reasoning without the associated latency, enabling more responsive and accurate robotic control. The research demonstrates improved performance across both simulated and real-world manipulation tasks, achieving an 8% and 13% increase in success rates respectively, alongside faster inference speeds.

Central to this achievement is a dual-system architecture, where a low-frequency reasoning expert guides a high-frequency action expert, facilitating adaptive operation and real-time responsiveness. The authors acknowledge limitations in the expressiveness of the latent reasoning space and suggest future work will focus on developing richer, more structured physical abstractions. Further research directions include exploring reinforcement learning to jointly optimise latent reasoning and action generation, and scaling the model to tackle more complex, long-horizon manipulation tasks involving delayed rewards and dynamic environments. The work represents a step towards more scalable and physically grounded reasoning within robotic foundation models, offering a promising pathway for advanced robotic control systems.

👉 More information
🗞 LaST: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model
🧠 ArXiv: https://arxiv.org/abs/2601.05248

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Symmetry-based Quantum Sensing Enables High-Precision Measurements, Outperforming GHZ States

January 13, 2026
Quantum Algorithm Enables Efficient Simulation of Sparse Quartic Hamiltonians for Time Horizons

January 13, 2026
Fermionic Fractional Chern Insulators Demonstrate Existence of Chiral Graviton Modes

January 13, 2026