TwinBrainVLA Achieves Robotic Control by Resolving Catastrophic Forgetting in VLMs

Researchers are tackling the persistent challenge of building robots capable of both complex reasoning and dexterous physical manipulation. Bin Yu, Shijie Lian, and Xiaopeng Lin from ZGCA, alongside Yuliang Wei from HIT and Zhaolong Shen and Changti Wu from ZGCA, present TwinBrainVLA, a new architecture designed to overcome the limitations of current Vision-Language-Action (VLA) models. Their work addresses the critical issue of ‘catastrophic forgetting’, where robots lose general knowledge when learning specific skills, by employing a unique ‘twin brain’ system. This approach coordinates a generalist VLM for broad understanding with a specialist VLM focused on embodied perception, allowing robots to perform intricate tasks while retaining comprehensive visual reasoning abilities, and it represents a significant step towards truly versatile, general-purpose robotics.

Dual VLMs resolve robotic control forgetting by retaining general knowledge

Scientists have unveiled TwinBrainVLA, a groundbreaking new architecture designed to overcome a critical limitation in Vision-Language-Action (VLA) models used for robotic control. The research addresses the inherent conflict between maintaining broad semantic understanding and acquiring the precise sensorimotor skills necessary for physical manipulation, a challenge that often leads to “catastrophic forgetting” of a model’s general capabilities. This innovative approach coordinates a generalist VLM, responsible for universal semantic understanding, with a specialist VLM dedicated to embodied proprioception, enabling joint robotic control with unprecedented efficacy. TwinBrainVLA synergizes a frozen “Left Brain”, retaining robust visual reasoning, with a trainable “Right Brain”, specialized for embodied perception, through a novel Asymmetric Mixture-of-Transformers (AsyMoT) mechanism.
The core innovation lies in the decoupling of cognitive and physical skills, mirroring the functional specialization observed in the human brain’s hemispheres. The “Right Brain” dynamically queries semantic knowledge from the frozen “Left Brain” and fuses it with proprioceptive states, providing rich conditioning for a Flow-Matching Action Expert to generate precise continuous controls. This design ensures the action expert receives spatially rich, task-aligned embeddings while the “Left Brain” explicitly preserves the model’s general semantic understanding. Extensive experiments conducted on the SimplerEnv and RoboCasa benchmarks demonstrate that TwinBrainVLA achieves superior manipulation performance compared to state-of-the-art baseline models.
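The article summarizes the paper without reference code, but the data flow it describes can be illustrated with a rough sketch. The following PyTorch snippet is a hypothetical, heavily simplified stand-in: the module choices, widths, token counts, and the use of a plain cross-attention layer in place of AsyMoT are all assumptions made for illustration, not the authors’ implementation.

```python
import torch
import torch.nn as nn

# Hypothetical end-to-end data flow for the architecture described above.
# All modules, widths, and token counts are illustrative assumptions.
DIM = 512

left_brain = nn.TransformerEncoder(nn.TransformerEncoderLayer(DIM, 8, batch_first=True), 2)   # frozen generalist VLM stand-in
right_brain = nn.TransformerEncoder(nn.TransformerEncoderLayer(DIM, 8, batch_first=True), 2)  # trainable embodied VLM stand-in
fuse = nn.MultiheadAttention(DIM, 8, batch_first=True)  # crude stand-in for the AsyMoT fusion
proprio_proj = nn.Linear(14, DIM)                       # embed joint/gripper state
action_head = nn.Linear(DIM, 7)                         # stand-in for the flow-matching action expert

for p in left_brain.parameters():                       # the "Left Brain" stays frozen
    p.requires_grad_(False)

vis_lang_tokens = torch.randn(1, 32, DIM)               # tokens from image + instruction
proprio_state = torch.randn(1, 1, 14)                   # robot proprioception

with torch.no_grad():                                   # no gradients flow into the Left Brain
    left_hidden = left_brain(vis_lang_tokens)           # general semantic features
right_hidden = right_brain(torch.cat([vis_lang_tokens, proprio_proj(proprio_state)], dim=1))
fused, _ = fuse(right_hidden, left_hidden, left_hidden) # Right Brain queries the Left Brain
action = action_head(fused.mean(dim=1))                 # conditioning for continuous control
print(action.shape)                                     # torch.Size([1, 7])
```

Even in this toy form the structural point is visible: gradients never reach the frozen pathway, while the trainable pathway absorbs proprioception and feeds the action head.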

The study establishes a novel VLA architecture, the first to explicitly separate general semantic understanding from embodied perception via an asymmetric dual-stream design, resolving the training conflict inherent in single-backbone VLAs. The researchers introduced the AsyMoT mechanism for information interaction between the two isomorphic VLM pathways that form the VLA backbone, and employed an asymmetric parameter freezing strategy to facilitate joint training. This allows joint attention over hidden states without sharing parameters, maximizing efficiency and performance. The work opens exciting possibilities for building general-purpose robots capable of simultaneously achieving high-level semantic understanding and low-level physical dexterity.

Furthermore, comparative experiments and evaluations on the SimplerEnv and RoboCasa benchmarks conclusively demonstrate the effectiveness of the TwinBrainVLA architecture, the AsyMoT mechanism, and the proposed training strategy. The team achieved superior manipulation performance while explicitly preserving the comprehensive visual understanding capabilities of the pre-trained VLM, offering a promising direction for advanced robotics and artificial intelligence. This breakthrough reveals a pathway towards robots that not only understand instructions but also execute them with precision and adaptability, paving the way for more versatile and intelligent machines.

Dual-brain architecture for continual robotic learning enables more versatile robotic systems

Scientists introduced TwinBrainVLA, a novel architecture designed to address catastrophic forgetting in robotic control systems. The research team engineered a dual-stream system comprising a frozen “Left Brain” and a trainable “Right Brain” to decouple general semantic understanding from embodied perception. This innovative approach draws inspiration from hemispheric lateralization in the human brain, allocating distinct cognitive functions to specialized pathways. The Left Brain, a pre-trained Vision-Language Model (VLM), retains robust visual reasoning and instruction-following capabilities, remaining unchanged throughout the training process.

Experiments employed an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism to seamlessly fuse information from both brains. AsyMoT enables joint attention over hidden states without parameter sharing, allowing the Right Brain to dynamically query semantic knowledge from the frozen Left Brain. The Right Brain, also initialized as a VLM, is fully trainable and specializes in embodied perception, processing proprioceptive states alongside visual inputs. This design ensures the action expert receives spatially rich, task-aligned embeddings from the Right Brain, while the Left Brain explicitly preserves the model’s general semantic understanding.
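To make the joint-attention idea concrete, the sketch below shows one way such an asymmetric attention layer could look: the Right-Brain queries attend over the concatenated hidden states of both pathways, while each pathway keeps its own, unshared projection weights and the Left-Brain projections stay frozen. The layer structure and dimensions are assumptions for illustration; the paper’s AsyMoT blocks sit inside full transformer layers and may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsyMoTAttentionSketch(nn.Module):
    """Rough sketch of asymmetric joint attention: Right-Brain queries attend
    over [Left ; Right] hidden states with no parameter sharing between the
    pathways. Layer layout and sizes are illustrative assumptions."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.left_kv = nn.Linear(dim, 2 * dim)   # Left-Brain keys/values (frozen)
        self.right_q = nn.Linear(dim, dim)       # Right-Brain queries (trainable)
        self.right_kv = nn.Linear(dim, 2 * dim)  # Right-Brain keys/values (trainable)
        self.out = nn.Linear(dim, dim)
        for p in self.left_kv.parameters():      # the Left-Brain side stays frozen
            p.requires_grad_(False)

    def forward(self, left_hidden: torch.Tensor, right_hidden: torch.Tensor) -> torch.Tensor:
        B, _, d = right_hidden.shape
        q = self.right_q(right_hidden)
        k_l, v_l = self.left_kv(left_hidden).chunk(2, dim=-1)
        k_r, v_r = self.right_kv(right_hidden).chunk(2, dim=-1)
        # Joint attention: queries from the Right Brain, keys/values from both brains.
        k = torch.cat([k_l, k_r], dim=1)
        v = torch.cat([v_l, v_r], dim=1)

        def heads(x):  # (B, T, d) -> (B, n_heads, T, head_dim)
            return x.view(B, -1, self.n_heads, self.head_dim).transpose(1, 2)

        attn = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
        return self.out(attn.transpose(1, 2).reshape(B, -1, d))

# Toy usage: 32 Left-Brain tokens, 24 Right-Brain tokens, width 512.
layer = AsyMoTAttentionSketch()
out = layer(torch.randn(2, 32, 512), torch.randn(2, 24, 512))
print(out.shape)  # torch.Size([2, 24, 512])
```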

The study pioneered a flow-matching action expert conditioned on the specialized representations from the Right Brain to generate precise continuous controls. Researchers meticulously configured the system to leverage the strengths of both VLM pathways, creating a synergistic effect that surpasses conventional VLA models. Extensive comparative experiments were conducted on the SimplerEnv and RoboCasa benchmarks to rigorously evaluate the performance of TwinBrainVLA. The team demonstrated that their architecture achieves superior manipulation performance compared to state-of-the-art baselines, while simultaneously preserving the comprehensive visual understanding capabilities of the pre-trained VLM.
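A flow-matching action expert of this kind can be sketched as a small network that regresses a velocity field between Gaussian noise and the ground-truth action, conditioned on a Right-Brain embedding, and that integrates the learned flow at inference time. The network size, conditioning scheme, and action dimensionality below are illustrative assumptions, not the paper’s configuration.

```python
import torch
import torch.nn as nn

class FlowMatchingActionExpertSketch(nn.Module):
    """Illustrative flow-matching action head conditioned on a pooled
    Right-Brain feature. Sizes and conditioning are assumptions."""

    def __init__(self, action_dim: int = 7, cond_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def velocity(self, x_t, t, cond):
        # Predict the velocity field v(x_t, t | cond).
        return self.net(torch.cat([x_t, t, cond], dim=-1))

    def loss(self, action, cond):
        # Linear path between noise and the target action; the regression
        # target is the constant velocity (action - noise).
        noise = torch.randn_like(action)
        t = torch.rand(action.shape[0], 1)
        x_t = (1 - t) * noise + t * action
        return ((self.velocity(x_t, t, cond) - (action - noise)) ** 2).mean()

    @torch.no_grad()
    def sample(self, cond, steps: int = 10):
        # Euler integration of the learned flow from noise to an action.
        x = torch.randn(cond.shape[0], self.net[-1].out_features)
        for i in range(steps):
            t = torch.full((cond.shape[0], 1), i / steps)
            x = x + self.velocity(x, t, cond) / steps
        return x

# Toy usage with a pooled 512-d Right-Brain embedding and 7-DoF actions.
expert = FlowMatchingActionExpertSketch()
cond = torch.randn(4, 512)
print(expert.loss(torch.randn(4, 7), cond).item(), expert.sample(cond).shape)
```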

Furthermore, the work introduces an asymmetric parameter freezing strategy to facilitate the joint training of the dual models. This technique ensures that the Left Brain’s knowledge remains intact while the Right Brain adapts to the demands of robotic actuation. The innovative methodology enables the development of general-purpose robots capable of simultaneously achieving high-level semantic understanding and low-level physical dexterity, representing a significant step towards more versatile and intelligent robotic systems.
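In practice, an asymmetric freezing strategy of this sort amounts to excluding every Left-Brain parameter from the optimizer while the Right Brain and the action expert remain trainable. The minimal sketch below uses simple stand-in modules and an assumed learning rate; it is not the authors’ training recipe.

```python
import torch
import torch.nn as nn

# Stand-in modules; in the real system these would be the two VLMs and the action expert.
left_brain = nn.Linear(512, 512)     # frozen generalist VLM (stand-in)
right_brain = nn.Linear(512, 512)    # trainable embodied VLM (stand-in)
action_expert = nn.Linear(512, 7)    # trainable flow-matching head (stand-in)

# Freeze every Left-Brain parameter so its pre-trained knowledge stays intact...
for p in left_brain.parameters():
    p.requires_grad_(False)

# ...and hand only the Right Brain and the action expert to the optimizer.
trainable = [p for m in (right_brain, action_expert) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # assumed learning rate
print(sum(p.numel() for p in trainable), "trainable parameters")
```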

TwinBrainVLA boosts robotic control and understanding

Scientists have developed TwinBrainVLA, a novel architecture for robotic control that achieves superior manipulation performance while preserving comprehensive visual understanding. Experiments revealed that this system effectively coordinates a generalist vision-language model (VLM) with a specialist VLM dedicated to embodied proprioception, enabling joint robotic control. The team measured performance on the SimplerEnv benchmark, achieving an average success rate of 58.4% with Qwen2.5-VL-3B-Instruct and 62.0% with Qwen3-VL-4B-Instruct. These results demonstrate a significant advancement in robotic dexterity and semantic understanding.

The core of TwinBrainVLA lies in its asymmetric design, utilising a frozen “Left Brain” for robust visual reasoning and a trainable “Right Brain” for embodied perception via an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism. Data shows that the Right Brain dynamically queries semantic knowledge from the Left Brain and fuses it with proprioceptive states, generating precise continuous controls. Specifically, on the “put spoon on towel” task, TwinBrainVLA with Qwen2.5-VL-3B-Instruct achieved a success rate of 83.3%, while the same configuration scored 77.1% on the challenging “put eggplant in the yellow basket” task. These measurements confirm the system’s ability to handle complex manipulation scenarios.

Further tests on the RoboCasa GR1 Tabletop Benchmark showcased the system’s adaptability to diverse tasks. Scientists recorded an average success rate of 54.6% across 24 tabletop manipulation tasks, surpassing the performance of Isaac-GR00T-N1.6, which achieved 47.6%. Notably, TwinBrainVLA with Qwen3-VL-4B-Instruct attained a 74.0% success rate on the “PnP Bottle To Cabinet Close” task and a 72.0% success rate on “PnP Can To Drawer Close”, demonstrating proficiency in complex object interactions. The breakthrough delivers a promising direction for building general-purpose robots capable of both high-level reasoning and low-level physical dexterity.

The research team utilised the Open X-Embodiment (OXE) dataset, specifically the Bridge-V2 and Fractal subsets, for training. Measurements confirm that TwinBrainVLA maintains comprehensive visual understanding capabilities of the pre-trained VLM, even while learning fine-grained sensorimotor skills. The framework’s strong generalisability across different VLM families is evidenced by the competitive success rates achieved with both Qwen2.5-VL-3B-Instruct and Qwen3-VL-4B-Instruct. This work paves the way for robots that can seamlessly integrate semantic knowledge with physical actions, opening up new possibilities for human-robot collaboration and automation.

Asymmetric Dual-Stream Architecture for Robot Control enables robust manipulation

Scientists have developed TwinBrainVLA, a new framework addressing the challenge of simultaneously achieving semantic understanding and embodied skill learning in robotic systems. This architecture decouples visual cognition by employing an asymmetric dual-stream design, inspired by hemispheric lateralization in the brain, to structurally separate these processes. The core of TwinBrainVLA lies in its “Left Brain”, a frozen visual language model retaining general reasoning abilities, and its trainable “Right Brain”, specializing in embodied perception and control. An Asymmetric Mixture-of-Transformers (AsyMoT) mechanism enables the Right Brain to query the Left Brain for semantic knowledge, fusing it with proprioceptive data to generate precise robotic actions, resulting in superior manipulation performance on benchmarks like SimplerEnv and RoboCasa.

Specifically, the best performing model achieved a 54.6% success rate, exceeding existing methods such as Isaac-GR00T-N1.6, QwenGR00T, and QwenPI by margins of up to 10.7%. The researchers acknowledge limitations, including the current requirement for identical architectures in the Left and Right Brains, which restricts flexibility in model pairing. They are actively exploring generalized fusion mechanisms to accommodate heterogeneous models and are investigating the use of specialized embodied VLMs to initialize the Right Brain, which could enhance performance further. The team also plans to scale training to the complete Open X-Embodiment dataset and to extend evaluation to more benchmarks, including real-robot scenarios, to fully assess the versatility of TwinBrainVLA. Overall, the work demonstrates a promising direction for building general-purpose robots capable of both high-level reasoning and fine-grained physical dexterity.

👉 More information
🗞 TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers
🧠 ArXiv: https://arxiv.org/abs/2601.14133

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
