Scientists are developing increasingly sophisticated robotic systems capable of interacting with the physical world, yet accurately modelling contact remains a significant challenge. Carolina Higuera, Sergio Arnaud, and Byron Boots, alongside colleagues from the University of Washington and FAIR, Meta, present a new approach in their work on Visuo-Tactile World Models (VT-WM), which integrates both visual and tactile data to improve understanding of robot-object interactions. By reasoning about contact through touch, VT-WM overcomes limitations of vision-only models when faced with occlusions or uncertain contact states, demonstrably enhancing object permanence and adherence to physical laws during simulated rollouts. This research is significant because it not only improves the fidelity of robot ‘imagination’ but also translates to substantial gains in real-world robotic manipulation, achieving up to 35% higher success rates in complex, contact-rich tasks and showcasing adaptability to novel scenarios.
By integrating vision with tactile sensing, these models achieve a more comprehensive understanding of robot-object interactions, particularly in scenarios where visual information is limited or ambiguous.
The research addresses common issues in vision-only models, such as objects seemingly disappearing, teleporting, or exhibiting unrealistic movements. Trained across a range of contact-rich manipulation tasks, VT-WM demonstrates improved fidelity in predicting future states, achieving 33% better performance in maintaining object permanence and 29% better compliance with the laws of motion during simulated rollouts.
This advancement stems from grounding the model in contact dynamics, which also translates to enhanced planning capabilities. Zero-shot real-robot experiments reveal that VT-WM achieves up to 35% higher success rates, with the most significant improvements observed in complex, multi-step tasks requiring sustained contact.
The model’s ability to adapt to novel tasks with limited demonstrations highlights its versatility and potential for broader application. By combining visual and tactile information, VT-WM effectively addresses the limitations of relying solely on vision for robotic manipulation. The core innovation lies in the creation of a multi-task world model that simultaneously processes visual and tactile data.
Vision provides a global understanding of the robot’s environment and kinematics, while tactile sensing delivers crucial local information about physical contact. This fusion enables the model to accurately represent object permanence, even when objects are heavily occluded or visually ambiguous, as demonstrated in a cube stacking task.
The model’s ability to disambiguate visually similar states, using tactile feedback to determine actual contact and resulting motion, is a key feature of this work. Furthermore, the study demonstrates a clear link between improved imagination quality and real-world performance. The tactile grounding prevents hallucinations common in vision-only models, such as objects disappearing or moving without applied forces.
This enhanced fidelity in prediction directly translates to more reliable zero-shot planning, particularly in tasks like pushing, wiping, and placing, where consistent hand-object interaction is critical for success. The research introduces a novel approach to world modeling, leveraging the complementary strengths of vision and touch to create a more robust and versatile system for robotic manipulation.
Visuo-tactile model training and evaluation using contact-rich manipulation tasks
A multi-task visuo-tactile world model, VT-WM, was developed to capture the physics of contact through touch reasoning. The research focused on improving robot-object interaction understanding, particularly in scenarios with occlusion or ambiguous contact. Training proceeded across a suite of contact-rich manipulation tasks, including placing fruits, pushing fruits, wiping with cloth, stacking cubes, and scribbling with a marker.
Performance was evaluated by assessing the model’s ability to maintain object permanence and adhere to the laws of motion during autoregressive rollouts. To quantify fidelity in imagination, the study computed a sampling loss, L_sampling, defined as the sum of L1 norms between sampled and ground-truth states across a horizon H, typically ranging from 3 to 5.
Predicted states were generated without gradient updates to ensure training stability. This sampling loss was combined with a teacher loss, L_teacher, using equal weighting to form the final loss function, L. The action-conditioned predictor facilitated the implementation of a Cross-Entropy Method (CEM) for planning.
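To make this objective concrete, here is a minimal PyTorch-style sketch of how a teacher-forced term and an autoregressive sampling term might be combined with equal weighting. The predictor interface, tensor shapes, and the precise way gradients are stopped during the sampled rollout are assumptions for illustration, not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(predictor, s0, t0, actions, gt_states, gt_touch):
    """Hedged sketch: a teacher-forced one-step term plus an L1 'sampling'
    term over an autoregressive rollout of horizon H = len(actions).
    Shapes and the predictor signature are assumptions."""
    H = actions.shape[0]

    # Teacher loss: predict each next step from the ground-truth current step.
    l_teacher = 0.0
    s, t = s0, t0
    for k in range(H):
        s_pred, t_pred = predictor(s, t, actions[k])
        l_teacher = l_teacher + F.l1_loss(s_pred, gt_states[k]) + F.l1_loss(t_pred, gt_touch[k])
        s, t = gt_states[k], gt_touch[k]            # teacher forcing

    # Sampling loss: roll out from the model's own (detached) predictions and
    # compare each sampled state to ground truth with an L1 norm. Detaching the
    # fed-back states is one way to read "generated without gradient updates".
    l_sampling = 0.0
    s, t = s0, t0
    for k in range(H):
        s_pred, t_pred = predictor(s, t, actions[k])
        l_sampling = l_sampling + F.l1_loss(s_pred, gt_states[k]) + F.l1_loss(t_pred, gt_touch[k])
        s, t = s_pred.detach(), t_pred.detach()     # feed predictions back

    # Equal weighting of the two terms, as described in the text.
    return l_teacher + l_sampling
```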
At each planning step, the CEM sampled a population of N action sequences over a horizon H, generating future latent states using the predictor. A cost function, based on energy minimization with respect to a goal image, assigned a score to each trajectory, often utilising the L2 distance between the final predicted visual latent state and the goal image’s latent representation.
The top-performing sequences were selected, the sampling distribution was updated, and this process iterated until convergence, with the best sequence then executed on a real robot in an open-loop manner. Experiments compared VT-WM rollouts to those from a vision-only world model, V-WM, using action sequences derived from real-world demonstrations. Object permanence was assessed by tracking keypoints with CoTracker and computing the normalized Fréchet distance between ground-truth and imagined visual trajectories.
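A simplified sketch of such a CEM planning loop follows. The population size, number of elites, iteration count, and Gaussian sampling scheme are illustrative assumptions, and the hypothetical `rollout_fn` stands in for rolling the learned predictor forward to a final visual latent.

```python
import numpy as np

def cem_plan(rollout_fn, goal_latent, action_dim, horizon=5,
             population=64, elites=8, iterations=5):
    """Hedged sketch of Cross-Entropy Method planning over action sequences.
    `rollout_fn(actions)` is assumed to return the final predicted visual
    latent for a (horizon, action_dim) action sequence; `goal_latent` is the
    encoded goal image. Hyperparameters are illustrative, not the paper's."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))

    for _ in range(iterations):
        # Sample a population of candidate action sequences.
        samples = mean + std * np.random.randn(population, horizon, action_dim)

        # Score each trajectory by the L2 distance between the final predicted
        # visual latent and the goal image's latent representation.
        costs = np.array([np.linalg.norm(rollout_fn(a) - goal_latent)
                          for a in samples])

        # Keep the lowest-cost (elite) sequences and refit the distribution.
        elite = samples[np.argsort(costs)[:elites]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6

    return mean  # best action sequence, executed open-loop on the robot
```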
Enhanced object permanence and physical fidelity via visuo-tactile prediction
Trained across contact-rich manipulation tasks, the multi-task Visuo-Tactile World Model (VT-WM) demonstrated a 33% improvement in maintaining object permanence during imagination-based rollouts. Compliance with the laws of motion also increased by 29% in these autoregressive rollouts, indicating enhanced fidelity in predicted object behaviour.
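The object-permanence figure above comes from the keypoint-tracking evaluation described in the previous section, which compares ground-truth and imagined trajectories via a normalized Fréchet distance. A minimal sketch of such a metric is shown below; the trajectory format and the normalization constant (here the image diagonal) are assumptions, as the paper’s exact normalization is not restated in this summary.

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two 2D trajectories P, Q of shape (T, 2).
    A simple memoized recursion; fine for short rollout horizons."""
    n, m = len(P), len(Q)
    ca = np.full((n, m), -1.0)

    def c(i, j):
        if ca[i, j] >= 0:
            return ca[i, j]
        d = np.linalg.norm(P[i] - Q[j])
        if i == 0 and j == 0:
            ca[i, j] = d
        elif i == 0:
            ca[i, j] = max(c(0, j - 1), d)
        elif j == 0:
            ca[i, j] = max(c(i - 1, 0), d)
        else:
            ca[i, j] = max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
        return ca[i, j]

    return c(n - 1, m - 1)

def object_permanence_score(gt_track, imagined_track, image_diag):
    """Hedged sketch: normalize the Fréchet distance between a ground-truth
    keypoint track and the imagined track by the image diagonal (assumed)."""
    return discrete_frechet(gt_track, imagined_track) / image_diag
```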
This work introduces a system that captures the nuances of contact through touch reasoning, complementing vision to better understand robot-object interactions. The research leveraged tactile embeddings obtained from Digit 360 sensors and RGB embeddings from the Cosmos encoder. The model architecture comprises a vision encoder, a tactile encoder, and an autoregressive predictor, designed to fuse exocentric vision with tactile sensing for consistent future state generation.
Specifically, the predictor estimates the next-step states (s_{k+1}, t_{k+1}) from the current states (s_k, t_k) and action a_k, utilising a 12-layer transformer with alternating attention mechanisms. In zero-shot real-robot experiments, VT-WM achieved success rates up to 35% higher than baseline models, with the most substantial gains observed in multi-step, contact-rich tasks, highlighting the model’s effectiveness in complex scenarios.
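As a rough illustration of the predictor interface described above, here is a minimal PyTorch-style sketch of an action-conditioned transformer that maps (s_k, t_k, a_k) to estimates of (s_{k+1}, t_{k+1}). The embedding dimensions, per-modality token layout, and the exact form of the alternating attention are assumptions; the real model fuses many more tokens per frame.

```python
import torch
import torch.nn as nn

class VisuoTactilePredictor(nn.Module):
    """Hedged sketch of an action-conditioned next-state predictor that fuses
    visual latents s_k, tactile latents t_k, and action a_k with a 12-layer
    transformer. The alternating attention pattern (e.g., within-modality vs.
    cross-modality) is only gestured at by the shared backbone here."""
    def __init__(self, vis_dim=1024, tac_dim=256, act_dim=7,
                 d_model=512, n_layers=12, n_heads=8):
        super().__init__()
        self.vis_in = nn.Linear(vis_dim, d_model)
        self.tac_in = nn.Linear(tac_dim, d_model)
        self.act_in = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.vis_out = nn.Linear(d_model, vis_dim)   # predicts s_{k+1}
        self.tac_out = nn.Linear(d_model, tac_dim)   # predicts t_{k+1}

    def forward(self, s_k, t_k, a_k):
        # One token per modality for simplicity; stack into a short sequence.
        tokens = torch.stack(
            [self.vis_in(s_k), self.tac_in(t_k), self.act_in(a_k)], dim=1)
        h = self.backbone(tokens)
        return self.vis_out(h[:, 0]), self.tac_out(h[:, 1])
```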
Tactile images were streamed at 30-60 frames per second, providing rich information about contact area features, including force, shape, and texture. Furthermore, the study demonstrated the versatility of VT-WM, successfully adapting learned contact dynamics to a novel task with only a limited set of demonstrations.
The predictor was framed as a supervised next-state estimation problem, using ground-truth future latents as targets, with each modality encoded by a pretrained network: the Cosmos tokenizer for vision and Sparsh-X for touch. The resulting world model captures the nuances of contact through touch reasoning, addressing the limitations of vision-only systems, which can struggle with occlusions or ambiguous contact and consequently produce unrealistic simulations of object behaviour.
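A brief sketch of how frozen pretrained encoders could supply both the inputs and the ground-truth future latents for this supervised setup is given below. The encoder call signatures and batch layout are assumptions, with the hypothetical `vision_encoder` and `tactile_encoder` standing in for the Cosmos tokenizer and Sparsh-X.

```python
import torch
import torch.nn.functional as F

def training_step(predictor, vision_encoder, tactile_encoder, optimizer, batch):
    """Hedged sketch of one supervised next-state training step. `batch` is
    assumed to hold consecutive RGB frames, tactile frames, and the action
    taken between them; both pretrained encoders stay frozen."""
    with torch.no_grad():                              # encoders are not trained
        s_k  = vision_encoder(batch["rgb_k"])
        s_k1 = vision_encoder(batch["rgb_k1"])         # ground-truth future visual latent
        t_k  = tactile_encoder(batch["tactile_k"])
        t_k1 = tactile_encoder(batch["tactile_k1"])    # ground-truth future tactile latent

    s_pred, t_pred = predictor(s_k, t_k, batch["action_k"])
    loss = F.l1_loss(s_pred, s_k1) + F.l1_loss(t_pred, t_k1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```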
VT-WM achieves enhanced fidelity in predicting object movement and maintaining object permanence during simulated manipulation tasks. Experiments demonstrate that grounding imagination in contact dynamics also improves robot planning capabilities. In real-robot trials, VT-WM achieved up to 35% higher success rates compared to vision-only models, particularly in complex, multi-step tasks involving significant physical contact.
Furthermore, the model exhibits versatility by adapting to new tasks with limited demonstration data, suggesting efficient learning and transfer of prior physical understanding. The authors acknowledge that the primary failure mode observed involved minor precision errors in task execution, rather than fundamental misunderstandings of spatial requirements.
Future research could focus on refining precision and addressing these minor inaccuracies to further enhance performance. This work establishes a strong foundation for developing more robust and adaptable robot manipulation systems grounded in a comprehensive understanding of physical contact, paving the way for more reliable performance in real-world scenarios.
👉 More information
🗞 Visuo-Tactile World Models
🧠 ArXiv: https://arxiv.org/abs/2602.06001
