Predicting the physical properties of objects remains a fundamental challenge in computer vision, because understanding how materials behave requires analysing motion over time rather than a single static appearance. Guanqi Zhan, Xianzheng Ma, and Weidi Xie, alongside Andrew Zisserman and colleagues, now demonstrate a method for inferring dynamic physical properties, such as elasticity, viscosity, and friction, directly from video footage. The team created new video datasets encompassing both synthetic and real-world scenarios, and then explored how pre-trained video foundation models can estimate these properties. Their results reveal that these models, trained to generate or understand video content, achieve surprisingly accurate predictions that approach the performance of methods using explicitly engineered visual cues, and they suggest promising avenues for improving multi-modal models through careful prompt design.
The orange sphere in the video exhibits far greater elasticity. It deforms slightly upon impact but quickly restores its original shape, converting stored elastic energy back into kinetic energy for a noticeable rebound. The sphere continues to bounce multiple times, with each bounce decreasing in height, characteristic of a largely elastic collision with some energy loss. In contrast, the yellow sphere falls, impacts the ground, and deforms dramatically, flattening against the surface. It eventually returns to its original shape, but rebounds only slightly.
Therefore, the assessment indicates a clear difference in elasticity. Comparison result: 0.8, aligning with established scoring patterns where a clear difference in elasticity is noted with a score around 0.8.
PhysVid Dataset For Dynamic Property Assessment
This study pioneers a new methodology for assessing physical understanding in video foundation models, focusing on dynamic properties like elasticity, viscosity, and dynamic friction. To enable this research, scientists created PhysVid, a novel dataset comprising both synthetic and real-world videos, each meticulously annotated with ground-truth physical property values. Synthetic videos were generated using a physics simulator, while real-world examples were sourced from existing internet resources and captured in-house, ensuring a diverse and representative collection of visual data. The researchers also implemented an oracle method for each physical property: it is granted access to visual cues directly indicative of the property being measured, thereby defining an upper limit on what can be inferred from visual input alone.
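As a concrete illustration of what such an oracle might look like for elasticity, the sketch below estimates a coefficient of restitution from consecutive bounce apex heights. The function name and the assumption that apex heights are available from a tracked trajectory are ours for illustration; this is not necessarily the paper's exact procedure.

```python
import numpy as np

def restitution_from_heights(apex_heights):
    """Estimate a coefficient of restitution from successive bounce apex heights.

    For an ideal bounce, e = sqrt(h_next / h_prev); averaging over several
    bounces smooths out tracking noise. `apex_heights` is a list of peak
    heights (in any consistent unit) read off the object's trajectory.
    """
    heights = np.asarray(apex_heights, dtype=float)
    if len(heights) < 2:
        raise ValueError("need at least two bounce apexes")
    ratios = heights[1:] / heights[:-1]
    return float(np.sqrt(ratios).mean())

# Example: apex heights from a tracked bouncing object (illustrative values).
print(restitution_from_heights([1.00, 0.64, 0.41]))  # ~0.8, i.e. fairly elastic
```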
The core of the methodology involves evaluating three categories of video foundation models: generative models, self-supervised models, and multi-modal large language models (MLLMs). For generative and self-supervised models, scientists developed a lightweight readout mechanism that extracts dynamic physical properties from pre-trained, frozen representations, using a learnable query vector and cross-attention to selectively extract relevant information. For MLLMs, the team explored prompting strategies, including few-shot and procedural prompting to guide the model through oracle estimation steps, encouraging focus on intrinsic visual cues. This innovative approach allows for a comparative analysis of how effectively each model category captures and infers dynamic physical properties from video, moving beyond static appearance understanding.
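To make the readout idea concrete, here is a minimal sketch of an attention-based probe over frozen video features: a learnable query cross-attends to the backbone's tokens and a linear head regresses the property value. The module name, dimensions, and example inputs are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AttentiveReadout(nn.Module):
    """Lightweight probe: a learnable query cross-attends to frozen video
    features and regresses a scalar physical property (e.g. elasticity)."""

    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))  # learnable query vector
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, 1)  # scalar property prediction

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_tokens, feat_dim) from a frozen video foundation model
        q = self.query.expand(features.size(0), -1, -1)
        pooled, _ = self.attn(q, features, features)  # cross-attention pooling
        return self.head(pooled.squeeze(1)).squeeze(-1)

# Usage: only the readout is trained; the video backbone stays frozen.
readout = AttentiveReadout(feat_dim=768)
frozen_feats = torch.randn(2, 196, 768)  # placeholder for backbone outputs
pred = readout(frozen_feats)             # shape: (2,)
```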
Inferring Material Properties Directly From Video
This work presents a breakthrough in predicting dynamic physical properties directly from video footage, focusing on elasticity, viscosity, and dynamic friction. Researchers developed a novel approach to infer these properties, creating new video datasets with both synthetic and real-world examples for training and evaluation. Experiments demonstrate that both generative and self-supervised video models achieve strong performance on synthetic datasets, closely mirroring the accuracy of the oracle method, and exhibit good generalization to real-world scenarios for viscosity and elasticity. For relative value comparison tasks, the oracle estimator achieves near-perfect accuracy, confirming the inherent solvability of the task using visual cues and physics principles.
Results reveal that the generative and self-supervised models achieve a Pearson correlation coefficient of approximately 0.82 to 0.83 for elasticity prediction, demonstrating a strong ability to estimate this property from video. Friction proved more challenging: the study identified that reliance on visual reference cues that are absent in real-world videos hinders generalization. The team also found that MLLMs can be improved through strategic prompting and the provision of additional context.
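For reference, the correlation metric quoted above can be computed with a standard library call; the snippet below is a generic evaluation example with made-up numbers, not code or data from the paper.

```python
from scipy.stats import pearsonr

# Predicted vs. ground-truth property values (illustrative numbers only).
predicted    = [0.31, 0.55, 0.72, 0.48, 0.90]
ground_truth = [0.30, 0.60, 0.70, 0.40, 0.95]

r, p_value = pearsonr(predicted, ground_truth)
print(f"Pearson correlation: {r:.2f} (p={p_value:.3f})")
```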
Benchmarking Physical Understanding And Future Directions
This research investigates the challenging task of inferring dynamic physical properties (elasticity, viscosity, and friction) directly from video footage. To facilitate this work, the team constructed a new benchmark dataset containing videos with ground-truth annotations for these properties, encompassing both synthetic and real-world scenarios. They then evaluated a range of existing video foundation models, assessing their ability to predict these properties using different approaches, including direct readout and prompting techniques. Future research should focus on improving the ability of these models to predict absolute values accurately and to generalise better from synthetic to real-world data. This work represents a significant step towards machines that can understand and interpret the physical world as it appears in video, with potential applications in robotics, autonomous systems, and computer vision.
👉 More information
🗞 Inferring Dynamic Physical Properties from Video Foundation Models
🧠 ArXiv: https://arxiv.org/abs/2510.02311
