VideoMaMa Achieves Accurate Video Matting with Dataset of 50K Real-World Videos

Researchers are tackling the persistent problem of accurately isolating foreground objects in real-world videos, a task hampered by the limited availability of labelled training data. Sangbeom Lim of Korea University, together with Seoung Wug Oh and Jiahui Huang of Adobe Research and their co-authors, introduces VideoMaMa, a novel approach that transforms coarse masks into precise alpha mattes using generative priors learnt from pre-trained video models. The work is significant because VideoMaMa achieves robust performance on unseen footage despite training only on synthetic data, and crucially, the team has built upon this to create MA-V, a large-scale dataset of over 50,000 real-world videos with high-quality matting annotations. By fine-tuning SAM2 on MA-V to create SAM2-Matte, they demonstrate a substantial improvement on challenging in-the-wild videos, highlighting the power of pseudo-labelling and accessible segmentation cues for advancing video matting research.

Synthetic Training Enables Zero-Shot Video Matting

Building upon VideoMaMa’s capabilities, the team developed a scalable pseudo-labelling pipeline, constructing the Matting Anything in Video (MA-V) dataset, a substantial resource offering high-quality matting annotations for over 50,000 real-world videos. These videos encompass a wide range of scenes and motions, significantly expanding the diversity of available training data. To validate the effectiveness of MA-V, the researchers fine-tuned the SAM2 model, resulting in SAM2-Matte, which demonstrably outperforms counterparts trained on existing matting datasets in terms of robustness on challenging, in-the-wild videos. This performance boost underscores the value of large-scale, pseudo-labelled video matting data and highlights the power of combining generative priors with accessible segmentation cues to accelerate progress in video matting research.
The research establishes a novel two-stage training strategy for VideoMaMa, optimising both spatial and temporal layers while injecting semantic knowledge via DINOv3 features. Experiments utilising binary masks from various sources confirm VideoMaMa’s consistent generation of high-quality video matting outputs, regardless of input mask type, showcasing its robust adaptability. Furthermore, the MA-V dataset, built by converting segmentation labels from the SA-V dataset, provides a valuable resource for the community, offering a substantial increase in the scale and realism of available video matting data. The work opens exciting possibilities for applications in video editing, background replacement, visual composition, and relighting, paving the way for more sophisticated and realistic video manipulation techniques.
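To make the two-stage idea concrete, here is a minimal PyTorch sketch of how spatial and temporal parameters might be optimised in separate stages. It assumes the video diffusion U-Net marks temporal blocks with a “temporal” substring in their parameter names, a common convention; the split, the optimiser, and the learning rate are illustrative assumptions, not the authors’ released training code.

```python
import torch

def stage_parameters(unet: torch.nn.Module, stage: str):
    """Freeze or unfreeze U-Net parameters for one training stage.

    Assumes temporal layers can be identified by a 'temporal' substring
    in their parameter names, as is common in video diffusion U-Nets.
    """
    trainable = []
    for name, param in unet.named_parameters():
        is_temporal = "temporal" in name
        if stage == "spatial":
            param.requires_grad = not is_temporal   # stage 1: spatial layers only
        else:
            param.requires_grad = is_temporal       # stage 2: temporal layers only
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Stage 1: adapt spatial layers to the mask-to-matte task.
# optimizer = torch.optim.AdamW(stage_parameters(unet, "spatial"), lr=1e-5)
# Stage 2: train temporal layers for motion coherence across frames.
# optimizer = torch.optim.AdamW(stage_parameters(unet, "temporal"), lr=1e-5)
```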

VideoMaMa and the MA-V Dataset Creation

Scientists tackled the challenge of generalizing video matting to real-world footage by introducing Video Mask-to-Matte (VideoMaMa). This innovative model converts coarse segmentation masks into pixel-accurate alpha mattes, leveraging pre-trained video diffusion models to achieve strong zero-shot generalization, despite being trained solely on synthetic data. The research team engineered a scalable pseudo-labeling pipeline, constructing the Matting Anything in Video (MA-V) dataset, a resource containing high-quality matting annotations for over 50,000 real-world videos exhibiting diverse scenes and motions. To validate the MA-V dataset’s effectiveness, researchers fine-tuned the SAM2 model, creating SAM2-Matte.

This fine-tuned model demonstrably outperforms the same SAM2 model trained on existing matting datasets, as well as other existing video matting methods, in terms of robustness on in-the-wild videos. Experiments employed the diffusion-based VideoMaMa model, which generates realistic video matting annotations from binary masks, enabling scalable label creation from readily available segmentation labels. The system delivers a crucial advantage by bypassing the need for manual annotation, a traditionally time-consuming and expensive process. The study pioneered an approach to video matting that harnesses the generative priors of pre-trained video diffusion models to train a robust pseudo-labeler from limited synthetic video matting annotations.
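As a rough illustration of how matte pseudo-labels can supervise a promptable segmentation model, the generic fine-tuning loop below trains against soft alpha targets. The `model` interface, the L1 loss, and the optimiser settings are assumptions for the sketch; the paper’s actual SAM2-Matte training recipe is not reproduced here.

```python
import torch

def finetune_on_pseudo_labels(model, dataloader, epochs: int = 1, lr: float = 1e-5):
    """Generic supervised fine-tuning against matte pseudo-labels.

    dataloader yields (frames, prompt_masks, pseudo_mattes) batches:
        frames        (B, T, 3, H, W) RGB clips,
        prompt_masks  (B, T, 1, H, W) coarse binary guidance,
        pseudo_mattes (B, T, 1, H, W) soft alpha targets in [0, 1].
    model(frames, prompt_masks) is assumed to return predicted mattes
    with the same shape as the targets.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for frames, prompt_masks, pseudo_mattes in dataloader:
            pred = model(frames, prompt_masks)                     # predicted alpha mattes
            loss = torch.nn.functional.l1_loss(pred, pseudo_mattes)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```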

Specifically, the team formulated video matting mathematically as the alpha compositing equation I = αF + (1 − α)B, where I represents the observed image, F the foreground, B the background, and α the alpha matte defining pixel-level opacity. VideoMaMa’s architecture incorporates a latent encoder and decoder, processing RGB frames and guide masks through video diffusion U-Net layers to generate high-quality mattes; semantic injection with DINOv3 features further refines the training process. Furthermore, the researchers implemented spatial and temporal layers within the latent encoder to capture motion dynamics and temporal coherence. This design allows the model to accurately represent fine-grained details, such as hair strands and motion blur, which are critical for realistic video compositing. The resulting SAM2-Matte model, trained without architectural modifications, achieved substantially more robust performance than SAM2 models trained on existing datasets, demonstrating the power of large-scale pseudo-annotations to drive advancements in video matting research.
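To make the compositing relation concrete, the following NumPy sketch blends a foreground over a new background with a per-pixel alpha matte; the synthetic soft-edged subject and the array shapes are purely illustrative.

```python
import numpy as np

def composite(alpha: np.ndarray, foreground: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Blend foreground over background with a per-pixel alpha matte.

    alpha:      (H, W) float array in [0, 1], pixel-level opacity.
    foreground: (H, W, 3) float array, the isolated subject.
    background: (H, W, 3) float array, the replacement backdrop.
    """
    a = alpha[..., None]                          # broadcast to (H, W, 1)
    return a * foreground + (1.0 - a) * background

# Example: composite a soft-edged disc onto a new background.
H, W = 256, 256
yy, xx = np.mgrid[0:H, 0:W]
dist = np.hypot(yy - H / 2, xx - W / 2)
alpha = np.clip((80 - dist) / 20, 0.0, 1.0)       # soft falloff mimics hair/blur edges
fg = np.ones((H, W, 3)) * [0.9, 0.2, 0.2]         # red subject
bg = np.ones((H, W, 3)) * [0.1, 0.5, 0.1]         # green backdrop
frame = composite(alpha, fg, bg)                  # I = αF + (1 − α)B per pixel
```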

VideoMaMa Generates Mattes and a New Dataset

Scientists have developed Video Mask-to-Matte (VideoMaMa), a diffusion-based model capable of converting coarse segmentation masks into pixel-accurate alpha mattes. The research addresses the scarcity of labelled data for real-world video matting by leveraging pre-trained video diffusion models and demonstrates strong zero-shot generalisation to real-world footage despite being trained solely on synthetic data. Experiments revealed that VideoMaMa consistently generates high-quality video matting outputs regardless of the input mask type, proving its robustness and potential as a pseudo-labeler. Building upon this capability, the team constructed the Matting Anything in Video (MA-V) dataset, comprising high-quality matting annotations for over 50,000 real-world videos.

This dataset spans diverse scenes and motions, offering a significant leap forward from existing datasets, which rely primarily on synthetic or composited content; unlike those, MA-V provides annotations for real captured footage. According to the authors, MA-V is the first large-scale pseudo-labelled video matting dataset, efficiently constructed by converting segmentation labels from the SA-V dataset. To validate the effectiveness of MA-V, researchers fine-tuned the SAM2 model, resulting in SAM2-Matte. Tests show that SAM2-Matte substantially outperforms the same SAM2 model trained on existing video matting datasets, as well as other current video matting methods, on in-the-wild videos.
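As a sketch of how such a mask-to-matte conversion pipeline could be scripted, the snippet below iterates over clips and saves matte pseudo-labels. Both `clips` and `video_mama` are hypothetical stand-ins for the SA-V data loader and the mask-to-matte model, neither of which is specified in this article.

```python
from pathlib import Path
import numpy as np

def pseudo_label_dataset(clips, video_mama, out_dir: Path) -> None:
    """Turn coarse segmentation masks into matte pseudo-labels.

    clips:      iterable of (name, frames, masks) tuples, where frames is
                (T, H, W, 3) RGB and masks is (T, H, W) binary guidance.
    video_mama: callable mapping (frames, masks) -> (T, H, W) float mattes;
                a hypothetical wrapper around the mask-to-matte model.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, frames, masks in clips:
        mattes = video_mama(frames, masks)              # coarse mask -> alpha matte
        mattes = np.clip(mattes, 0.0, 1.0)              # keep valid opacity range
        np.save(out_dir / f"{name}_matte.npy", mattes)  # store the pseudo-label
```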

This performance improvement highlights the potential of large-scale pseudo-annotations to drive advancements in video matting research. The two-stage training strategy optimises both spatial and temporal layers, alongside semantic knowledge injection via DINOv3 features, enhancing the model’s ability to handle complex video sequences. The results demonstrate that the scalable pipeline can efficiently construct large-scale, high-quality video matting datasets by leveraging generative priors and easily obtainable segmentation labels. The work delivers a new approach to video matting, moving beyond domain-specific limitations and paving the way for more robust and generalisable solutions.

VideoMaMa and MA-V Achieve State-of-the-Art Matting Performance

Scientists have developed VideoMaMa, a diffusion-based model designed to perform robust video matting by leveraging generative priors. This research addresses the challenge of generalizing video matting to real-world footage, which is often hampered by a lack of labelled data. By exploiting VideoMaMa’s strong generalization capability, the researchers constructed the Matting Anything in Video (MA-V) dataset, a large-scale, pseudo-labelled video matting resource built from real-world videos and segmentation masks from SA-V. The findings demonstrate that both VideoMaMa and models trained on MA-V, such as SAM2-Matte, achieve state-of-the-art performance on diverse video matting benchmarks.

This validates the effectiveness of their data generation approach and highlights the importance of large-scale, pseudo-labelled datasets for advancing video matting research. The authors acknowledge that VideoMaMa can experience failures when relying on semantically inaccurate instance masks, representing a limitation of the current system. Future work could focus on improving the accuracy of initial segmentation masks to further enhance the robustness of the model.

👉 More information
🗞 VideoMaMa: Mask-Guided Video Matting via Generative Prior
🧠 ArXiv: https://arxiv.org/abs/2601.14255

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
