Just-Dub-It Achieves Video Dubbing Via Lightweight LoRA and Multilingual Synthesis

Researchers are tackling the challenge of realistic video dubbing with a novel approach that moves beyond complex, task-specific pipelines. Anthony Chen of Tel Aviv University and Lightricks, together with Naomi Ken Korem, Tavi Halperin, and colleagues at Lightricks, present JUST-DUB-IT, a single framework that adapts a foundational audio-visual diffusion model for video-to-video dubbing using a lightweight LoRA. The work is significant because it leverages a pretrained audio-visual model to generate translated audio and synchronised facial movements together, improving visual fidelity, lip synchronisation, and robustness even in videos with complex motion and real-world conditions.

This single-model approach bypasses the complexity of traditional multi-stage dubbing pipelines, which often struggle with real-world video dynamics. The study establishes that holistic modelling captures the correlations between speech, facial movements, and scene dynamics, avoiding the limitations of editing audio and visuals separately. The lightweight LoRA adaptation keeps the approach simple, flexible, and practical to implement.
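To make the adaptation concrete, here is a minimal sketch of attaching a low-rank adapter to a frozen linear layer, in the spirit of LoRA. The rank, scaling factor, and layer shapes below are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update: W x + (alpha/r) B A x."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the foundation weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a no-op adaptation
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: adapt a single projection standing in for a pretrained weight.
proj = nn.Linear(1024, 1024)
adapted = LoRALinear(proj, rank=16)
out = adapted(torch.randn(2, 77, 1024))      # only the A/B matrices are trainable
```

Because the base weights stay frozen and the B matrix starts at zero, training begins from the pretrained model's behaviour and updates only a small fraction of the parameters.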

This work opens new possibilities for seamless video localisation and accessibility, offering a significant advance over current dubbing technologies. Generating translated speech and the corresponding lip movements jointly, while preserving the original visual context, addresses a critical challenge in video processing and improves the viewing experience. The research also tackles the problem of keeping environmental sounds in sync with the newly generated speech, a common failure of modular dubbing systems: because the entire audio-visual stream is modelled as one generative task, the model naturally adjusts the timing of background sounds to stay coherent with the translated dialogue. This holistic approach avoids the auditory-visual misalignments that often plague traditional methods, yielding a more immersive and natural result. The project webpage, at https://justdubit.github.io, showcases the technology's capabilities.

LoRA training via synthesised multilingual video pairs offers effective supervision

The study pioneered a method for generating training data by creating multilingual videos with consistent speaker identity and voice characteristics across languages. The researchers then split these clips and used an audio-video inpainting framework so that both the visual and the audio context could be leveraged, producing aligned bilingual video pairs for effective training supervision. This approach circumvents the need for existing paired training data, which rarely preserves speaker identity and the original visual content across language changes. Experiments adapted a diffusion model to perform dubbing directly, allowing audio and visual cues to be generated together and to inform each other through cross-modality attention layers.
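A rough pseudocode sketch of this data recipe is given below. The function generate_multilingual_pair and the half-way split point are hypothetical placeholders for the paper's actual components; the sketch only shows the flow from synthesised bilingual clips to inpainting-style training examples.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    """A short audio-visual segment (lists stand in for real video/audio tensors)."""
    video: list
    audio: list
    language: str

def generate_multilingual_pair(script: str, speaker_id: str) -> tuple[Clip, Clip]:
    """Hypothetical: synthesise the same content in two languages with a
    consistent speaker identity, using the pretrained audio-visual model."""
    ...

def build_training_example(src: Clip, tgt: Clip) -> dict:
    """Split the clips and frame the target-language half as an audio-video
    inpainting continuation, so visual and audio context both condition it."""
    mid = len(src.video) // 2
    return {
        "context_video": src.video[:mid],   # shared source-language prefix
        "context_audio": src.audio[:mid],
        "target_video": tgt.video[mid:],    # continuation in the target language
        "target_audio": tgt.audio[mid:],
        "target_language": tgt.language,
    }

# Toy usage with dummy frames/samples in place of generated clips.
src = Clip(video=list(range(8)), audio=list(range(8)), language="en")
tgt = Clip(video=list(range(8)), audio=list(range(8)), language="fr")
example = build_training_example(src, tgt)
```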

By treating the entire audio-visual stream as a single generative task, the system captures correlations between speech, facial motion, and scene dynamics, avoiding the assumptions and failure modes of modular pipelines. Importantly, the adaptation required only a small LoRA, underscoring the simplicity, flexibility, and robustness of the approach. Results demonstrate that the method consistently produces high-quality dubbed videos, preserving facial and voice identity while maintaining accurate lip synchronisation and temporal-semantic coherence within the scene.
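To illustrate how cross-modality attention can let the two streams inform each other, here is a minimal sketch of joint self-attention over concatenated video and audio tokens. The dimensions, head count, and fusion scheme are assumptions made for illustration and do not reflect the actual model architecture.

```python
import torch
import torch.nn as nn

class JointAVAttention(nn.Module):
    """Self-attention over concatenated video and audio tokens, so each
    modality can attend to the other (illustrative, not the paper's design)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        x = torch.cat([video_tokens, audio_tokens], dim=1)  # (B, Tv + Ta, D)
        fused, _ = self.attn(x, x, x)                       # joint attention pass
        n_video = video_tokens.shape[1]
        return fused[:, :n_video], fused[:, n_video:]       # split back per stream

video = torch.randn(1, 128, 512)   # e.g. spatiotemporal video tokens
audio = torch.randn(1, 32, 512)    # e.g. 1D audio tokens
v_out, a_out = JointAVAttention()(video, audio)
```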

LoRA adapts foundation models for video dubbing

These results underscore the benefit of leveraging a strong joint audio-video generative prior for dubbing. The study builds on recent advances in audio-visual generative models such as LTX-2, a Diffusion Transformer that processes video and audio as a unified signal. LTX-2 employs an Asymmetric Dual-Stream Diffusion Transformer, compressing video frames into 3D spatiotemporal tokens and audio into 1D tokens, with capacity allocated differently to each stream to manage their differing information densities.
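To make the asymmetric token layout concrete, here is a toy sketch of the compression described above. The patch sizes, channel widths, and compression ratios are invented for illustration and do not reflect LTX-2's real hyperparameters.

```python
import torch
import torch.nn as nn

class ToyAVTokenizer(nn.Module):
    """Compress video into 3D spatiotemporal tokens and audio into 1D tokens,
    giving more capacity to the denser video stream (illustrative only)."""

    def __init__(self, video_dim: int = 768, audio_dim: int = 256):
        super().__init__()
        # Video: a 3D conv patchifier over (time, height, width).
        self.video_patch = nn.Conv3d(3, video_dim,
                                     kernel_size=(4, 16, 16),
                                     stride=(4, 16, 16))
        # Audio: a 1D conv patchifier over the sample axis.
        self.audio_patch = nn.Conv1d(1, audio_dim, kernel_size=320, stride=320)

    def forward(self, video, audio):
        v = self.video_patch(video)                  # (B, Dv, T', H', W')
        v = v.flatten(2).transpose(1, 2)             # (B, T'*H'*W', Dv) 3D tokens
        a = self.audio_patch(audio).transpose(1, 2)  # (B, Ta, Da) 1D tokens
        return v, a

tok = ToyAVTokenizer()
v_tokens, a_tokens = tok(torch.randn(1, 3, 16, 128, 128),   # 16 RGB frames
                         torch.randn(1, 1, 16000))          # 1 s of audio
print(v_tokens.shape, a_tokens.shape)  # (1, 256, 768) and (1, 50, 256)
```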

The model is trained with Flow Matching, learning straight trajectories between the data and noise distributions, and is optimised by minimising the regression loss $\|v_\theta(x_t, t, c) - (x_1 - x_0)\|^2$. The researchers adapted this robust, flow-based prior for video dubbing using Video In-Context Low-Rank Adaptation, allowing the model to steer synchronised generation toward target languages with minimal trainable parameters. The team constructed a paired dubbing dataset, addressing the lack of naturally occurring “perfect pairs” by leveraging the generative capacity of the pretrained model. This synthetic dataset, consisting of identical content in different languages, proved effective in training the model to generalise to diverse dubbing tasks, including those with varying lighting, pose, and expressive facial behaviours. Tests show the approach is more robust to challenging conditions, such as non-frontal views and partial occlusions, resulting in improved overall perceptual quality.
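The objective above translates directly into a short training step: sample a time $t$, interpolate along the straight line between noise $x_0$ and data $x_1$, and regress the predicted velocity onto $x_1 - x_0$. The sketch below is a generic flow-matching update with a toy MLP standing in for the diffusion transformer; it is not the paper's training code, and the conditioning scheme is simplified.

```python
import torch
import torch.nn as nn

def flow_matching_step(model, x1, cond, optimizer):
    """One flow-matching update: regress v_theta(x_t, t, c) onto (x1 - x0)."""
    x0 = torch.randn_like(x1)                  # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)             # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                 # straight-line interpolation
    target = x1 - x0                           # constant velocity along that line
    pred = model(torch.cat([xt, cond, t], dim=-1))
    loss = ((pred - target) ** 2).mean()       # ||v_theta(x_t, t, c) - (x1 - x0)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: an MLP stands in for the audio-visual diffusion transformer.
dim = 64
model = nn.Sequential(nn.Linear(2 * dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
print(flow_matching_step(model, torch.randn(8, dim), torch.randn(8, dim), opt))
```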

LoRA enables synthesised multilingual video dubbing with surprising robustness

Scientists have developed a new approach to video dubbing by framing it as a constrained audio-visual generation task. Experiments revealed that the full model generates novel lip movements aligned with the translated prompt, avoiding simple replication of source trajectories. The authors acknowledge that perfect preservation of speaker voice identity remains a challenge, suggesting a need for improved disentanglement of linguistic content and vocal style. Future research could extend this work to longer videos and more complex conversational scenarios, highlighting the potential of leveraging strong audio-visual priors for multimodal editing.

👉 More information
🗞 JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
🧠 ArXiv: https://arxiv.org/abs/2601.22143

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
