Scientists are addressing the challenge of computationally expensive unified multimodal models for image generation and editing, which typically require well over 10 billion parameters for strong performance. Dianyi Wang of Fudan University, Ruihang Li, and Feng Han, working with colleagues at Zhejiang University, the University of Southern California, the University of Science and Technology of China, and Shanghai Jiao Tong University, present DeepGen 1.0, a far lighter 5 billion parameter model. The work introduces Stacked Channel Bridging (SCB), an alignment framework designed to strengthen semantic understanding and control in compact models, together with a progressive three-stage training strategy. DeepGen 1.0 achieves results competitive with or superior to much larger models, surpassing the 80 billion parameter HunyuanImage 3.0 by 28% on WISE and the 27 billion parameter Qwen-Image-Edit by 37% on UniREditBench, and the team releases the training code, model weights, and datasets as open source, offering the wider research community an efficient, high-performing, and more accessible alternative.
This achievement challenges the prevailing trend towards ever-larger models, often exceeding 10 billion parameters, that demand substantial computational resources to train and deploy. DeepGen 1.0 demonstrates comprehensive capabilities competitive with, and in some cases surpassing, significantly larger counterparts, offering a pathway towards more accessible and efficient multimodal research. The researchers tackled the usual weaknesses of compact models, limited semantic understanding and coarse control, through architectural design and a carefully curated training strategy. The three-stage process of Alignment Pre-training, Joint Supervised Fine-tuning, and Reinforcement Learning with MR-GRPO synchronises visual and textual representations, fostering omni-capabilities and ensuring high-quality, artifact-free image creation and manipulation. By open-sourcing the training code, model weights, and datasets, the team aims to democratise unified multimodal research, providing an efficient, high-performance alternative for the wider scientific community.

The numbers bear this out. On DPG-Bench, a general instruction-following benchmark, DeepGen 1.0 scores 87.90, surpassing the 80 billion parameter HunyuanImage 3.0 at 86.10. On the WISE reasoning benchmark it reaches 0.73, a 28% improvement over HunyuanImage 3.0's 0.57. In image editing, it achieves 77.5 on UniREditBench, exceeding the dedicated 27 billion parameter Qwen-Image-Edit model's 56.5 by more than 37%. The entire training process required approximately 50 million samples, notably fewer than the 1.2 billion used for LongCat-Image or the 5 billion for HunyuanImage 3.0, underscoring the effectiveness of the study's data-centric training strategy.

DeepGen 1.0's architecture pairs a 3 billion parameter VLM for understanding and reasoning with a 2 billion parameter DiT for generative tasks. Learnable "think tokens" enhance reasoning by creating an implicit chain of thought that gives the generative process more structured guidance. Stacked Channel Bridging draws hidden states from multiple levels of the VLM, concatenates them channel-wise, and fuses them through a lightweight connector into a dense multimodal conditional sequence. This design preserves both fine-grained visual details and high-level semantics, providing the DiT with richer, more informative guidance than methods that rely on final VLM layers or average pooling, and enabling the model to handle complex instructions and reasoning-intensive tasks. During Alignment Pre-training, only the connector and the think tokens are optimised, aligning VLM representations with the DiT's latent space on large-scale image-text pairs and editing triplets. Joint Supervised Fine-tuning (SFT) then unfreezes the DiT and applies LoRA to the VLM for end-to-end optimisation, leveraging a curated data mixture of general and reasoning-based generation, editing, and text rendering.
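To make the bridging mechanism concrete, here is a minimal PyTorch sketch of the kind of multi-layer feature fusion SCB describes. The module name, hidden sizes, connector depth, and number of think tokens are illustrative assumptions rather than the released DeepGen 1.0 code; only the overall pattern, tapping several VLM layers, concatenating them channel-wise, and projecting them through a lightweight connector, follows the description above.

```python
import torch
import torch.nn as nn

class SCBConnector(nn.Module):
    """Fuses hidden states tapped from several VLM layers into a single
    conditioning sequence for the DiT (sketch; dimensions are illustrative)."""
    def __init__(self, num_tapped_layers=6, vlm_dim=2048, dit_cond_dim=1536):
        super().__init__()
        # Lightweight connector: a two-layer MLP over the channel-wise
        # concatenation of the tapped hidden states.
        self.proj = nn.Sequential(
            nn.Linear(num_tapped_layers * vlm_dim, dit_cond_dim),
            nn.GELU(),
            nn.Linear(dit_cond_dim, dit_cond_dim),
        )

    def forward(self, tapped_states):
        # tapped_states: list of (batch, seq_len, vlm_dim) tensors taken from
        # uniformly spaced low-, mid-, and high-level VLM layers.
        fused = torch.cat(tapped_states, dim=-1)   # channel-wise concatenation
        return self.proj(fused)                    # (batch, seq_len, dit_cond_dim)

# Learnable "think tokens": in the full model these would be appended to the
# VLM input sequence so their hidden states are tapped alongside the ordinary
# text and image tokens, forming an implicit chain of thought.
think_tokens = nn.Parameter(torch.randn(32, 2048) * 0.02)

# Toy usage: tap 6 layers of the VLM for a 256-token multimodal prompt.
tapped = [torch.randn(2, 256, 2048) for _ in range(6)]
cond = SCBConnector()(tapped)   # dense conditioning sequence for the DiT
print(cond.shape)               # torch.Size([2, 256, 1536])
```

The point of the channel-wise concatenation is that the DiT sees low-, mid-, and high-level features simultaneously, rather than only the final-layer or pooled representation that simpler VLM-to-diffusion bridges pass along.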
A streamlined connector module forms the core of DeepGen 1.0's architecture, aligning features between a pretrained Vision-Language Model (VLM) and a Diffusion Transformer (DiT). The VLM, a 3 billion parameter Qwen-2.5-VL, provides robust multimodal understanding and world knowledge, while the DiT, initialised from the 2 billion parameter SD3.5-Medium, serves as a high-fidelity image generation decoder. This VLM-DiT pairing leverages the strengths of both architectures, capturing complex multimodal priors while synthesising detailed images. SCB samples hidden states from six uniformly distributed layers of the VLM, spanning low, mid, and high levels of visual and textual processing; these multi-source features are channel-wise concatenated and fused by the lightweight connector into the dense conditional sequence consumed by the DiT. The learnable 'think tokens' further enhance reasoning, acting as an implicit chain of thought within the model.

Training follows the data-centric, three-stage strategy outlined above. Alignment Pre-training synchronises VLM and DiT representations on large-scale image-text pairs and editing triplets, optimising only the connector and the think tokens. Joint Supervised Fine-tuning (SFT) then unfreezes the DiT and applies LoRA (Low-Rank Adaptation) to the VLM, enabling end-to-end optimisation across a curated mixture of generation, editing, reasoning, and text-rendering data. Finally, Reinforcement Learning (RL) with MR-GRPO, a novel mixture-of-rewards variant of GRPO, aligns the model with human preferences, incorporating decoupled advantage normalisation and an auxiliary supervised diffusion loss to maintain broad capabilities.

DeepGen 1.0, a new multimodal model achieving state-of-the-art results with a comparatively modest parameter count, signals a potential turning point. For years the field has been dominated by the assumption that scale is the primary path to improved performance, demanding immense computational resources and limiting access to a privileged few. This work demonstrates that intelligent architectural choices and focused training strategies can deliver comparable, and in some cases superior, results with a fraction of the parameters. The significance extends beyond computational savings: a model of this size opens the door to deployment on less specialised hardware, potentially bringing advanced image generation and editing capabilities to mobile devices, embedded systems, and smaller research groups.

Even so, DeepGen 1.0 is not a panacea. While it surpasses larger models on specific benchmarks, the generalisability of these gains across all tasks remains to be fully established, and the reliance on carefully curated datasets and a multi-stage training process highlights the ongoing need for high-quality data and sophisticated training techniques. The next logical step will likely involve combining these efficiency gains with emerging techniques such as mixture-of-experts or pruning to further reduce model size without sacrificing performance.
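To ground the efficiency argument, it helps to see how little of the model is actually trainable in the early stages. The sketch below uses hypothetical component names and the Hugging Face peft library, which is an assumption on our part since the paper only states that LoRA is applied to the VLM, to illustrate the progressive unfreezing that the three stages describe. It is a schematic illustration of the pattern, not the authors' training code.

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model  # assumed tooling for the LoRA step

def set_trainable(module: nn.Module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

# `vlm`, `dit`, `connector`, and `think_tokens` stand for the hypothetical
# components described above: the 3B Qwen-2.5-VL, the 2B SD3.5-Medium DiT,
# the SCB connector, and the learnable think-token parameters.

def stage1_alignment(vlm, dit, connector, think_tokens):
    """Alignment Pre-training: only the connector and think tokens learn."""
    set_trainable(vlm, False)
    set_trainable(dit, False)
    set_trainable(connector, True)
    think_tokens.requires_grad = True

def stage2_joint_sft(vlm, dit, connector):
    """Joint SFT: unfreeze the DiT and adapt the VLM with low-rank updates."""
    set_trainable(dit, True)
    set_trainable(connector, True)  # presumably stays trainable end-to-end
    lora_cfg = LoraConfig(r=16, lora_alpha=32,  # illustrative LoRA rank/alpha
                          target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
    return get_peft_model(vlm, lora_cfg)  # base VLM weights remain frozen

# Stage 3 (RL with MR-GRPO) keeps the same trainable set but switches the
# objective to a mixture-of-rewards policy-gradient loss, with decoupled
# advantage normalisation and an auxiliary supervised diffusion loss so the
# model does not drift away from its SFT capabilities.
```

Even in the most permissive stage, the full 3 billion parameter VLM is never fine-tuned directly, which is consistent with the paper's framing of DeepGen 1.0 as an exercise in getting more out of less.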
Ultimately, the challenge lies in building models that are not only powerful but also sustainable and accessible, and DeepGen 1.0 represents a valuable contribution towards that goal.
👉 More information
🗞 DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
🧠 ArXiv: https://arxiv.org/abs/2602.12205
