UniGen-1.5: Reward Unification in Reinforcement Learning Enhances Image Generation and Editing Performance

The creation of realistic and easily editable images remains a key challenge in artificial intelligence, and researchers are continually striving to improve both the quality of generated images and the precision of image-editing tools. Rui Tian from Fudan University, alongside Mingfei Gao and Haiming Gang from Apple, and colleagues, now present UniGen-1.5, a new multimodal model that significantly advances performance in both image generation and editing. The team achieves this by unifying the training process around a shared set of reward models within a reinforcement learning framework, allowing the model to improve its ability to create and modify images simultaneously. Results demonstrate that UniGen-1.5 surpasses existing open-source state-of-the-art models, achieving competitive scores on established benchmarks and approaching the performance of proprietary image-generation systems.

The model architecture and training pipeline strengthen image understanding and generation capabilities while unlocking strong image-editing ability. The researchers propose a unified reinforcement learning (RL) strategy that improves image generation and image editing jointly via shared reward models. To further enhance editing performance, they propose a light Edit Instruction Alignment stage that significantly improves comprehension of editing instructions, which is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance, achieving overall scores of 0.89 on GenEval and 4.31 on ImgEdit.
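To make the shared-reward idea concrete, the sketch below shows one way a single scoring function could serve both tasks, assuming each editing rollout is paired with a text description of its intended result so the same text-to-image scorers can judge it. The reward-model stubs and weights are illustrative placeholders, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# A reward model maps (image, prompt) -> scalar score; a real system would
# wrap CLIP, an outcome reward model (ORM), UnifiedReward, and a
# human-preference model here, per the ensemble named in the paper.
RewardFn = Callable[[object, str], float]

@dataclass
class WeightedReward:
    name: str
    fn: RewardFn
    weight: float

def unified_reward(image: object, prompt: str, ensemble: List[WeightedReward]) -> float:
    """Weighted sum of shared reward models over a generated or edited image.

    For editing rollouts, `prompt` describes the intended edited result, so
    the same text-to-image scorers apply (an assumption of this sketch).
    """
    return sum(r.weight * r.fn(image, prompt) for r in ensemble)

# Stand-in scorers with fixed outputs; only the wiring matters here.
ensemble = [
    WeightedReward("clip", lambda img, p: 0.80, weight=0.25),
    WeightedReward("orm", lambda img, p: 1.00, weight=0.25),
    WeightedReward("unified_reward", lambda img, p: 0.70, weight=0.25),
    WeightedReward("human_pref", lambda img, p: 0.90, weight=0.25),
]

# The same call scores a text-to-image sample and an edited image alike.
score = unified_reward(image=None, prompt="a red cube on a blue table", ensemble=ensemble)
print(f"unified reward: {score:.3f}")
```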

Unified Training For Image Generation And Editing

The paper details the development and evaluation of UniGen-1.5, a unified framework for both text-to-image generation and image editing. The research focuses on achieving strong performance in each area without sacrificing capability in the other. The model uses a single architecture for both tasks, streamlining development and potentially improving generalization. Training proceeds in stages, beginning with pre-training on large datasets such as CC-3M, CC-12M, SAM-11M, and ImageNet to establish a foundational understanding of images and text.

This is followed by supervised fine-tuning on datasets such as BLIP-3o, T2I-2M, and ShareGPT-4o-Image. A dedicated Edit Instruction Alignment stage then refines image-editing capability, and finally, reinforcement learning with a combination of reward models further enhances performance. A key innovation is the conditioning strategy for editing: the conditioning embeddings are concatenated in the order semantic visual embeddings, then text embeddings, then low-level visual embeddings, which yields the best image-editing performance. The reinforcement learning stage uses an ensemble of reward models, including CLIP, an outcome reward model (ORM), UnifiedReward, and a human-preference model, to provide comprehensive feedback and improve both image quality and alignment with the text prompt.
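The concatenation order is easy to picture in code. The following sketch is illustrative only: the tensor shapes, embedding dimensions, and function name are assumptions, not values from the paper.

```python
import torch

def build_edit_condition(semantic_visual: torch.Tensor,
                         text: torch.Tensor,
                         low_level_visual: torch.Tensor) -> torch.Tensor:
    """Concatenate conditioning embeddings along the sequence dimension in
    the order reported to work best: [semantic visual | text | low-level visual]."""
    return torch.cat([semantic_visual, text, low_level_visual], dim=1)

batch, dim = 2, 64
cond = build_edit_condition(
    semantic_visual=torch.randn(batch, 16, dim),   # high-level tokens of the source image
    text=torch.randn(batch, 32, dim),              # embedded edit instruction
    low_level_visual=torch.randn(batch, 64, dim),  # fine-grained tokens of the source image
)
print(cond.shape)  # torch.Size([2, 112, 64])
```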

The model also employs a discrete detokenizer, which, while effective, can struggle with fine-grained details such as rendering text within images. Experiments demonstrate that UniGen-1.5 achieves strong results on benchmarks for both text-to-image generation, such as GenEval and DPG-Bench, and image editing, such as ImgEdit. Ablation studies confirm the effectiveness of the conditioning strategy and the reward-model ensemble. Importantly, improving image generation and editing does not negatively impact the model’s image-understanding abilities.

The research team acknowledges certain failure cases, particularly in rendering fine-grained details such as text and in maintaining consistent object identity during editing. Datasets used include CC-3M, CC-12M, SAM-11M, and ImageNet for pre-training; BLIP-3o, T2I-2M, ShareGPT-4o-Image, and GPT-Image-Edit-1.5M for supervised fine-tuning; and T2I-R1 and Edit-RL for reinforcement learning. In essence, this research presents a promising unified framework for both text-to-image generation and image editing, achieving strong performance through a carefully designed architecture, training strategy, and reward-model ensemble.
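For a compact view of the recipe, the configuration sketch below lays out the stages and datasets named above; the dataset-to-stage groupings follow the text, while the structure itself is purely illustrative.

```python
# Illustrative outline of the staged training recipe described in the
# article; everything beyond the stage and dataset names is a placeholder.
TRAINING_STAGES = [
    {"stage": "pre-training",
     "data": ["CC-3M", "CC-12M", "SAM-11M", "ImageNet"],
     "goal": "foundational image-text understanding"},
    {"stage": "supervised fine-tuning",
     "data": ["BLIP-3o", "T2I-2M", "ShareGPT-4o-Image", "GPT-Image-Edit-1.5M"],
     "goal": "instruction-following generation and editing"},
    {"stage": "edit instruction alignment",
     "data": ["editing instruction data"],
     "goal": "comprehension of editing instructions"},
    {"stage": "reinforcement learning",
     "data": ["T2I-R1", "Edit-RL"],
     "goal": "joint optimization of generation and editing via shared rewards"},
]

for s in TRAINING_STAGES:
    print(f"{s['stage']:>26}: {', '.join(s['data'])}")
```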

UniGen-1.5 Excels at Image Understanding and Generation

Scientists have developed UniGen-1.5, a unified multimodal large language model that demonstrates advanced capabilities in image understanding, generation, and editing, representing a significant step forward in artificial intelligence. The research team designed an effective model architecture and training pipeline to enhance performance across all three areas within a single system, streamlining the process and improving overall efficiency. Experiments reveal that UniGen-1.5 achieves a score of 0.89 on the GenEval benchmark and 86.83 on DPG-Bench, significantly outperforming recent models like BAGEL in image generation tasks. This demonstrates a substantial improvement in the model’s ability to create images from textual descriptions.

To further refine image-editing capability, the team introduced Edit Instruction Alignment, a post-training stage designed to improve the model’s comprehension of editing instructions. This stage optimizes the alignment between the instruction and the image semantics, resulting in more accurate and nuanced edits.
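The article does not spell out the stage's exact objective, so the following is a hedged sketch of one plausible formulation: fine-tuning the model with a standard next-token loss to predict a textual description of the intended edit, given the source image and instruction as context. The data format, loss, and shapes are assumptions, not the published method.

```python
import torch
import torch.nn.functional as F

def alignment_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over a target sequence describing the intended edit.

    logits:     (batch, seq_len, vocab) model outputs given the source-image
                tokens and the edit instruction as context (assumed setup).
    target_ids: (batch, seq_len) tokenized description of the expected result.
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_ids.reshape(-1))

batch, seq_len, vocab = 2, 8, 100
loss = alignment_loss(torch.randn(batch, seq_len, vocab),
                      torch.randint(0, vocab, (batch, seq_len)))
print(loss.item())
```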

Results demonstrate that UniGen-1.5 achieves an overall score of 4.31 on the ImgEdit benchmark, surpassing other open-source models such as OmniGen2 and achieving performance comparable to proprietary models such as GPT-Image-1. This highlights the model’s ability to perform complex image manipulations with a high degree of precision. A unified reinforcement learning strategy was also implemented, optimizing both image editing and generation using shared reward models and unlocking improved performance across both domains. This approach leverages stable text-to-image reward models to jointly enhance the model’s capabilities, creating a synergistic effect.

UniGen-1.5 Achieves Leading Multimodal Performance

UniGen-1.5 represents a significant advancement in unified multimodal models, demonstrating strong performance across image understanding, generation, and editing tasks. Building upon the existing UniGen framework, researchers enhanced the model’s architecture and introduced a novel Edit Instruction Alignment stage, which substantially improves the model’s ability to interpret and execute image editing instructions. A unified reinforcement learning strategy, employing shared reward models, further optimizes both image generation and editing capabilities, resulting in improvements to both fidelity and controllability.

Extensive experimentation confirms that UniGen-1.5 achieves state-of-the-art results on a variety of benchmarks. The team acknowledges certain limitations, notably a difficulty in rendering textual content within images and occasional visual inconsistencies in edited images. Future work will likely focus on integrating diffusion-based components to address the text rendering issue and developing dedicated reward models to enforce greater visual consistency during the editing process.

👉 More information
🗞 UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2511.14760

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Black Hole Maths Unlocks Secrets of How Energy Flows in Exotic Matter (February 10, 2026)

Hidden Rules of Physics Revealed by Limiting Information Access (February 10, 2026)

Quantum AI Shortcut Could Speed up Language Models with Reduced Complexity (February 10, 2026)