Artificial intelligence that seamlessly understands and generates visual content remains a major challenge, and current open-source models often struggle to balance the two capabilities. Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, and Bowen Zhang address this issue with Manzano, a new framework that significantly improves performance across both visual understanding and image generation. The team achieves this by combining a novel hybrid image tokenizer with a carefully designed training process, allowing a single system to process images and text within a shared semantic space. The resulting architecture delivers state-of-the-art results among unified multimodal models and rivals specialist systems, particularly on complex, text-rich tasks, marking a substantial step towards truly versatile artificial intelligence.
Large Multimodal Models: Progress and Benchmarks
Recent research into large multimodal models (LMMs) reveals significant progress across several key areas and establishes new benchmarks for evaluating performance. A prominent trend is the development of open-source LMMs, including InternVL, Qwen, xgen-mm, and Cambrian-1, alongside efforts to improve model efficiency, as demonstrated by MiniCPM-V. Researchers are also exploring specialized techniques, such as diffusion and flow-based models, and refining methods for image generation and understanding, a shift towards more accessible, efficient, and capable systems. Instruction tuning is proving effective at enhancing performance, and studies suggest that the tokenizer, which converts images into a format the model can process, plays a critical role in visual generation.
Hybrid Tokenizer Unifies Vision and Language Models
Researchers engineered Manzano, a novel multimodal large language model, to overcome the performance gap that often appears when visual understanding and image generation are combined. The core of this work is a hybrid image tokenizer that produces both continuous and discrete visual representations from input images, using a shared visual encoder to minimize conflict between the two tasks. For understanding, the tokenizer emits continuous embeddings, which have proven effective on detail-rich tasks such as document question answering and chart interpretation. For image generation, the system instead emits discrete tokens, letting the large language model reuse the same autoregressive next-token prediction strategy it applies to text, which simplifies generation and improves scalability.
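To make the split between the two representations concrete, the PyTorch sketch below shows one way a shared encoder could feed both a continuous adapter and a codebook-based discrete adapter. The module sizes, the single-layer stand-in "encoder", and the nearest-neighbor quantization are illustrative assumptions, not the paper's released implementation.

```python
# A minimal sketch of a hybrid image tokenizer: one shared encoder, two adapters.
# Dimensions, codebook size, and the toy encoder are placeholders for illustration.
import torch
import torch.nn as nn


class HybridImageTokenizer(nn.Module):
    def __init__(self, embed_dim=256, llm_dim=512, codebook_size=4096, patch_dim=3 * 16 * 16):
        super().__init__()
        # Shared vision encoder, here reduced to a single patch projection.
        self.encoder = nn.Linear(patch_dim, embed_dim)
        # Continuous adapter: features for understanding, in the LLM's embedding space.
        self.continuous_adapter = nn.Linear(embed_dim, llm_dim)
        # Discrete adapter: projects features, then quantizes them against a codebook.
        self.discrete_adapter = nn.Linear(embed_dim, llm_dim)
        self.codebook = nn.Embedding(codebook_size, llm_dim)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) flattened image patches
        feats = self.encoder(patches)                      # shared representation
        cont = self.continuous_adapter(feats)              # continuous path (understanding)
        disc = self.discrete_adapter(feats)                # pre-quantization features
        # Nearest-codebook-entry lookup yields discrete token ids for generation.
        dists = torch.cdist(disc, self.codebook.weight.unsqueeze(0))
        token_ids = dists.argmin(dim=-1)                   # (batch, num_patches)
        return cont, token_ids


tokenizer = HybridImageTokenizer()
dummy_patches = torch.randn(1, 256, 3 * 16 * 16)           # one image as 256 patches
continuous_embeddings, discrete_tokens = tokenizer(dummy_patches)
print(continuous_embeddings.shape, discrete_tokens.shape)  # (1, 256, 512) (1, 256)
```

In the full model, the continuous embeddings would feed the language model directly for understanding, while the discrete token ids serve as prediction targets for generation.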
The architecture incorporates a diffusion decoder to translate predicted image tokens into high-fidelity pixels, using a Diffusion Transformer in the latent space of a pre-trained variational autoencoder to keep computation efficient while preserving visual quality. Unlike conventional text-to-image diffusion models, Manzano conditions the diffusion process on visual token embeddings produced by the large language model. The team designed a unified autoregressive objective, training the large language model to predict text, continuous image embeddings, and discrete image tokens without auxiliary losses. To align the continuous and discrete visual representations, the researchers randomly sample the output during training and pass it to a small language-model decoder, keeping the procedure simple and scalable. This design readily leverages mature techniques, enabling independent scaling of both the base language model and the image decoder, and ultimately achieves state-of-the-art results among unified multimodal models.
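One way to picture the unified objective is a single next-token cross-entropy over a shared vocabulary in which discrete image codes sit after the text ids. The sketch below shows only that text/discrete-token branch (it omits the continuous-embedding prediction), and the vocabulary sizes and tiny transformer backbone are placeholders rather than Manzano's actual configuration.

```python
# A minimal sketch of a unified next-token objective over mixed text/image tokens.
# Vocabulary sizes and the toy backbone are illustrative assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB = 32000
IMAGE_VOCAB = 4096                        # discrete codes from the hybrid tokenizer
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_VOCAB  # image code i maps to id TEXT_VOCAB + i

embed = nn.Embedding(UNIFIED_VOCAB, 256)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(256, UNIFIED_VOCAB)

# A mixed training sequence: a text prompt followed by the image's discrete codes,
# shifted into the unified id range (values here are random stand-ins for real data).
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(0, IMAGE_VOCAB, (1, 64)) + TEXT_VOCAB
sequence = torch.cat([text_ids, image_ids], dim=1)

# One causal next-token loss covers both the text and the image positions.
inputs = sequence[:, :-1]
causal_mask = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))
hidden = backbone(embed(inputs), mask=causal_mask)
logits = lm_head(hidden)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, UNIFIED_VOCAB), sequence[:, 1:].reshape(-1)
)
print(loss.item())
```

Because image codes are ordinary ids in the unified vocabulary, no auxiliary generation loss is needed, which is what makes the recipe easy to scale with standard language-model infrastructure.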
Manzano Achieves Unified Visual Understanding and Generation
Manzano represents a significant advance in multimodal AI, demonstrating a unified model capable of both understanding and generating visual content. The architecture couples a shared visual encoder with two lightweight adapters, a continuous adapter for understanding tasks and a discrete adapter for image generation; because both branch from the same encoder, task conflict within the language model is minimized. Initial experiments show that the 3-billion-parameter Manzano model achieves competitive image generation while delivering significantly improved understanding performance, particularly on benchmarks that require precise visual perception. The team trained Manzano in three stages: pre-training the hybrid tokenizer to align image features with the language model, joint training on a mixture of text, image understanding, and image generation data, and supervised fine-tuning to enhance instruction following.
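As a rough picture of that recipe, the configuration sketch below lists the three stages in order. Which components are trainable in each stage and the data-mixture weights are invented placeholders for illustration, not values reported by the authors.

```python
# A schematic of the three-stage training recipe; stage names follow the summary,
# while trainable components and mixture weights are placeholder assumptions.
from dataclasses import dataclass


@dataclass
class TrainingStage:
    name: str
    trainable: list     # components assumed to receive gradients in this stage
    data_mixture: dict  # dataset family -> sampling weight (placeholder values)


RECIPE = [
    TrainingStage(
        name="tokenizer_pretraining",  # align image features with the language model
        trainable=["vision_encoder", "continuous_adapter", "discrete_adapter"],
        data_mixture={"image_text_pairs": 1.0},
    ),
    TrainingStage(
        name="joint_training",         # mixed text, understanding, and generation data
        trainable=["llm", "adapters", "diffusion_decoder"],
        data_mixture={"text": 0.4, "image_understanding": 0.3, "image_generation": 0.3},
    ),
    TrainingStage(
        name="supervised_finetuning",  # instruction-following polish
        trainable=["llm", "adapters"],
        data_mixture={"instruction_tuning": 1.0},
    ),
]

for stage in RECIPE:
    print(f"{stage.name}: train {stage.trainable} on {stage.data_mixture}")
```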
Results show substantial improvements across both understanding and generation benchmarks when scaling the language model decoder from 300 million to 30 billion parameters, demonstrating consistent gains from increased model size. Enlarging the diffusion decoder leads to significant gains in image structural integrity, as confirmed by large-scale human evaluations. Quantitative analysis reveals that Manzano achieves state-of-the-art performance on both understanding and generation tasks, surpassing other unified multimodal models on key benchmarks. Qualitative examples demonstrate the model’s ability to handle complex prompts, such as generating an image of “a bird flying below the elephant,” with comparable quality to leading models like GPT-4o. Ablation studies confirm minimal cross-task conflict during joint training, suggesting the architecture and training recipe effectively harmonize understanding and generation capabilities, even within a compact model.
Unified Visual and Text Generation Achieved
Manzano represents a significant advance in multimodal learning, demonstrating a unified framework capable of both understanding and generating visual content. The researchers developed a system that effectively combines a hybrid image tokenizer with a carefully designed training process, allowing a single model to process and create images and text with minimal performance trade-offs. This architecture utilizes a large language model to predict high-level semantics, expressed as both text and image tokens, and then employs a diffusion-based decoder to translate these image tokens into final pixel values. The results demonstrate that Manzano achieves state-of-the-art performance on visual understanding tasks and substantial gains in image generation compared to other unified models.
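The generation path can be summarized as: sample discrete image tokens from the language model, embed them as a conditioning signal, denoise a latent under that conditioning, and decode the latent to pixels. The sketch below walks through those steps with tiny placeholder networks standing in for the Diffusion Transformer and the pretrained VAE; the pooling, step count, and update rule are simplifications, not the paper's sampler.

```python
# A minimal sketch of the image-generation path: LLM-predicted image tokens
# condition a latent denoising loop, and a decoder maps the latent to pixels.
# All networks here are tiny placeholders, not the paper's components.
import torch
import torch.nn as nn

LATENT_DIM, COND_DIM, NUM_TOKENS, STEPS = 64, 256, 64, 10

token_embed = nn.Embedding(4096, COND_DIM)          # image-token embeddings from the LLM
denoiser = nn.Sequential(                           # stand-in for the Diffusion Transformer
    nn.Linear(LATENT_DIM + COND_DIM, 512), nn.GELU(), nn.Linear(512, LATENT_DIM)
)
vae_decoder = nn.Linear(LATENT_DIM, 3 * 32 * 32)    # stand-in for the pretrained VAE decoder

# 1) Assume the LLM has already generated discrete image tokens autoregressively.
image_tokens = torch.randint(0, 4096, (1, NUM_TOKENS))
cond = token_embed(image_tokens).mean(dim=1)        # pooled conditioning signal

# 2) Iteratively denoise a latent, conditioned on the LLM's token embeddings.
latent = torch.randn(1, LATENT_DIM)
for _ in range(STEPS):
    noise_pred = denoiser(torch.cat([latent, cond], dim=-1))
    latent = latent - noise_pred / STEPS            # crude Euler-style update

# 3) Decode the final latent into pixels.
pixels = vae_decoder(latent).reshape(1, 3, 32, 32)
print(pixels.shape)
```

Keeping the decoder separate from the language model is what allows the two to be scaled independently, as noted above.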
Importantly, the team observed minimal conflict between the understanding and generation capabilities, and scaling the model size consistently improved performance, validating the design choices. Beyond generation, Manzano also excels at image editing, enabling precise control over pixel-level details while following textual instructions. The authors acknowledge that further research is needed to explore the full potential of this approach, with future work planned to investigate conversational editing, reasoning capabilities, and integration with additional modalities. They suggest that the combination of a hybrid tokenizer, a unified autoregressive backbone, and an image decoder may offer even greater benefits for multimodal unification. These findings indicate that achieving both accuracy and creativity in a single model is possible with well-defined objectives and improved visual representations.
👉 More information
🗞 MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
🧠 arXiv: https://arxiv.org/abs/2509.16197
