Current multimodal large language models typically combine pre-existing vision and language components, making it challenging to understand how these systems truly scale with more data. Changyao Tian, Hao Li, and Gen Luo, along with their colleagues, address this problem by investigating the ‘native’ training of multimodal models, where both visual and language processing occur within a single, end-to-end system. Their research systematically explores the design choices and scaling behaviour of these native models under realistic data limitations, ultimately identifying an optimal architecture that balances performance and training cost. This work culminates in NaViL, a new native multimodal large language model that achieves competitive performance across a range of benchmarks, and reveals a strong positive correlation between the capacities of the visual and language components, offering valuable insights for future development in this rapidly evolving field.
Vision-Language Tasks and Multimodal Evaluation
This collection of examples is designed to evaluate a vision-language model’s ability to process both images and text. The data comprises diverse prompts and corresponding responses that test the model across a wide range of tasks, including image captioning, visual question answering, object recognition, and optical character recognition. Further examples probe document understanding (analyzing content such as contracts and extracting specific information), mathematical reasoning over diagrams, data interpretation from tables, and the recognition of mathematical formulas in images and their conversion to LaTeX. The inclusion of Chinese prompts additionally tests the model’s multilingual capabilities.
These examples vary in complexity, ranging from simple descriptions to complex reasoning tasks, and are based on real-world scenarios like receipts, contracts, and diagrams. This dataset highlights the importance of comprehensive evaluation for vision-language models. A successful model must accurately perceive visual information, understand language, reason and infer conclusions, handle complex problems, generalize to new scenarios, and process multiple languages. In summary, this data is valuable for evaluating vision-language models and driving progress in multimodal artificial intelligence.
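To make the structure of such an evaluation set concrete, a single example typically pairs an image with a task-specific prompt and a reference answer. The records below are a hypothetical illustration; the field names and contents are assumptions, not taken from the actual data.

```python
# Hypothetical evaluation records; field names and values are illustrative only.
examples = [
    {
        "image": "receipt_001.png",         # real-world scenario: a scanned receipt
        "task": "ocr",
        "prompt": "What is the total amount shown on this receipt?",
        "reference": "$42.80",
    },
    {
        "image": "sales_chart_017.png",     # data-interpretation scenario: a chart
        "task": "chart_qa",
        "prompt": "图中哪一年的销售额最高？",  # Chinese prompt: which year had the highest sales?
        "reference": "2021",
    },
]
```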
Native Multimodal Training with Mixture of Experts
Researchers have pioneered a new approach to multimodal large language models by focusing on native training, in which the vision and language spaces are optimized jointly in an end-to-end manner. This work systematically investigates the design choices and scaling properties of these native models, addressing the challenges posed by limited data and large-scale training requirements. The findings reveal that appropriate language model initialization significantly improves training convergence on multimodal data, and that pairing a visual encoder with a mixture-of-experts extension of the language model yields substantial performance gains over a standard dense language model. Based on these insights, the team constructed a meta-architecture designed to optimally balance performance and training cost.
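To make the mixture-of-experts idea concrete, the sketch below shows one way a feed-forward block inside the language model could route visual and text tokens to modality-specific experts. The class name, the two-expert setup, and the hard modality-based routing are illustrative assumptions, not the exact architecture described in the paper.

```python
import torch
import torch.nn as nn

class ModalityMoEFFN(nn.Module):
    """Feed-forward block with modality-specific experts (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # One expert handles text tokens, a separate expert handles visual tokens.
        self.text_expert = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        self.vision_expert = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_visual: (batch, seq) boolean mask marking image tokens.
        out = torch.empty_like(x)
        out[~is_visual] = self.text_expert(x[~is_visual])
        out[is_visual] = self.vision_expert(x[is_visual])
        return out
```

Routing on a token’s modality rather than on a learned gate keeps the added experts from interfering with the pre-trained text pathway, which is one plausible reason such an extension preserves the initialization benefits described above.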
Further investigation focused on scaling properties, revealing that while language model scaling follows conventional patterns, visual encoder scaling exhibits an upper bound due to language model capacity limitations, indicating that optimal encoder size is dependent on language model size. This led to the development of NaViL, a native multimodal large language model built with a simple and cost-effective recipe. Extensive experiments were conducted across diverse benchmarks, including image captioning and optical character recognition, using approximately 600 million pre-training image-text pairs. The results demonstrate that NaViL achieves competitive performance compared to existing compositional models, highlighting its practicality and capabilities. This work establishes critical findings regarding language model initialization, visual encoder selection, and scaling relationships, encouraging future research in native multimodal large language models.
NaViL Demonstrates Efficient Multimodal Scaling Properties
Scientists have achieved a breakthrough in multimodal large language models through the development of a novel, natively trained model called NaViL. This work systematically investigates the design and scaling properties of these models, discovering that careful initialization of the language model component significantly improves training convergence. Further investigation revealed a positively correlated scaling relationship between the visual encoder and the language model, demonstrating that the optimal visual encoder size increases proportionally with the language model size on a logarithmic scale. This contrasts with traditional compositional approaches that employ a fixed-size visual encoder across different language model scales.
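Read literally, that relationship amounts to a log-linear trend between the two parameter counts. The form below is only a summary of the stated finding; the intercept α and slope β are placeholders for fitted constants, not values reported in the paper.

```latex
\log N^{*}_{\mathrm{vis}} \;\approx\; \alpha + \beta \,\log N_{\mathrm{LLM}}, \qquad \beta > 0
```

Here N*_vis denotes the optimal visual encoder parameter count and N_LLM the language model parameter count, so a larger language model supports (and benefits from) a proportionally larger visual encoder.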
The team validated NaViL’s performance through extensive experimentation across diverse benchmarks, including image captioning and optical character recognition. Results demonstrate that NaViL achieves competitive performance compared to existing compositional models, utilizing approximately 600 million image-text pairs during pre-training. Specifically, NaViL achieves scores of 74.7 on the Mono-InternVL benchmark, 74.9 on the GQA benchmark, 80.4 on the ChartQA benchmark, 51.3 on the DesignQA benchmark, 78.3 on the MMBench benchmark, and 62.9 on the AI2D benchmark. These results highlight the practicality and capabilities of NaViL, paving the way for future advancements in natively trained multimodal large language models.
NaViL: A Scalable Native Multimodal Model
This research presents a systematic investigation into the end-to-end training of multimodal large language models, focusing on how design choices and scaling impact performance under realistic data constraints. The study reveals that initializing from a pre-trained language model, combined with a visual encoder and a mixture-of-experts architecture, significantly improves overall capabilities. Importantly, the research demonstrates that scaling the visual encoder is bounded by the capacity of the language model, a departure from conventional language-only scaling behaviour. Based on these findings, the team developed NaViL, a native model that achieves competitive results against existing compositional models across a range of multimodal benchmarks.
Analysis of the model’s attention mechanisms indicates that earlier interaction between visual and textual features promotes better alignment between modalities, offering insight into the performance gains achieved with larger encoder sizes. The authors acknowledge that their investigation was limited to models up to 9 billion parameters due to computational resources, and suggest that future work could explore scaling to even larger models, as well as incorporating a broader range of modalities beyond vision and language. This research provides valuable insights for the development of next-generation models and offers a deeper understanding of the interplay between visual and linguistic processing in these complex systems.
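As a rough illustration of what earlier interaction between modalities looks like in an early-fusion design, the sketch below projects image patches into the language model’s embedding space and concatenates them with text embeddings before a shared transformer, so attention can mix the two modalities from the first layer onward. All class names, dimensions, and layer counts are assumptions for illustration, not NaViL’s actual configuration.

```python
import torch
import torch.nn as nn

class EarlyFusionBackbone(nn.Module):
    """Shared transformer over a joint visual + text token sequence (illustrative sketch)."""

    def __init__(self, d_model: int = 512, n_layers: int = 4, n_heads: int = 8,
                 vocab_size: int = 32000, patch_dim: int = 768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)   # map image patches into the model width
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # patches: (batch, n_patches, patch_dim); token_ids: (batch, n_text)
        vis = self.patch_proj(patches)
        txt = self.text_embed(token_ids)
        fused = torch.cat([vis, txt], dim=1)   # one joint sequence
        return self.backbone(fused)            # every layer attends across both modalities

# Example usage with dummy inputs:
model = EarlyFusionBackbone()
out = model(torch.randn(2, 64, 768), torch.randint(0, 32000, (2, 16)))
print(out.shape)  # torch.Size([2, 80, 512])
```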
👉 More information
🗞 NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
🧠 ArXiv: https://arxiv.org/abs/2510.08565
