Researchers are tackling the computational burden currently limiting the widespread use of Multimodal Large Language Models (MLLMs) in universal multimodal retrieval. Qi Li, Yanzhe Zhao, and Yongxin Zhou, from Honor Device Co., Ltd, alongside Yameng Wang, Yandong Yang, Yuanjia Zhou et al., present a new approach, termed Magic-MM-Embedding, which significantly improves both the efficiency and performance of these models. This work addresses a critical challenge: the substantial processing cost associated with visual inputs. It does so by introducing an efficient MLLM architecture with visual token compression and a novel multi-stage training strategy. Consequently, Magic-MM-Embedding not only restores multimodal capabilities but also surpasses existing methods in both accuracy and inference speed, paving the way for more practical and scalable multimodal applications.
These models address a critical limitation of current multimodal large language models: the substantial computational burden imposed by processing lengthy sequences of visual tokens.
Architectural Innovation: Visual Token Compression for Efficiency
The research introduces an architecture incorporating visual token compression, reducing inference latency and memory requirements without sacrificing accuracy. This makes the models practical for large-scale retrieval systems. At the core of the approach is the combination of an efficient model architecture and a three-stage progressive training strategy.
The architecture employs a parameter-free spatial interpolation module to compress visual sequences by 75%, minimising token overhead and avoiding the complexities of trainable abstraction methods. This compressed representation is then refined through a carefully designed training pipeline, beginning with continued pretraining to restore foundational multimodal understanding and generation capabilities.
Enhancing Robustness Through Multi-Stage Contrastive Training
Following this initial phase, the research team implemented large-scale contrastive pretraining with hard negative mining to enhance the model’s ability to discriminate between relevant and irrelevant information. The training culminates in a task-aware fine-tuning stage, utilising a multimodal large language model as a ‘Judge’ to curate precise and challenging training data.
This coarse-to-fine approach ensures both robust performance and efficient learning. Comprehensive experiments demonstrate that Magic-MM-Embedding surpasses existing methods in performance while requiring significantly fewer visual tokens and exhibiting reduced inference latency. The resulting models establish a new state-of-the-art on natural image and visual document retrieval tasks.
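The hard negative mining used in the contrastive pretraining stage is described only at a high level. A minimal offline-mining sketch in Python, assuming pre-computed, l2-normalised embeddings (the helper name `mine_hard_negatives` and all shapes are illustrative, not from the paper), could look like this:

```python
import numpy as np

def mine_hard_negatives(queries: np.ndarray, corpus: np.ndarray,
                        positives: list[int], k: int = 5) -> list[list[int]]:
    """For each query, return the k most similar corpus items that are not
    its ground-truth positive -- i.e. candidate hard negatives."""
    sims = queries @ corpus.T                 # cosine similarity for unit vectors
    hard_negatives = []
    for i, row in enumerate(sims):
        ranked = np.argsort(-row)             # most similar first
        hard_negatives.append([int(j) for j in ranked[: k + 1]
                               if j != positives[i]][:k])
    return hard_negatives
```

In practice such mining is typically re-run periodically with the current model's embeddings, and the mined pool is filtered for false negatives before training.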
Validating Superior Performance and Efficiency Gains
By achieving superior performance with only a quarter of the typical visual tokens, the research validates the effectiveness of the co-designed compression and training strategy. This work paves the way for deploying high-performing multimodal embedding models in latency-critical applications, such as large-scale search and recommendation systems.
Deep Dive: Interpolation and Progressive Training Mechanics
A parameter-free spatial interpolation module forms the core of a new framework designed to improve multimodal embedding efficiency. This module projects long visual sequences into a compressed form, reducing token overhead by 75% without relying on learnable abstractors. The study implements a three-stage progressive training strategy to restore and enhance multimodal understanding. Continued pretraining first restores multimodal understanding and generation capabilities, preparing the model for more focused learning.
Subsequently, large-scale contrastive pretraining with hard negative mining enhances the model’s discriminative power, improving its ability to distinguish between relevant and irrelevant items. Finally, task-aware fine-tuning, guided by an MLLM-as-a-Judge, precisely curates data and optimizes performance for specific retrieval tasks.
This MLLM-as-a-Judge technique leverages the language model’s capabilities to evaluate and refine the training data, focusing on challenging “hard negative” examples. The framework was tested on MMEB, demonstrating state-of-the-art performance with reduced inference latency and fewer visual tokens. By co-designing architectural efficiency with the training strategy, the research establishes a new benchmark for MLLM embedders in universal multimodal retrieval, overcoming the limitations of previous dual-tower architectures and the quadratic scaling of full-sequence attention.
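The paper’s exact judging prompt and interface are not given. As a hedged illustration of the MLLM-as-a-Judge filtering idea, the sketch below drops mined negatives that a judge model deems relevant to the query (likely false negatives); `judge_fn` is a stand-in for any MLLM relevance call, not an API from the paper:

```python
from typing import Callable

def curate_hard_negatives(query: str, mined: list[str],
                          judge_fn: Callable[[str, str], bool]) -> list[str]:
    """Keep only mined candidates the judge confirms are irrelevant to the
    query, so the training pool contains genuinely challenging negatives."""
    return [cand for cand in mined if not judge_fn(query, cand)]

# Toy judge for demonstration only: 'relevant' iff the candidate mentions the query.
toy_judge = lambda q, c: q.lower() in c.lower()
print(curate_hard_negatives("cat", ["a cat on a mat", "a red car"], toy_judge))
# ['a red car'] -- the true match is removed from the negative pool
```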
Efficient Multimodal Embedding via Visual Token Compression and Coarse-to-Fine Training
Researchers developed a series of novel models, Magic-MM-Embedding, achieving high efficiency and state-of-the-art performance in universal multimodal embedding. These models incorporate visual token compression, drastically reducing inference latency and memory footprint when processing visual inputs.
The work demonstrates a coarse-to-fine training strategy that not only recovers multimodal understanding and generation capabilities but also significantly boosts performance across retrieval tasks. A key component of this research is a parameter-free visual token compression module inserted between the visual encoder and the connector.
This module employs bilinear interpolation on the spatial dimensions of the feature map, mitigating the computational bottleneck caused by long visual token sequences. The approach addresses the quadratic complexity of the large language model attention mechanism, improving processing speed and reducing memory requirements.
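Concretely, halving each spatial dimension of the feature map keeps one quarter of the tokens, matching the stated 75% reduction. Below is a minimal PyTorch sketch of such a parameter-free module; the grid size and embedding width are illustrative assumptions, and the paper’s exact configuration may differ:

```python
import torch
import torch.nn.functional as F

def compress_visual_tokens(tokens: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """Bilinearly downsample a (B, H*W, D) visual token sequence to
    (B, (H/2)*(W/2), D) -- a 75% token reduction with no learnable weights."""
    b, n, d = tokens.shape
    assert n == grid_h * grid_w, "token count must match the spatial grid"
    # (B, N, D) -> (B, D, H, W): restore the 2D layout for spatial interpolation
    grid = tokens.transpose(1, 2).reshape(b, d, grid_h, grid_w)
    pooled = F.interpolate(grid, scale_factor=0.5, mode="bilinear", align_corners=False)
    # (B, D, H/2, W/2) -> (B, N/4, D): back to a token sequence
    return pooled.flatten(2).transpose(1, 2)

x = torch.randn(1, 1024, 768)            # a 32x32 grid of 768-d visual tokens
y = compress_visual_tokens(x, 32, 32)
print(y.shape)                           # torch.Size([1, 256, 768])
```

Because the module has no parameters, it can be inserted between the visual encoder and the connector without adding anything to train.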
Extensive experiments validate the superiority of this holistic approach in creating a computationally efficient and highly effective model. The study utilizes a coarse-to-fine training pipeline, beginning with extensive continued pretraining to restore foundational multimodal abilities. This is followed by large-scale contrastive pretraining with hard negative mining to enhance discriminative power, culminating in task-aware fine-tuning guided by an MLLM-as-a-Judge for precise data curation.
This multi-stage process systematically builds robust discriminative power and achieves strong multi-task generalization. The research employs the InfoNCE loss function for model training, maximizing semantic alignment between queries and positive targets while suppressing negative samples. The models map inputs to a sequence of hidden states, with the final embedding obtained by applying l2 normalization to the hidden representation of the last token.
For a given query, a candidate set is defined that includes both ground-truth positive targets and negative samples obtained via in-batch sampling or hard negative mining. The objective is to minimize the negative log-likelihood of the positive target relative to the negative samples, with similarities scaled by a temperature parameter. This approach establishes new state-of-the-art results, validating the effectiveness of the proposed framework for universal multimodal retrieval.
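Putting the last two paragraphs together, a minimal sketch of the embedding extraction and the InfoNCE objective; the batch size, dimensions, and temperature value here are illustrative assumptions, not the paper’s settings:

```python
import torch
import torch.nn.functional as F

def embed(last_hidden: torch.Tensor) -> torch.Tensor:
    """l2-normalise the hidden state of the final token: (B, T, D) -> (B, D)."""
    return F.normalize(last_hidden[:, -1, :], dim=-1)

def info_nce(query: torch.Tensor, candidates: torch.Tensor,
             pos_idx: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """query: (B, D); candidates: (B, C, D) with one positive per row at pos_idx.
    Minimises the negative log-likelihood of the positive over the candidate
    set, with similarities scaled by the temperature."""
    logits = torch.einsum("bd,bcd->bc", query, candidates) / temperature
    return F.cross_entropy(logits, pos_idx)

q = embed(torch.randn(4, 10, 32))                    # query embeddings
cands = F.normalize(torch.randn(4, 8, 32), dim=-1)   # positives + negatives
loss = info_nce(q, cands, torch.zeros(4, dtype=torch.long))
```

Because the embeddings are unit-normalised, the dot products are cosine similarities, and `cross_entropy` over the candidate logits is exactly the temperature-scaled negative log-likelihood described above.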
Efficient Multimodal Representation Learning via Compressed Visual Tokens and Progressive Training
Researchers have developed a new series of models, termed Magic-MM-Embedding, that substantially improve the efficiency and performance of universal multimodal embedding. This advancement addresses a key limitation of multimodal large language models, which often struggle with the computational demands of processing extensive visual data.
The approach centres on an efficient multimodal large language model architecture that incorporates visual token compression, thereby reducing both inference latency and memory requirements. This work leverages a multi-stage progressive training strategy to optimise performance, beginning with continued pretraining to restore multimodal capabilities, followed by large-scale contrastive pretraining with hard negative mining to refine discriminative power.
The training culminates in task-aware fine-tuning, guided by a large language model functioning as a judge to ensure precise data curation. Experiments across multiple benchmarks, including Flickr30K, MSCOCO, ShareGPT4V, Urban1K, and SugarCrepe, demonstrate that the new model consistently outperforms existing methods, notably achieving a score of 91.6% on the SugarCrepe benchmark, a significant improvement over the 70.9% achieved by a comparable model.
Importantly, these gains are realised using only 64 visual tokens, a substantial reduction compared to standard methods, confirming that visual token compression enhances, rather than compromises, performance. The results demonstrate that visual token compression, when combined with the progressive training pipeline, improves both inference efficiency and cross-modal alignment.
Comparative analysis of inference costs reveals that the new model exhibits significantly lower latency than other popular multimodal embedding models, reducing query processing time from 162.8ms to 29.9ms on the MMEB dataset for a 2B parameter model. While the authors acknowledge the computational resources required for training these large models, the reduced inference cost offers a practical advantage for real-world applications. Future research may focus on further optimising the training process and exploring the potential of this approach with even larger models and more complex multimodal datasets.
🗞 Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs
🧠 ArXiv: https://arxiv.org/abs/2602.05275
