Zero-Shot Multilingual Retrieval Achieves 89.5% Recall with M2M Alignment

The limited availability of multimodal data in languages other than English presents a significant challenge for artificial intelligence systems, hindering their ability to process information across diverse linguistic landscapes. Piyush Singh Pasi from Amazon and colleagues address this issue with Multilingual-To-Multimodal (M2M), a method that leverages monolingual English text to bridge the gap between languages and modalities. Their research demonstrates that a simple alignment technique, utilising only linear layers and English text, can map multilingual text embeddings into a multimodal space while matching the performance of existing methods in English. The result is particularly significant because it enables strong zero-shot transfer across eleven languages, opening the door to more inclusive and globally accessible multimodal AI applications, and the team has released code and datasets to encourage further investigation. Ultimately, this work represents a step towards AI systems that understand and interact with the world in all its linguistic diversity.

The research team introduced M2M, a lightweight alignment method that leverages only English text to map text embeddings from multiple languages into a shared multimodal space. This innovative technique circumvents the need for costly and often unavailable multilingual image-text or audio-text resources, a common limitation in current multimodal models. Experiments show M2M matches baseline performance in English, achieving 94.9 percent Recall at 10, and delivers strong zero-shot transfer capabilities, averaging 89.5 percent Recall at 10 across 11 languages, including 10 previously unseen languages, on XTD text-to-image retrieval tasks.
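
The core recipe can be sketched in a few lines of code: learn a small projection that maps sentence embeddings from a multilingual text encoder onto the embeddings a frozen multimodal (CLIP-style) text encoder produces for the same English sentences. The snippet below is a minimal illustration only; the embedding dimensions, the two-layer mapper, and the cosine objective are assumptions made for the sketch rather than the authors' exact configuration, and random tensors stand in for real encoder outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed dimensions: 768-d multilingual sentence embeddings mapped into a
# 768-d multimodal (CLIP-style) text space. These are illustrative choices.
MULTI_DIM, CLIP_DIM = 768, 768

class AlignmentHead(nn.Module):
    """A minimal M2M-style mapper: just a couple of linear layers."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(MULTI_DIM, CLIP_DIM),
            nn.Linear(CLIP_DIM, CLIP_DIM),
        )

    def forward(self, x):
        return self.proj(x)

def train_step(head, optimizer, multilingual_emb, clip_text_emb):
    # Pull the projected multilingual embedding of an English sentence towards
    # the frozen multimodal text embedding of the same sentence (cosine loss;
    # the paper's exact objective may differ).
    projected = F.normalize(head(multilingual_emb), dim=-1)
    target = F.normalize(clip_text_emb, dim=-1)
    loss = 1.0 - (projected * target).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: random tensors stand in for the two encoders' outputs.
head = AlignmentHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
english_multi = torch.randn(32, MULTI_DIM)   # multilingual encoder, English batch
english_clip = torch.randn(32, CLIP_DIM)     # frozen CLIP text encoder, same batch
print(train_step(head, opt, english_multi, english_clip))
```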

The study unveils a surprisingly effective method for aligning multilingual representations with multimodal data, relying on a minimal set of linear layers. Researchers trained these layers on English text alone, effectively treating English as a linguistic anchor that bridges the gap between languages and modalities. Qualitative analyses using t-SNE visualizations confirm that multilingual embeddings align tightly with multimodal representations, indicating a successful transfer of knowledge. Detailed weight analysis further reveals that the transformation reshapes the geometry of the embeddings rather than simply rotating them, suggesting a more nuanced and effective alignment process.
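
One generic way to test whether a learned linear map behaves like a rotation is to examine its singular values and its deviation from orthogonality: a pure rotation has all singular values equal to one and satisfies W^T W = I, whereas a spread of singular values indicates that directions are being stretched and compressed, i.e. the geometry is reshaped. The diagnostic below illustrates this kind of check on an arbitrary square weight matrix; it is a generic sketch, not the authors' specific analysis.

```python
import torch

def rotation_diagnostics(weight: torch.Tensor) -> dict:
    # Singular values of a pure rotation are all 1 and W^T W equals the
    # identity; a wide singular-value spread and a large orthogonality gap
    # indicate the map reshapes the embedding geometry instead.
    singular_values = torch.linalg.svdvals(weight)
    identity = torch.eye(weight.shape[1])
    ortho_gap = torch.linalg.norm(weight.T @ weight - identity).item()
    return {
        "sv_min": singular_values.min().item(),
        "sv_max": singular_values.max().item(),
        "orthogonality_gap": ortho_gap,
    }

# Example on a random 768x768 matrix standing in for a trained projection.
print(rotation_diagnostics(torch.randn(768, 768) / 768 ** 0.5))
```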

This work establishes that substantial improvements in multimodal performance can be achieved through improved latent-space alignment, even with limited data and computational resources. Beyond image-text retrieval, the M2M method generalizes effectively to audio-text retrieval and cross-lingual text-to-image generation, demonstrating its versatility. To facilitate further research, the team released code, checkpoints, and newly constructed multilingual evaluation datasets, including MSCOCO Multilingual 30K, AudioCaps Multilingual, and Clotho Multilingual, providing a valuable resource for the wider scientific community. The research opens new avenues for developing truly multilingual multimodal models, particularly for low-resource languages where acquiring paired data is challenging. By decoupling multimodal learning from the need for extensive multilingual resources, M2M offers a scalable and efficient solution for building AI systems that can seamlessly process and understand information across languages and modalities. This breakthrough has the potential to significantly advance applications such as cross-lingual information retrieval, machine translation, and content creation, paving the way for more inclusive and accessible AI technologies.

Multilingual Text-Image Alignment via Linear Mapping

The study introduces M2M, a method for aligning multilingual text embeddings with multimodal representations, addressing the performance limitations of multimodal models in languages other than English. Researchers engineered a lightweight alignment technique employing only linear layers, trained exclusively on English text, to map embeddings from multiple languages into a shared multimodal space. This approach circumvents the need for extensive multilingual multimodal datasets, a significant constraint in current methodologies. The core innovation lies in leveraging robust multilingual text encoders and focusing on latent-space alignment rather than large-scale pretraining.

Beyond image-text retrieval, the research extended M2M’s applicability to audio-text retrieval and cross-lingual text-to-image generation, showcasing its generalizability. The study also released several multilingual evaluation datasets, including MSCOCO Multilingual 30K, AudioCaps Multilingual, and Clotho Multilingual, to facilitate further investigation in the field. These datasets, alongside the released code and checkpoints, enable replication and expansion of the work. This methodological advancement offers a data-efficient and parameter-light solution for unlocking multilingual capabilities in multimodal models, relying on English as a shared anchor for alignment.

M2M Enables Strong Zero-Shot Multilingual Transfer

Scientists achieved a breakthrough in multimodal model performance, specifically addressing the significant drop in capability observed when moving beyond English to other languages. The research team introduced M2M, a lightweight alignment method utilising only English text to map multilingual text embeddings into a multimodal space, demonstrating a novel approach to cross-lingual transfer learning. Experiments revealed that this method matches baseline performance in English, achieving 94.9 percent Recall at 10, while simultaneously delivering strong zero-shot transfer across 11 languages, averaging 89.5 percent Recall at 10 even for languages unseen during alignment. This demonstrates a substantial advancement in multilingual multimodal understanding without requiring extensive multilingual training data.
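
Recall at 10 here means the fraction of text queries whose ground-truth image appears among the ten most similar images. Assuming one paired image per caption and cosine similarity as the ranking score (assumptions of this sketch, not details confirmed by the paper), the metric can be computed as follows.

```python
import torch
import torch.nn.functional as F

def recall_at_k(text_emb: torch.Tensor, image_emb: torch.Tensor, k: int = 10) -> float:
    # Text-to-image Recall@k with a one-to-one caption/image pairing assumed:
    # query i counts as correct if image i lands among its top-k neighbours.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    sims = text_emb @ image_emb.T                    # (num_texts, num_images)
    topk = sims.topk(k, dim=-1).indices              # k nearest images per query
    targets = torch.arange(text_emb.shape[0]).unsqueeze(1)
    return (topk == targets).any(dim=-1).float().mean().item()

# Sanity check with random embeddings (a low score is expected by chance).
print(recall_at_k(torch.randn(100, 512), torch.randn(100, 512), k=10))
```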

The core of this work is a linear-layer alignment technique, requiring only approximately 1.2 million parameters, that reshapes embedding geometry rather than simply rotating it. Measurements confirm that multilingual embeddings align tightly with multimodal representations, as visualised through t-SNE analysis, indicating a successful transfer of semantic information. Tests demonstrate the method’s data efficiency, achieving strong performance with as few as 1,000 sentences, and its generalizability across architectures, modalities, and languages, including those not encountered during initial multimodal pre-training. The result offers a pathway to leverage existing English-centric multimodal models for a far wider range of languages.
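
The quoted parameter budget is consistent with a couple of square linear layers at typical embedding widths; for example, two bias-equipped 768-to-768 layers come to roughly 1.18 million parameters. The arithmetic below illustrates the count for that assumed configuration, which may not match the paper’s exact layer sizes.

```python
# Parameter count for an assumed 768 -> 768 -> 768 two-layer linear mapper.
dims = [768, 768, 768]
params = sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))
print(params)  # 2 * (768*768 + 768) = 1,181,184, roughly 1.2 million
```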

Further experiments demonstrated METAL’s versatility beyond image-text retrieval, successfully generalizing to audio-text retrieval and cross-lingual text-to-image generation. To facilitate further research, the team released code and checkpoints, alongside newly constructed multilingual evaluation datasets, including MSCOCO Multilingual 30K with 270,000 samples, AudioCaps Multilingual containing 160,000 samples, and Clotho Multilingual with 172,000 samples. These datasets provide a unified and reproducible benchmark for evaluating multilingual multimodal models, enabling the wider scientific community to build upon these findings. The study highlights the potential of latent-space alignment to achieve comparable capabilities with significantly reduced data and computational demands.

Multilingual Alignment via Minimal Multimodal Mapping

This work introduces M2M, a novel and efficient method for aligning multilingual latent spaces with multimodal spaces. The researchers demonstrate that strong performance can be achieved using only a few linear layers and English text data, reducing the need for extensive multilingual or multimodal corpora. M2M successfully aligns representations, as evidenced by high recall rates in text-to-image retrieval, achieving 94.9 percent Recall at 10 for English and 89.5 percent averaged across eleven languages in zero-shot transfer scenarios. The significance of these findings lies in the potential to broaden the accessibility and effectiveness of multimodal models beyond English.
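
In zero-shot use, the projection trained only on English is simply applied to embeddings of sentences in any language the multilingual encoder covers, and retrieval then happens in the shared multimodal space. The sketch below shows that usage pattern with placeholder tensors and a stand-in linear head; a real pipeline would plug in the trained mapper and precomputed image embeddings.

```python
import torch
import torch.nn.functional as F

def cross_lingual_retrieve(query_emb, image_emb, alignment_head, k=10):
    # Zero-shot retrieval: project the non-English query embedding with the
    # English-trained alignment head, then rank images by cosine similarity.
    with torch.no_grad():
        query = F.normalize(alignment_head(query_emb), dim=-1)
        images = F.normalize(image_emb, dim=-1)
        scores = query @ images.T
    return scores.topk(k, dim=-1).indices

# Toy usage with placeholders: a stand-in linear head, one "non-English" query
# embedding, and 1,000 precomputed candidate image embeddings.
head = torch.nn.Linear(768, 768)
query = torch.randn(1, 768)
candidates = torch.randn(1000, 768)
print(cross_lingual_retrieve(query, candidates, head))
```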

Qualitative analyses, including t-SNE visualizations, confirm the close alignment of projected multilingual embeddings with multimodal representations, and the method generalises to both audio-text retrieval and cross-lingual text-to-image generation. The authors acknowledge limitations relating to the absence of token-level alignment, which may restrict application to Multimodal Large Language Models requiring fine-grained representations. Future research directions include extending the framework to incorporate local alignment signals and further exploring joint cross-modal representations. To facilitate continued investigation, the researchers have released code, checkpoints, and newly created multilingual evaluation datasets encompassing MSCOCO Multilingual 30K, AudioCaps Multilingual, and Clotho Multilingual. These resources should prove valuable for the wider research community seeking to advance multilingual multimodal learning.

👉 More information
🗞 Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text
🧠 ArXiv: https://arxiv.org/abs/2601.10096

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
