Multimodal Fine-Tuning Achieves Enhanced Visual Understanding with Synthetic Captions

Researchers are tackling the disconnect between how artificial intelligence learns and how it is ultimately used, with a new method for enhancing image classification accuracy. Shohei Enomoto from NTT Tokyo, Japan, and Shin’ya Yamaguchi demonstrate a technique to convert standard image datasets into multimodal formats using synthetic captions generated by large language models. This approach bridges the gap between the increasingly multimodal pre-training of AI and the typically unimodal fine-tuning process, allowing models to better leverage richer, pre-trained visual understanding. By generating tailored captions and employing a novel contrastive loss function, their work achieves state-of-the-art results across thirteen image classification benchmarks, offering a significant advancement in dataset enhancement and a new paradigm for multimodal learning.

Synthetic captions lift unimodal fine-tuning to multimodal learning

Scientists have demonstrated a novel approach to bridge the gap between multimodal pre-training and unimodal fine-tuning of deep neural networks, addressing a fundamental limitation in current computer vision methodologies. The research team achieved this by transforming unimodal datasets into multimodal ones, utilising Multimodal Large Language Models (MLLMs) to generate synthetic image captions specifically for fine-tuning models with a multimodal objective. This method employs carefully designed prompts, incorporating both class labels and relevant domain context, to produce high-quality captions tailored for accurate image classification tasks. Furthermore, the study introduces a supervised contrastive loss function that actively encourages the clustering of representations belonging to the same class during the fine-tuning process.
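In outline, the dataset conversion could look like the minimal sketch below. This is an illustration rather than the authors' code: mllm_generate is a hypothetical captioning call standing in for whichever MLLM is used, and the prompt shown here is a paraphrase (the exact template reported by the authors appears later in this article).

    # Minimal sketch: turn a unimodal (image, label) dataset into a
    # multimodal (image, caption, label) dataset with an MLLM captioner.
    # `mllm_generate` is a hypothetical helper, not the authors' API.

    def build_prompt(class_name: str, domain: str) -> str:
        # The class label and domain context steer the MLLM toward captions
        # that help discriminate between classes (paraphrased prompt).
        return (f"Describe this {class_name} photo so it can be distinguished "
                f"from other {domain} photos, in about 50 words.")

    def to_multimodal(dataset, class_names, domain, mllm_generate):
        # `dataset` yields (image, label_index) pairs.
        triples = []
        for image, label in dataset:
            prompt = build_prompt(class_names[label], domain)
            caption = mllm_generate(image=image, prompt=prompt)  # synthetic caption
            triples.append((image, caption, label))
        return triples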

The core of this breakthrough lies in the creation of synthetic datasets, effectively augmenting existing unimodal data with rich textual information previously unavailable. Researchers leveraged MLLMs to generate captions that go beyond simple descriptions, incorporating nuanced details relevant to the classification task at hand. This process not only expands the dataset but also provides the fine-tuning model with a more comprehensive understanding of the visual content. A novel inference technique was also developed, which leverages class-averaged text embeddings derived from multiple synthetic captions per image, enhancing the model’s ability to generalise and discriminate between classes.
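As a rough sketch of that inference step, assuming PyTorch and CLIP-style encoders (encode_image and encode_text are placeholders for the fine-tuned model, and captions_by_class maps each class to its synthetic captions; this is not the authors' exact interface):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def classify_with_caption_prototypes(image, captions_by_class,
                                         encode_image, encode_text):
        # Embed the query image and L2-normalise it.
        img = F.normalize(encode_image(image), dim=-1)              # (1, d)
        prototypes = []
        for class_name, captions in captions_by_class.items():
            # Embed every synthetic caption for this class, then average.
            txt = F.normalize(encode_text(captions), dim=-1)        # (n_captions, d)
            prototypes.append(txt.mean(dim=0))
        prototypes = F.normalize(torch.stack(prototypes), dim=-1)   # (n_classes, d)
        # Predict the class whose averaged caption embedding is most similar.
        scores = img @ prototypes.T
        return scores.argmax(dim=-1)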

Extensive experiments conducted across thirteen diverse image classification benchmarks confirm the efficacy of this approach. The study reveals significant performance improvements, particularly in challenging few-shot learning scenarios where labelled data is scarce. The team’s method consistently outperformed baseline techniques, establishing a new paradigm for dataset enhancement that effectively unlocks the full potential of multimodal pre-training. This work establishes that by aligning pre-training and fine-tuning modalities, models can benefit from richer representations and achieve superior performance on downstream tasks.

The researchers’ approach not only improves accuracy but also demonstrates the potential for zero-shot image classification, surpassing the performance of fine-tuned models trained with limited data, specifically 1, 4, and 8 shots per class. This innovation opens new avenues for applying powerful pre-trained models to a wider range of image classification problems, especially in situations where acquiring large, labelled datasets is impractical or costly. The availability of the code at https://github.com/s-enmt/MMFT further facilitates the adoption and extension of this research by the wider scientific community. By effectively bridging the gap between multimodal pre-training and unimodal fine-tuning, this work promises to accelerate progress in computer vision and unlock new possibilities for image understanding and analysis.

Synthetic captions and contrastive loss for fine-tuning image classifiers

Scientists developed a novel methodology to bridge the gap between multimodal pre-training and unimodal fine-tuning of deep neural networks, addressing a fundamental limitation in current computer vision practices. The research team transformed unimodal datasets into multimodal datasets by employing Multimodal Large Language Models (MLLMs) to generate synthetic image captions, enabling fine-tuning with a multimodal objective. These MLLMs were guided by carefully designed prompts that incorporated both class labels and relevant domain context, ensuring the production of high-quality captions specifically tailored for image classification tasks. To further refine the learning process, the study pioneered a supervised contrastive loss function during fine-tuning, explicitly encouraging the clustering of representations belonging to the same class within the embedding space.

This contrasts with standard contrastive learning methods, which can sometimes disperse similar images across the representation space. Experiments employed 13 image classification benchmarks to rigorously evaluate the performance of this approach. The team also introduced a new inference technique that harnessed class-averaged text embeddings derived from multiple synthetic captions generated per image, providing a richer and more nuanced understanding of each class. This innovative approach achieves substantial improvements, particularly in few-shot learning scenarios, where the model readily overfits to task-specific distributions.
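For reference, a generic supervised contrastive loss of this kind takes only a few lines of PyTorch. The sketch below is the standard single-encoder formulation, included purely to illustrate the clustering idea; the paper's image-text variant may differ in detail.

    import torch
    import torch.nn.functional as F

    def supervised_contrastive_loss(features, labels, temperature=0.07):
        # features: (B, d) embeddings; labels: (B,) integer class-id tensor.
        features = F.normalize(features, dim=-1)
        sim = features @ features.T / temperature                    # pairwise similarities
        self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
        sim = sim.masked_fill(self_mask, float("-inf"))              # exclude self-pairs
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)   # log-softmax over the batch
        # Positives are the other samples in the batch with the same label.
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
        pos_count = pos_mask.sum(dim=1).clamp(min=1)
        loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count
        return loss.mean()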

The work establishes a new paradigm for dataset enhancement, effectively leveraging the benefits of multimodal pre-training during the fine-tuning stage. Researchers meticulously crafted prompts for the MLLMs, incorporating class labels and domain context to generate captions that are both accurate and informative. The supervised contrastive loss function was implemented to explicitly guide the alignment of semantically similar images, fostering a more cohesive and discriminative embedding space. Furthermore, the novel inference technique, averaging text embeddings across multiple synthetic captions, delivers a more robust and representative class signature for improved image classification. Extensive experiments demonstrate that this method outperforms baseline techniques, and notably, the generated captions enable zero-shot image classification that surpasses fine-tuned approaches using 1, 4, and 8 shots per class, highlighting the high discriminative power of the synthetic captions in data-constrained environments.

MLLMs generate captions for enhanced image fine-tuning performance

Scientists have developed a new method to enhance deep neural network fine-tuning by bridging the gap between multimodal pre-training and unimodal adaptation. The research addresses a fundamental limitation where pre-training increasingly utilises multiple data types, while fine-tuning often remains restricted to single modalities. The proposed approach transforms unimodal datasets into multimodal ones, using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for improved fine-tuning. These captions are specifically designed to enrich the training process with multimodal objectives.

The team measured the performance of their method across 13 image classification benchmarks, demonstrating consistent improvements over baseline techniques. A carefully designed prompting strategy, incorporating class labels and domain context, was employed to produce high-quality captions tailored for classification tasks. Results demonstrate that the generated captions effectively augment the training data, leading to enhanced model performance, particularly in scenarios with limited training data. The work introduces a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, improving the model’s ability to discriminate between categories.

Furthermore, the scientists introduced a new inference technique that leverages class-averaged text embeddings from multiple synthetic captions per image. Tests show that averaging embeddings from diverse captions provides a more robust and accurate representation for classification, effectively combining information from different perspectives and improving generalization. The approach offers a new paradigm for dataset enhancement, bridging the gap between multimodal pre-training and fine-tuning, and the synthetic datasets it generates measurably improve classification accuracy.

The research employed a prompt template structured as: “To differentiate this [class name] photo from other [domain] photos, describe its primary [characteristics] characteristics based on the photo in 50 words.” Three distinct captions were generated per image, focusing on visual, shape, and texture characteristics, with each caption limited to 50 words to ensure compatibility with CLIP’s 77-token maximum context length. The study establishes a new framework for leveraging MLLMs to create richer training data and improve the performance of image classification models.
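Based on that description, the prompts could be assembled roughly as follows; the template wording and the visual/shape/texture split are taken from the article above, while the helper function itself is only a sketch and may differ from the released code.

    # Prompt template as reported, with placeholders for class name, domain,
    # and the characteristic each caption should focus on.
    PROMPT_TEMPLATE = (
        "To differentiate this {class_name} photo from other {domain} photos, "
        "describe its primary {characteristic} characteristics based on the "
        "photo in 50 words."
    )

    # Three captions per image, one per characteristic type.
    CHARACTERISTICS = ("visual", "shape", "texture")

    def build_caption_prompts(class_name: str, domain: str) -> list[str]:
        # Each ~50-word caption stays comfortably under CLIP's 77-token text limit.
        return [
            PROMPT_TEMPLATE.format(class_name=class_name, domain=domain,
                                   characteristic=c)
            for c in CHARACTERISTICS
        ]

    # Example: build_caption_prompts("golden retriever", "dog") returns three
    # prompts asking for visual, shape, and texture descriptions respectively.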

👉 More information
🗞 MultiModal Fine-tuning with Synthetic Captions
🧠 ArXiv: https://arxiv.org/abs/2601.21426

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
