VTFusion Achieves 96.8% AUROC in Few-Shot Anomaly Detection with Vision-Text Fusion

Anomaly detection in industrial settings, where defects must be identified from only a handful of examples of normal products, has driven new research into few-shot anomaly detection (FSAD). Yuxin Jiang, Yunkang Cao, and Yuqi Cheng, all from Huazhong University of Science and Technology, alongside Yiheng Zhang and Weiming Shen, present a novel approach called VTFusion to address limitations in current vision-text multimodal methods. Their work distinguishes itself by moving beyond reliance on general natural-scene pre-training, instead focusing on the domain-specific semantics vital for accurate industrial inspection. By introducing adaptive feature extractors and a dedicated multimodal prediction fusion module, VTFusion achieves state-of-the-art results, including a 96.8% image-level AUROC on the MVTec AD dataset and a remarkable 93.5% AUPRO on a challenging real-world automotive parts dataset, demonstrating a significant step towards robust and practical anomaly detection in demanding industrial applications.

This breakthrough addresses critical limitations in existing methods, which often rely on features pre-trained on natural scenes and fail to capture the granular, domain-specific semantics crucial for accurate industrial inspection. The research team tackled the challenge of identifying irregularities using limited normal reference data by integrating both visual and textual information in a more robust and effective manner. VTFusion achieves this through two core designs, beginning with adaptive feature extractors for both image and text modalities, enabling the learning of task-specific representations and bridging the gap between pre-trained models and real-world industrial data.
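
The paper's adapter design is not reproduced here, but the general recipe it builds on, placing a small trainable module on top of a frozen pre-trained encoder so that only the adapter learns industrial-domain corrections, can be sketched in a few lines of PyTorch. Names such as VisionAdapter, the bottleneck size, and the pooled-feature interface are illustrative assumptions, not VTFusion's actual implementation:

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Illustrative bottleneck adapter placed on top of a frozen encoder.

    The pre-trained backbone stays fixed; only this small module learns
    task-specific (industrial-domain) corrections to its features.
    """
    def __init__(self, feat_dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(feat_dim, bottleneck)
        self.up = nn.Linear(bottleneck, feat_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pre-trained representation intact
        # while the adapter adds a learned, domain-specific offset.
        return x + self.up(self.act(self.down(x)))


def adapt_features(frozen_encoder: nn.Module, adapter: VisionAdapter,
                   images: torch.Tensor) -> torch.Tensor:
    """Run the frozen backbone, then refine its features with the adapter."""
    with torch.no_grad():                  # backbone receives no gradients
        feats = frozen_encoder(images)     # assumed to return (B, feat_dim) features
    return adapter(feats)                  # only the adapter is trained
```

The same pattern can be applied symmetrically to the text encoder, which is how both modalities can be adapted without fine-tuning the large pre-trained backbones.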

Furthermore, the team augmented feature discriminability by generating diverse synthetic anomalies, enhancing the system’s ability to distinguish between normal and abnormal instances. A dedicated multimodal prediction fusion module forms the second key component of VTFusion, comprising a fusion block that facilitates rich cross-modal information exchange and a segmentation network that generates refined pixel-level anomaly maps guided by multimodal input. Experiments demonstrate that VTFusion substantially advances FSAD performance, achieving image-level AUROCs of 96.8% and 86.2% in the 2-shot scenario on the MVTec AD and VisA datasets, respectively. This represents a significant leap forward in anomaly detection accuracy compared to existing techniques.
The innovation doesn’t stop at benchmark datasets; VTFusion also achieves an AUPRO of 93.5% on a newly introduced real-world dataset of industrial automotive plastic parts, validating its practical applicability in demanding industrial scenarios. This achievement underscores the framework’s ability to perform reliably in complex, real-world conditions. By effectively aligning visual and textual features and mitigating cross-modal interference, VTFusion learns discriminative hybrid features and generates fine-grained anomaly detection results, offering a powerful tool for enhancing product quality control and automating industrial inspection processes. The work opens exciting possibilities for deploying advanced anomaly detection systems in a variety of industrial applications, improving efficiency and reducing defects.

VTFusion framework for few-shot anomaly detection offers promising results

Scientists developed VTFusion, a novel vision-text multimodal fusion framework specifically for few-shot anomaly detection (FSAD), addressing limitations in existing methods that rely on features pre-trained on natural scenes and on superficial concatenation strategies for combining modalities. Experiments revealed image-level AUROCs of 96.8% on the MVTec AD dataset and 86.2% on the VisA dataset in the 2-shot scenario, demonstrating a substantial improvement over existing methods and confirming the robustness of VTFusion across diverse datasets. Further validation involved a real-world dataset of industrial automotive plastic parts, where VTFusion attained an AUPRO of 93.5%, highlighting its practical applicability in demanding industrial environments.
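
To make the "2-shot" protocol concrete: the detector sees only two normal reference images per category and must score every test image against them. A common way to exploit such references on the vision side, shown below purely as an illustrative baseline and not necessarily VTFusion's exact mechanism, is a small memory bank of normal patch features, with the anomaly score given by the distance to the nearest normal patch:

```python
import torch
import torch.nn.functional as F

def build_memory_bank(patch_feats_list):
    """Stack patch features from the k normal reference images (k = 2 here).

    Each element is a tensor of shape (num_patches, dim) from one reference.
    """
    return torch.cat(patch_feats_list, dim=0)           # (k * num_patches, dim)

def patch_anomaly_scores(test_patches: torch.Tensor,
                         memory_bank: torch.Tensor) -> torch.Tensor:
    """Score each test patch by its distance to the closest normal patch."""
    test = F.normalize(test_patches, dim=-1)
    bank = F.normalize(memory_bank, dim=-1)
    sim = test @ bank.T                                  # cosine similarity
    # High similarity to some normal patch -> low anomaly score.
    return 1.0 - sim.max(dim=-1).values                  # (num_test_patches,)

# Example with two reference images (the "2-shot" setting):
refs = [torch.randn(196, 768), torch.randn(196, 768)]
bank = build_memory_bank(refs)
scores = patch_anomaly_scores(torch.randn(196, 768), bank)
image_level_score = scores.max()                         # image score = worst patch
```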

The breakthrough delivers adaptive feature extractors for both image and text modalities, effectively bridging the domain gap between pre-trained models and industrial data; this is further enhanced by generating diverse synthetic anomalies to improve feature discriminability. Researchers recorded that the adaptive CLIP encoders facilitate the transfer of dominant features from source to target domains, enriching features with domain-specific information. Measurements confirm that incorporating four distinct types of synthetic anomalies establishes more compact normal feature distributions, leading to more discriminative features and superior single-modal prediction results. The team measured the performance gains achieved through this enhanced feature extraction process, demonstrating a clear correlation between feature discriminability and anomaly detection accuracy.
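
The paper reports four distinct types of synthetic anomalies without their recipes being repeated here; as a generic example of the idea, a cut-paste style perturbation (in the spirit of methods such as CutPaste, used here only for illustration) stamps a defect onto a normal image and simultaneously yields the binary mask needed to supervise a pixel-level head:

```python
import numpy as np

def cut_paste_anomaly(image: np.ndarray, rng: np.random.Generator,
                      min_frac: float = 0.05, max_frac: float = 0.15):
    """Create a synthetic defect by copying a random patch to a random location.

    Returns the corrupted image and a binary mask marking the pasted region,
    which can supervise a pixel-level segmentation head.
    """
    h, w = image.shape[:2]
    ph = int(h * rng.uniform(min_frac, max_frac))
    pw = int(w * rng.uniform(min_frac, max_frac))
    # Source and destination corners of the patch.
    sy, sx = rng.integers(0, h - ph), rng.integers(0, w - pw)
    dy, dx = rng.integers(0, h - ph), rng.integers(0, w - pw)

    out = image.copy()
    out[dy:dy + ph, dx:dx + pw] = image[sy:sy + ph, sx:sx + pw]

    mask = np.zeros((h, w), dtype=np.uint8)
    mask[dy:dy + ph, dx:dx + pw] = 1
    return out, mask

rng = np.random.default_rng(0)
normal = rng.random((256, 256, 3)).astype(np.float32)   # stand-in for a product image
anomalous, gt_mask = cut_paste_anomaly(normal, rng)
```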

VTFusion incorporates a dedicated multimodal prediction fusion module, comprising a fusion block that facilitates rich cross-modal information exchange and a segmentation network that generates refined pixel-level anomaly maps. Tests prove that this module effectively mitigates cross-modal interference, enabling semantic alignment between visual and textual features. Scientists achieved precise anomaly localization through the pixel-level supervised segmentation network, generating detailed anomaly maps that pinpoint irregularities with high accuracy. The framework learns discriminative hybrid features by integrating dedicated fusion blocks, resulting in fine-grained anomaly detection results and improved overall performance.
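
The exact fusion-block architecture is described in the paper itself; the following is only a rough approximation of the idea, in which visual patch tokens attend to text tokens via cross-attention and a lightweight convolutional head upsamples the fused tokens into a pixel-level anomaly map. Module names, dimensions, and the 14x14 patch grid are assumptions made for the sketch:

```python
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    """Visual patch tokens attend to text tokens; a small head then
    turns the fused tokens into a pixel-level anomaly map."""
    def __init__(self, dim: int = 768, heads: int = 8, grid: int = 14):
        super().__init__()
        self.grid = grid
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.seg_head = nn.Sequential(
            nn.Conv2d(dim, dim // 4, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim // 4, 1, 1),
            nn.Upsample(scale_factor=16, mode="bilinear", align_corners=False),
        )

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, grid*grid, dim); txt_tokens: (B, T, dim)
        fused, _ = self.cross_attn(vis_tokens, txt_tokens, txt_tokens)
        fused = self.norm(vis_tokens + fused)             # residual fusion
        b, n, d = fused.shape
        fmap = fused.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        return torch.sigmoid(self.seg_head(fmap))         # (B, 1, H, W) anomaly map

# Shape check: a 14x14 patch grid upsampled 16x gives a 224x224 anomaly map.
block = CrossModalFusionBlock()
amap = block(torch.randn(2, 196, 768), torch.randn(2, 8, 768))
```

Training such a head with pixel-level supervision from synthetic-anomaly masks is what allows the fused features to be decoded into fine-grained localisation maps rather than a single image-level score.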

Results demonstrate that VTFusion’s ability to combine commonsense knowledge encoded in text with domain-specific visual cues provides a comprehensive understanding of anomaly patterns. The study synthesised diverse anomalies to establish more compact normal feature distributions and thereby obtain more discriminative features, improving single-modal prediction results. The framework’s adaptive CLIP encoders mitigate the domain gap, while the multimodal prediction fusion module enhances the robustness of anomaly detection against cross-modal interference; this combination delivers a significant leap forward in FSAD technology.
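
How text contributes this kind of knowledge can be illustrated with the prompt-pairing strategy popularised by CLIP-based anomaly detectors (for example, WinCLIP-style prompt ensembles). The snippet below is only such an illustration, not VTFusion's specific prompt set: an image embedding is compared against pooled "normal" and "anomalous" text embeddings, and a softmax over the two similarities gives an anomaly probability.

```python
import torch
import torch.nn.functional as F

def text_guided_anomaly_prob(image_feat: torch.Tensor,
                             normal_txt: torch.Tensor,
                             abnormal_txt: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Compare an image embedding with averaged normal/abnormal text embeddings.

    image_feat: (dim,); normal_txt and abnormal_txt: (num_prompts, dim).
    Returns the probability that the image is anomalous.
    """
    img = F.normalize(image_feat, dim=-1)
    n = F.normalize(normal_txt.mean(dim=0), dim=-1)       # pooled "normal" prompts
    a = F.normalize(abnormal_txt.mean(dim=0), dim=-1)     # pooled "anomalous" prompts
    logits = torch.stack([img @ n, img @ a]) / temperature
    return torch.softmax(logits, dim=0)[1]                 # P(anomalous)

# Hypothetical prompts such as "a photo of a flawless metal part" versus
# "a photo of a scratched metal part" would be encoded by the text encoder
# to produce normal_txt and abnormal_txt; random tensors stand in here.
p = text_guided_anomaly_prob(torch.randn(512), torch.randn(5, 512), torch.randn(5, 512))
```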

VTFusion boosts industrial anomaly detection performance significantly

Scientists have developed VTFusion, a novel vision-text multimodal fusion framework for few-shot anomaly detection (FSAD) in industrial settings. This research addresses limitations in existing methods that rely on pre-trained models unsuitable for specific industrial data and employ simplistic fusion strategies. VTFusion introduces adaptive feature extractors for both image and text data, alongside synthetic anomaly generation, to create robust, domain-specific representations. The framework further incorporates a dedicated multimodal prediction fusion module, utilising self-attention to selectively amplify discriminative features and refine pixel-level anomaly maps.

Experiments on the MVTec AD and VisA datasets demonstrate VTFusion’s superior performance, achieving image-level AUROCs of 96.8% and 86.2% respectively in the 2-shot scenario, and an AUPRO of 93.5% on a real-world automotive plastic parts dataset. The authors acknowledge limitations in handling intricate elements not present in pre-trained datasets and a potential struggle with general prompts for the text encoder. Future research will explore generalised multimodal fusion frameworks integrating attention mechanisms across 2D, 3D, and textual modalities, alongside advanced anomaly synthesis methods to mitigate data scarcity and improve generalisation.
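
For readers checking the headline numbers, image-level AUROC is a threshold-free metric computed over per-image anomaly scores; a minimal evaluation sketch with scikit-learn is shown below (AUPRO, which integrates per-region overlap of the anomaly map up to a false-positive-rate cap, additionally requires connected-component analysis of the ground-truth masks and is omitted here):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def image_level_auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """AUROC over per-image anomaly scores.

    scores: higher = more anomalous; labels: 1 = defective, 0 = normal.
    """
    return roc_auc_score(labels, scores)

# Toy example: scores from a detector on 6 test images.
labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.35, 0.30, 0.80, 0.90])
print(f"image-level AUROC = {image_level_auroc(scores, labels):.3f}")  # 0.889
```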

👉 More information
🗞 VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection
🧠 ArXiv: https://arxiv.org/abs/2601.16381

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Universal Privacy Framework Achieves Untrusted Data Security in Distributed Quantum Sensing
January 28, 2026

Graphene Josephson Junctions Achieve Tunable Coupling at Zero-Point Fluctuations Level
January 28, 2026

Hard Problem Demonstrates Limits to Optimal Weight in Quantum Codes
January 28, 2026