VLM-Based Approaches Achieve Zero-Shot Anomaly Classification and Segmentation

Researchers are increasingly turning to Vision-Language Models (VLMs) to tackle the challenging problem of anomaly detection, offering a powerful alternative to traditional, data-hungry methods. Mohit Kakda, Mirudula Shri Muthukumaran, Uttapreksha Patel, and Lawrence Swaminathan Xavier Prince, all from Northeastern University, present a comprehensive analysis of VLM-based approaches for both anomaly classification and segmentation. Their work is significant because it systematically investigates how models like CLIP can identify defects with minimal labelled data, using natural language to define 'normal' and 'abnormal', a crucial step towards more adaptable and efficient quality control systems. By evaluating feature extraction, alignment strategies, and prompt engineering, the team provides foundational insights into the strengths and limitations of VLMs in this domain, paving the way for improved performance and wider industrial application.

WinCLIP

WinCLIP was the first work to successfully adapt CLIP for industrial anomaly detection. The core insight behind WinCLIP is that defects are often localized to small regions of an image, and a global image-level comparison might miss these subtle anomalies. To address this, the authors introduced a window-based approach that examines the image at a fine-grained level, combined with a carefully designed set of textual prompts that describe both normal and defective states.

3.1. Sliding-Window Feature Extraction

Rather than processing the entire image as a single entity, WinCLIP divides it into overlapping windows at multiple scales.
For an input image I, this produces a set of windows W = {w_1, w_2, …, w_N}, where each w_i ∈ R^{h×w}. Each window is passed independently through CLIP's image encoder to obtain dense patch-level embeddings: v_i = f_CLIP(w_i). The multi-scale nature of this approach is particularly clever: by using window sizes such as 2×2, 3×3, and 5×5, WinCLIP can detect both small localized defects (captured by smaller windows) and larger structural anomalies (captured by larger windows).

3.2. What Works Well and What Doesn't

WinCLIP's window-based approach has clear advantages.
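The sliding-window extraction described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the average-pooling of patch features into a window embedding and the toy 7×7 grid are assumptions standing in for CLIP's masked-attention pooling over ViT patch tokens.

```python
import numpy as np

def extract_windows(feature_grid, window_sizes=(2, 3, 5), stride=1):
    """Slide square windows of several sizes over a patch-feature grid.
    Each window is average-pooled into a single embedding (a simplified
    stand-in for WinCLIP's masked-attention window pooling)."""
    H, W, D = feature_grid.shape
    windows = []
    for k in window_sizes:
        for i in range(0, H - k + 1, stride):
            for j in range(0, W - k + 1, stride):
                emb = feature_grid[i:i + k, j:j + k].mean(axis=(0, 1))
                windows.append(((i, j, k), emb))  # keep position + scale
    return windows

# Toy 7x7 grid of 512-d patch features standing in for a ViT output.
grid = np.random.rand(7, 7, 512).astype(np.float32)
wins = extract_windows(grid)
print(len(wins))  # 36 + 25 + 9 = 70 windows across the three scales
```

Because each window keeps its (row, column, size) coordinates, the per-window scores computed later can be scattered back onto the image to form a spatial anomaly map.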

Because each window contributes independently to the anomaly map, the method excels at localizing small defects. The compositional prompt ensemble also provides robustness by capturing different ways of describing the same concept. However, the method has some notable limitations. First, designing effective prompts requires domain knowledge about the specific objects being inspected; a prompt that works well for detecting scratches on metal surfaces might not transfer well to defects in textiles. Second, the sliding-window evaluation is computationally expensive, especially when using multiple scales. Finally, because each window is processed in isolation, WinCLIP can struggle with anomalies that require understanding global structure or context, for example, detecting that a component is missing from an assembly, or that parts are misaligned relative to each other.

WinCLIP detects anomalies zero-shot via multi-scale analysis of overlapping image windows

The research systematically investigated key architectural paradigms, including sliding window-based dense feature extraction (WinCLIP) and multi-stage feature alignment with learnable projections, known as the AprilLab framework. Experiments revealed that WinCLIP divides images into overlapping windows, extracting dense patch-level embeddings at window sizes of 2×2, 3×3, and 5×5, enabling the detection of both small, localised defects and larger structural anomalies. This multi-scale approach provides a detailed, spatially-aware understanding of the image under inspection. Results demonstrate that WinCLIP's compositional prompt ensemble (CPE) constructs descriptions organised into normality and anomaly categories, expanding base prompts with template phrases to create richer descriptions for the CLIP text encoder.
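The compositional prompt ensemble amounts to crossing state descriptions with template phrases. A small sketch of the idea follows; the specific state words and templates here are illustrative assumptions, not the paper's exact prompt lists.

```python
from itertools import product

# Hypothetical state phrases and templates echoing the CPE idea:
# normality and anomaly states crossed with generic photo templates.
STATES = {
    "normal":  ["flawless {}", "perfect {}", "{} without defect"],
    "anomaly": ["damaged {}", "{} with a defect", "{} with a flaw"],
}
TEMPLATES = ["a photo of a {}.", "a cropped photo of the {}."]

def build_prompts(obj, label):
    """Expand every state phrase with every template for one object."""
    return [t.format(s.format(obj)) for s, t in product(STATES[label], TEMPLATES)]

prompts = build_prompts("metal nut", "anomaly")
print(len(prompts))  # 3 states x 2 templates = 6 prompts
```

Each resulting string would then be encoded by CLIP's text encoder, and the embeddings averaged (or kept as an ensemble) per class.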

Specifically, the model computes a similarity score s_i for each window by taking the maximum cosine similarity between the window embedding and all prompt embeddings: s_i = max_j cos(v_i, u_j). This allows for the generation of a pixel-level anomaly map, highlighting regions likely to contain defects. Further analysis focused on AnomalyCLIP, which employs object-agnostic prompt learning and a diagonally prominent attention map (DPAM) to enhance spatial localisation. The tests show that both WinCLIP and AnomalyCLIP demonstrate strong generalization across diverse anomaly types and product categories, but differ in complexity, computational cost, and detection accuracy.
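The scoring rule s_i = max_j cos(v_i, u_j) is straightforward to write down. Below is a minimal NumPy sketch; the 512-dimensional random vectors merely stand in for a CLIP window embedding and a set of anomaly-prompt text embeddings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def window_score(v_i, prompt_embs):
    """s_i = max_j cos(v_i, u_j): a window is as anomalous as its
    best match among the anomaly-prompt text embeddings."""
    return max(cosine(v_i, u) for u in prompt_embs)

rng = np.random.default_rng(0)
v = rng.normal(size=512)            # stand-in for one window embedding
U = rng.normal(size=(6, 512))       # stand-ins for six prompt embeddings
s = window_score(v, U)
print(s)
```

Taking the maximum (rather than the mean) over prompts makes the score sensitive to whichever phrasing best matches the defect, which is what makes the prompt ensemble useful.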

The work details how WinCLIP's sliding-window feature extraction, processing images into sets W = {w_1, w_2, …, w_N} where each w_i ∈ R^{h×w}, allows for independent contribution of each window to the anomaly map, excelling at localising small defects. Measurements confirm that the compositional prompt ensemble in WinCLIP provides robustness by utilising both normality prompts like "flawless [object]" and anomaly prompts such as "[object] with defect". The research synthesises practical insights for method selection, identifying current limitations and facilitating informed adoption of VLM-based methods in industrial quality control.
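Since each window contributes independently, the final map can be assembled by scattering per-window scores back onto the patch grid. The averaging of overlapping contributions below is a simplifying assumption (WinCLIP itself uses a harmonic aggregation), but it shows the mechanism.

```python
import numpy as np

def anomaly_map(scores, grid_shape):
    """Scatter per-window scores onto the patch grid and average the
    overlapping contributions. `scores` is a list of ((row, col, size),
    score) pairs like those produced by a sliding-window extractor."""
    H, W = grid_shape
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    for (i, j, k), s in scores:
        acc[i:i + k, j:j + k] += s   # spread the score over the window
        cnt[i:i + k, j:j + k] += 1   # count overlapping windows per cell
    return acc / np.maximum(cnt, 1)  # uncovered cells stay at zero

# Two overlapping 2x2 windows on a 3x3 grid, both fully anomalous.
scores = [((0, 0, 2), 1.0), ((1, 1, 2), 1.0)]
amap = anomaly_map(scores, (3, 3))
print(amap)
```

Cells covered by either window receive a score of 1.0, while the two corners no window touches remain 0.0, giving a coarse but spatially localised defect map.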

VLMs for Unsupervised Anomaly Detection: Promising Insights

However, WinCLIP retains advantages in truly zero-shot scenarios and provides explicit spatial reasoning for precise localisation, making it suitable for rapid prototyping with limited resources. The authors acknowledge that both methods struggle with flexible objects and complex assemblies, suggesting a need for further research in these areas. Future work could explore multi-scale feature fusion, temporal consistency for video analysis, domain adaptation to specialised applications, computational optimisation for real-time deployment, and explainability tools for human-in-the-loop quality control.

👉 More information
🗞 Analyzing VLM-Based Approaches for Anomaly Classification and Segmentation
🧠 ArXiv: https://arxiv.org/abs/2601.13440

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
