Researchers are tackling the persistent challenge of accurately retrieving fashion items from images, a crucial component of modern e-commerce. Chao Gao, Siqiao Xue, and Yimin Peng from Gensmo.ai, alongside Jiwen Fu, Tingyi Gu, Shanshan Li et al., introduce LookBench, a novel, continuously updated benchmark designed to reflect real-world fashion shopping scenarios. Unlike static datasets, LookBench combines current product images scraped from live websites with synthetically generated fashion imagery, offering a holistic and challenging test for image retrieval models. This dynamic approach, complete with time-stamped data and declared training cutoffs, enables contamination-aware evaluation and provides a durable measure of progress in the field, exposing significant shortcomings in existing models and pointing the way toward genuinely improved performance.
LookBench comprises four distinct evaluation subsets: RealStudioFlat, AIGen-Studio, RealStreetLook, and AIGen-StreetLook, each representing a different level of difficulty and a distinct retrieval intent. RealStudioFlat focuses on clean, flat-lay product images for single-item retrieval, while AIGen-Studio extends this to AI-generated lifestyle studio images, increasing the complexity.
RealStreetLook and AIGen-StreetLook present the most challenging scenarios, involving real-world and AI-generated street-style outfits, demanding multi-item retrieval capabilities. Furthermore, the study introduces a comprehensive fashion attribute taxonomy, encompassing over 100 visually grounded properties, and leverages LLM-based annotation to provide reliable weak supervision for region-aware evaluation. Analysis of various vision-language and vision-only models reveals that generic open VLMs often underperform on LookBench, particularly with complex street-style outfits, while fashion-specific fine-tuning improves results but leaves room for further progress.
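To make the annotation step concrete, here is a minimal sketch of LLM-based attribute pre-annotation. The taxonomy excerpt, prompt wording, and the query_vlm() helper are hypothetical placeholders standing in for the paper's actual taxonomy and model calls, not its implementation.

```python
import json

# Tiny excerpt standing in for the paper's 100+ visually grounded properties.
TAXONOMY = {
    "dress": ["neckline", "sleeve_length", "pattern", "fit", "hem_length"],
    "sneaker": ["color", "sole_type", "lacing", "material"],
}

def build_prompt(category: str) -> str:
    attrs = ", ".join(TAXONOMY[category])
    return (
        f"You are labeling a cropped image of a {category}. "
        f"For each attribute ({attrs}), return a JSON object mapping the "
        f"attribute to a single visually grounded value, or null if the "
        f"attribute is not visible."
    )

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a vision-language model API."""
    raise NotImplementedError("wire up your VLM client here")

def annotate(image_path: str, category: str) -> dict:
    raw = query_vlm(image_path, build_prompt(category))
    labels = json.loads(raw)
    # Keep only taxonomy-sanctioned keys, so malformed model output
    # degrades to missing labels rather than wrong ones (weak supervision).
    return {k: v for k, v in labels.items() if k in TAXONOMY[category] and v}
```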
Live Fashion Image Retrieval Benchmark Construction
The study employed a unified, category-attribute-driven pipeline to construct four distinct retrieval subsets: RealStudioFlat, RealStreetLook, AIGen-Studio, and AIGen-StreetLook. Researchers began by sampling (category, attribute, year) tuples and instantiating structured templates to generate web image search queries or image-generation prompts, ensuring a diverse and representative dataset. For the real-image subsets, commercial image search engines were utilised to retrieve time-stamped studio packshots and street-style photos, which underwent rigorous de-duplication and filtering to remove low-resolution images, watermarks, and irrelevant content. The team then harnessed YOLOv11, a state-of-the-art object detection model, to localise fashion items within each retained image and obtain category-labeled crops, serving as visual queries.
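To illustrate the query-generation step, here is a minimal sketch of sampling (category, attribute, year) tuples and instantiating templates. The category/attribute pools and template strings are illustrative stand-ins; the benchmark's real taxonomy and templates are far larger and are not reproduced here.

```python
import random

# Illustrative pools; the real pipeline samples from a much larger taxonomy.
CATEGORIES = {
    "trench coat": ["double-breasted", "belted", "khaki"],
    "pleated skirt": ["midi", "metallic", "high-waisted"],
}
YEARS = [2023, 2024, 2025]

SEARCH_TEMPLATE = "{attribute} {category} {year} studio product photo"
GEN_TEMPLATE = ("A street-style photo of a person wearing a "
                "{attribute} {category}, shot in {year}")

def sample_query(for_generation: bool = False) -> str:
    category = random.choice(list(CATEGORIES))
    attribute = random.choice(CATEGORIES[category])
    year = random.choice(YEARS)
    template = GEN_TEMPLATE if for_generation else SEARCH_TEMPLATE
    return template.format(attribute=attribute, category=category, year=year)

print(sample_query())      # e.g. "belted trench coat 2024 studio product photo"
print(sample_query(True))  # image-generation prompt variant
```

The crop-extraction step could look roughly like the following, using the public Ultralytics YOLO API. The checkpoint name is a placeholder for whatever fashion-tuned YOLOv11 weights the pipeline actually uses, and the confidence threshold is an assumption.

```python
from ultralytics import YOLO
from PIL import Image

model = YOLO("yolo11-fashion.pt")  # placeholder checkpoint name

def extract_crops(image_path: str, min_conf: float = 0.5) -> list:
    """Return (category_name, cropped PIL image) pairs for detected items."""
    image = Image.open(image_path)
    result = model(image_path)[0]
    crops = []
    for box in result.boxes:
        if float(box.conf) < min_conf:
            continue
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        label = result.names[int(box.cls)]
        crops.append((label, image.crop((x1, y1, x2, y2))))
    return crops
```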
Scientists enriched these query crops with fine-grained attributes using a pre-annotation pipeline, enabling precise and nuanced retrieval evaluations. To construct candidate galleries for each query, the research team retained only images sharing the same category and main attribute, ranking them by the number of additional attributes they share with the query image. The top-k ranked results were then designated as positive matches, forming a robust and challenging evaluation set. This meticulous process ensures that LookBench accurately reflects the complexities of real-world fashion search, demanding sophisticated retrieval capabilities from participating models.
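The gallery-construction and positive-designation logic reduces to a simple filter-and-rank, sketched minimally below. The field names and the top_k value are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    image_id: str
    category: str
    main_attr: str
    extra_attrs: set = field(default_factory=set)

def build_gallery(query: Item, pool: list, top_k: int = 5):
    # Hard constraint: candidates must share category and main attribute.
    candidates = [c for c in pool
                  if c.category == query.category
                  and c.main_attr == query.main_attr
                  and c.image_id != query.image_id]
    # Rank by the number of additional attributes shared with the query.
    candidates.sort(key=lambda c: len(c.extra_attrs & query.extra_attrs),
                    reverse=True)
    positives = candidates[:top_k]  # designated as ground-truth matches
    return positives, candidates
```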
👉 More information
🗞 LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval
🧠 ArXiv: https://arxiv.org/abs/2601.14706
