Composed Image Retrieval (CIR) presents a significant challenge to conventional methods, demanding a shift towards agentic reasoning that goes beyond simple similarity searches. Zhongyu Yang, Wei Pang, and Yingfang Yuan, all from Heriot-Watt University, alongside their colleagues, tackle this problem with a novel, training-free framework called XR. This research introduces a multi-agent system that reimagines retrieval as a coordinated reasoning process, employing imagination, similarity, and question agents to progressively refine results and satisfy both semantic and visual requirements. By synthesising target representations, filtering with hybrid matching, and verifying factual consistency, XR demonstrably outperforms existing approaches, achieving up to a 38% improvement on benchmark datasets such as FashionIQ, CIRR, and CIRCO, and highlights the crucial role of each agent type.
While embedding-based CIR methods have made strides, they often fall short by capturing limited cross-modal cues and lacking robust semantic reasoning. To overcome these limitations, researchers introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process.
This innovative framework orchestrates three specialized agent types: imagination agents synthesize target representations via cross-modal generation, similarity agents perform coarse filtering using hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to satisfy both semantic and visual query constraints, achieving up to a 38% performance gain over strong training-free and training-based baselines on the FashionIQ, CIRR, and CIRCO datasets. Detailed ablation studies confirm that each agent plays an indispensable role in the overall system’s success. The study unveils a novel approach to CIR, moving beyond simple content matching towards retrieving images that accurately preserve reference semantics while faithfully applying specified edits.
As illustrated in the research, existing methods fall into three categories: joint embedding, caption-to-image, and caption-to-caption, each with inherent limitations in capturing fine-grained correspondences or fully leveraging cross-modal evidence. XR addresses these shortcomings by explicitly exploiting cross-modal interactions, providing robust retrieval under heterogeneous signals and more reliable alignment with user intent. Specifically, XR’s imagination stage constructs a target proxy by generating captions from cross-modal pairings, reducing modality gaps and anchoring target semantics. Coarse filtering employs similarity-based agents to evaluate candidates with multi-perspective scores, conditioned on cross-modal captions, while reciprocal rank fusion aggregates these scores into an initial ranked subset. Finally, question-based agents re-evaluate this subset through cross-modal factual verification, mimicking human validation processes and integrating verification scores with similarity scores for a refined final retrieval set. This design preserves diverse evidence sources, offering potential benefits for applications like personalized e-commerce search and multimodal recommendation.
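To make this pipeline concrete, the following minimal Python sketch shows how the three stages could be coordinated; every function name here is an illustrative placeholder rather than the authors' actual API, with the stage implementations passed in as callables.

```python
# Illustrative sketch of XR's three-stage coordination. The stage functions
# are injected as callables because the paper's actual implementations
# (models, prompts, thresholds) are not reproduced here.
def xr_retrieve(imagine, coarse_filter, fine_filter,
                reference_image, modification_text, gallery, top_k=50):
    # Imagination stage: synthesise a caption-based proxy of the target image.
    target_proxy = imagine(reference_image, modification_text)

    # Coarse filtering: similarity agents score the gallery against the proxy
    # and the reference image; fused scores yield an initial ranked subset.
    shortlist = coarse_filter(target_proxy, reference_image, gallery)[:top_k]

    # Fine filtering: question agents verify factual consistency and re-rank.
    return fine_filter(shortlist, reference_image, modification_text)
```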
XR framework utilising agentic image retrieval
Scientists developed XR, a training-free multi-agent framework that redefines retrieval through agentic AI and progressive reasoning. This work addresses limitations in composed image retrieval (CIR), where queries combine reference images with textual modifications, demanding compositional understanding across modalities. The team engineered a system that reframes retrieval as a coordinated process orchestrated by three specialised agent types: imagination, similarity, and question agents. Initially, imagination agents synthesise target representations via cross-modal generation, effectively predicting what the modified image should look like based on the text prompt and the original image.
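A minimal sketch of this imagination step is shown below, assuming a generic vision-language model exposed as a callable `vlm(image, prompt) -> str`; the model choice and prompt wording are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the imagination stage, assuming a generic vision-language model
# exposed as `vlm(image, prompt) -> str`; the model and prompt wording are
# placeholders, not the authors' exact configuration.
def imagine_target_caption(vlm, reference_image, reference_caption, modification_text):
    """Synthesise a caption describing what the modified target image should show."""
    prompt = (
        f"A reference image is described as: {reference_caption}\n"
        f"Requested modification: {modification_text}\n"
        "Describe the resulting target image in one concise sentence."
    )
    return vlm(reference_image, prompt)
```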
Subsequently, similarity agents perform coarse filtering using hybrid matching, rapidly narrowing the search space to potentially relevant images. This hybrid matching leverages both visual and textual cues to identify candidates that broadly align with the query. Following this initial filtering, question agents verify factual consistency through targeted reasoning, providing a fine-grained assessment of how well each candidate image satisfies the textual modifications. These agents pose targeted verification questions to ensure the retrieved images accurately reflect the user’s intended edits, improving precision and relevance.
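The sketch below illustrates one plausible form of such hybrid matching over precomputed CLIP-style embeddings; the blend weight `alpha` and the exact cue combination are assumptions for illustration, not the paper's reported scoring rule.

```python
import numpy as np

def hybrid_score(candidate_emb, proxy_text_emb, reference_emb, alpha=0.5):
    """Blend textual and visual evidence into one coarse-filtering score.

    candidate_emb:  image embedding of a gallery candidate
    proxy_text_emb: text embedding of the imagined target caption
    reference_emb:  image embedding of the reference image
    alpha:          illustrative weight between textual and visual cues
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Textual cue: does the candidate match the imagined target caption?
    text_cue = cosine(candidate_emb, proxy_text_emb)
    # Visual cue: does the candidate stay close to the reference image?
    visual_cue = cosine(candidate_emb, reference_emb)
    return alpha * text_cue + (1 - alpha) * visual_cue
```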
The study pioneered a progressive retrieval process, beginning with the imagination stage and culminating in coarse-to-fine filtering. Experiments employed the FashionIQ, CIRR, and CIRCO datasets to rigorously evaluate XR’s performance against strong baselines, both training-free and training-based. Results demonstrate a significant performance gain of up to 38% across these benchmarks, highlighting the effectiveness of the multi-agent coordination strategy. Ablation studies confirmed the essential role of each agent, demonstrating that the combined functionality is crucial for achieving optimal results.
This innovative methodology enables robust reasoning that better aligns retrieval results with user intent, surpassing the limitations of existing joint embedding, caption-to-image, and caption-to-caption approaches. The system delivers improved performance by fully exploiting cross-modal interactions and iteratively refining the retrieval process to meet both semantic and visual query constraints. Code is publicly available to facilitate further research and development in this rapidly evolving field.
XR agents refine image retrieval progressively, improving results
Scientists have developed XR, a training-free multi-agent framework redefining composed image retrieval (CIR) as a progressively coordinated reasoning process. The research addresses limitations in existing embedding-based CIR methods, which often struggle with limited cross-modal cues and a lack of semantic reasoning. XR orchestrates three specialised agents (imagination, similarity, and question) to iteratively refine retrieval and meet both semantic and visual query constraints. Imagination agents first synthesise a caption-based proxy of the target image; similarity agents then perform coarse filtering via hybrid matching, utilising both visual and textual cues conditioned on cross-modal captions to produce multi-perspective scores. Reciprocal Rank Fusion (RRF) aggregates these scores, creating an initial ranked subset for further refinement.
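RRF itself is a standard rank-aggregation technique; a minimal implementation follows, where the smoothing constant `k=60` is the conventional default rather than a value reported in the paper.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of candidate IDs into one ranking.

    rankings: list of ranked candidate-ID lists, one per similarity agent.
    k:        smoothing constant; 60 is the conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, cand in enumerate(ranking, start=1):
            scores[cand] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: three agents rank four candidates differently; img_a, which
# ranks highly across all agents, surfaces first in the fused ranking.
fused = reciprocal_rank_fusion([
    ["img_a", "img_b", "img_c", "img_d"],
    ["img_b", "img_a", "img_d", "img_c"],
    ["img_a", "img_c", "img_b", "img_d"],
])
```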
Data shows that the question agents then re-evaluate this subset through cross-modal factual verification, employing predicate-style queries to mimic human validation of retrieval consistency. These agents test candidate images and captions, ensuring factual alignment with the query. Measurements confirm that integrating verification scores with similarity scores through re-ranking produces a final retrieval set benefiting from both efficient high-level retrieval and accurate factual validation. Ablation studies demonstrated that each agent plays an essential role in the overall performance of the system, highlighting the importance of coordinated multi-agent reasoning.
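One plausible way to integrate the two signals, sketched below, is a weighted combination of min-max-normalised scores; the fusion weight `beta` and the normalisation scheme are illustrative assumptions, as the paper's exact fusion rule is not reproduced here.

```python
def rerank(candidates, similarity_scores, verification_scores, beta=0.5):
    """Re-rank candidates by fusing similarity and verification evidence.

    candidates:          iterable of candidate IDs in the coarse subset
    similarity_scores:   {candidate: score} from the similarity agents
    verification_scores: {candidate: score} from the question agents
                         (e.g. fraction of predicate questions answered 'yes')
    beta:                illustrative fusion weight
    """
    def minmax(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {c: (v - lo) / (hi - lo + 1e-8) for c, v in scores.items()}

    sim, ver = minmax(similarity_scores), minmax(verification_scores)
    fused = {c: beta * sim[c] + (1 - beta) * ver[c] for c in candidates}
    return sorted(candidates, key=fused.get, reverse=True)
```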
Tests show that XR’s sequential modules (imagination, coarse filtering, and fine filtering) work synergistically to preserve diverse sources of evidence often overlooked by single-score pipelines. The framework constructs a target proxy by generating captions from cross-modal pairings, specifically pairing the modification text with both the reference image caption and the reference image itself. The authors acknowledge a trade-off between performance gains and computational cost. The findings demonstrate that cross-modal reasoning, achieved through the coordinated action of multiple agents, is crucial for effective composed image retrieval. This approach surpasses unimodal pipelines by integrating semantic alignment with factual verification, aligning retrieval results more closely with user intent. Future work could explore XR as a foundation for retrieval-augmented reasoning, enabling agentic systems to interpret and adapt across modalities for more reliable and human-aligned intelligence.
👉 More information
🗞 XR: Cross-Modal Agents for Composed Image Retrieval
🧠 ArXiv: https://arxiv.org/abs/2601.14245
