Retrieval-Augmented Generation (RAG) systems allow large language models to draw upon diverse knowledge, including both text and images found in complex documents. Elias Lumer, Alex Cardenas, and Matt Melich, alongside colleagues at PricewaterhouseCoopers U.S., investigate how best to integrate visual information into these systems. Their research addresses a key limitation of current approaches, which convert images into text summaries before processing and can thereby lose crucial visual details. The team demonstrates that storing and retrieving images directly as multimodal embeddings significantly outperforms text-based summarisation, with substantial improvements in both retrieval accuracy and the factual consistency of generated answers. Preserving visual context in this way offers a pathway to more reliable and insightful information retrieval from complex sources.
Multimodal knowledge bases, containing both text and visual information such as charts, diagrams, and tables in financial documents, present significant challenges for information retrieval. Existing Retrieval-Augmented Generation (RAG) systems often rely on large language models to summarise images into text during preprocessing, subsequently storing only text representations in vector databases. This approach can cause loss of contextual information and visual details critical for accurate retrieval and question answering. To address this limitation, researchers conducted a comparative analysis of two retrieval approaches for multimodal RAG systems: text-based chunk retrieval, where images are summarised into text before embedding, and direct multimodal embedding retrieval, which stores images natively. This investigation aimed to determine which method more effectively preserves visual information and improves question answering accuracy in complex financial contexts.
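To make the contrast concrete, the sketch below indexes a page image two ways: once as an LLM-generated text summary that is then embedded, and once as a native image embedding in a shared text-image space. This is a minimal illustration under stated assumptions, not the authors' implementation: the CLIP checkpoint loaded via sentence-transformers, the summarise_image() helper, and the file path are stand-ins for whatever summarisation model and multimodal embedding model the paper actually used, and the same CLIP encoder is reused for the summary text purely to keep the example self-contained.

```python
# Minimal sketch of the two indexing strategies being compared.
# Model choice, helper, and file path are illustrative stand-ins only.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# A shared text-image embedding space (CLIP) lets images be stored natively.
clip = SentenceTransformer("clip-ViT-B-32")

def summarise_image(image_path: str) -> str:
    """Placeholder for an LLM that converts a chart/table image into prose."""
    return "Q3 revenue rose 12% year-over-year, driven by cloud services."

query = "How did cloud services affect Q3 revenue growth?"
chart = Image.open("earnings_chart.png")  # illustrative path

# Approach 1: text-based chunk retrieval -- embed a text summary of the image.
summary_emb = clip.encode(summarise_image("earnings_chart.png"))

# Approach 2: direct multimodal retrieval -- embed the image itself.
image_emb = clip.encode(chart)

# At query time, both representations are scored against the same query vector.
query_emb = clip.encode(query)
print("summary score:", util.cos_sim(query_emb, summary_emb).item())
print("image score:  ", util.cos_sim(query_emb, image_emb).item())
```

In the first approach, any detail the summariser omits is unrecoverable at query time; in the second, the retriever scores the original image directly, which is the property the paper's comparison isolates.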
Multimodal RAG Improves Financial Document Understanding
This research confirms that incorporating visual information alongside text significantly improves the performance of RAG systems, particularly for understanding complex financial documents. Large language models struggle to fully grasp these documents when their visual elements are ignored. The study demonstrates that embedding text and images directly into a shared vector space, so that the retriever can consider both modalities simultaneously, is an effective approach. Financial documents, with their complex layouts, tables, charts, and specialised terminology, present a particularly challenging test case, highlighting the need for RAG systems specifically tailored to these complexities.
The findings suggest that multimodal RAG can significantly improve the accuracy and reliability of large language models across a wide range of applications, and that tailoring RAG systems to specific domains, such as finance, medicine, or law, can improve performance further. In short, incorporating visual information into RAG systems is essential for achieving state-of-the-art performance on complex document understanding tasks.
Multimodal Retrieval Surpasses Text for Finance Insights
Recent work demonstrates a significant breakthrough in Retrieval-Augmented Generation (RAG) systems, enabling large language models to effectively access multimodal knowledge bases containing both text and visual information, such as the charts and diagrams found in financial documents. Researchers compared two retrieval approaches: text-based chunk retrieval, which summarises images into text before embedding, and direct multimodal embedding retrieval, which stores images natively in the vector space. Evaluations were performed on a newly created financial earnings call benchmark of 40 question-answer pairs, each paired with two ground-truth sources: one image and one text chunk. Experimental results show that direct multimodal embedding retrieval significantly outperforms the text-based approach, achieving a 13% absolute improvement in mean average precision and an 11% absolute improvement in normalised discounted cumulative gain. These gains correspond to relative improvements of 32% in mean average precision and roughly 20% in normalised discounted cumulative gain, indicating substantially better ranking and relevance when visual information is preserved in its original form. The study confirms that summarisation by a large language model introduces information loss during preprocessing, while direct multimodal embeddings preserve crucial visual context for both retrieval and inference.
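For readers unfamiliar with the two ranking metrics cited above, the short sketch below computes mean average precision and nDCG for a single toy question. The relevance judgments are invented for illustration; they simply mirror the benchmark's structure of one relevant image and one relevant text chunk per question, and are not data from the paper.

```python
import math

def average_precision(ranked_ids, relevant_ids):
    """AP: mean of the precision values at the ranks where relevant items appear."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def ndcg(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k with log2 rank discounting."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant_ids)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0

# Toy question: one relevant image and one relevant text chunk, as in the benchmark.
relevant = {"img_q1", "txt_q1"}
ranking = ["img_q1", "other_1", "txt_q1", "other_2"]  # hypothetical retriever output
print("AP:  ", average_precision(ranking, relevant))  # (1/1 + 2/3) / 2 ≈ 0.833
print("nDCG:", ndcg(ranking, relevant))               # 1.5 / 1.631 ≈ 0.92
```

Mean average precision is then the average of these per-question AP values across all 40 benchmark questions, so the reported 13-point absolute gain reflects relevant images and chunks being ranked consistently closer to the top.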
Direct Visual Embedding Boosts Retrieval Performance
Recent research demonstrates a significant advancement in multimodal Retrieval-Augmented Generation (RAG) systems, which combine large language models with information retrieved from diverse sources. Scientists directly compared two approaches for incorporating visual data, such as charts and tables, into these systems. One method converts images into text summaries before retrieval, while the other stores and retrieves images directly as visual embeddings. Evaluations using a newly created financial earnings benchmark reveal that directly embedding and retrieving images substantially outperforms converting them to text, achieving a 32% relative improvement in retrieval accuracy. These findings indicate that preserving visual information in its original format, rather than summarising it as text, enables more accurate information retrieval and improves the quality of generated responses. As both embedding models and vision-language models continue to improve, the advantages of direct multimodal retrieval are expected to become even more pronounced.
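At answer-generation time, preserving the original image means a retrieved chart can be handed straight to a vision-language model instead of a lossy text summary. The snippet below is a minimal sketch of that final step using the OpenAI chat completions API as a stand-in for whatever vision-language model the authors used; the model name and file path are illustrative assumptions, not details from the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_image(question: str, image_path: str) -> str:
    """Pass the retrieved chart image directly to a vision-language model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice, not from the paper
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# e.g. answer_with_image("What drove the margin change in Q3?", "earnings_chart.png")
```

Because the model sees the chart itself, grounding and factual consistency depend on the original visual evidence rather than on whatever a preprocessing summary happened to keep.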
👉 More information
🗞 Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems
🧠 ArXiv: https://arxiv.org/abs/2511.16654
