LLMs and RAG Enable 7% More Accurate Financial Question Answering with Domain Knowledge

Financial question answering, particularly tasks involving numerical reasoning, remains a significant challenge for even the most advanced artificial intelligence systems, largely because they lack specialised financial knowledge. Yukun Zhang, Stefan Elbl Droguett, and Samyak Jain, all from Stanford University, tackle this problem with a new approach that combines information retrieval with large language models. Their work introduces a multi-retriever system that accesses both relevant external financial data and the specific context of each question, then leverages a large language model to arrive at accurate answers. The team demonstrates that training the system with finance-specific data substantially improves performance, achieving state-of-the-art results and exceeding previous benchmarks by a considerable margin, while also shedding light on the trade-off between the benefits of external knowledge and the risk of factual errors in these complex systems.

Financial numerical reasoning question answering (QA) is difficult largely because models lack domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain hard: they require specific financial concepts and complex multi-step numeric calculations, and often demand integrating information from multiple sources, which current models frequently cannot do reliably. Consequently, performance on these tasks lags behind other QA domains, highlighting a critical gap in the capabilities of existing LLMs.

Multi-Retriever RAG for Financial Question Answering

The study addresses challenges in financial numerical reasoning question answering by engineering a multi-retriever retrieval-augmented generation (RAG) system. The system accesses both external financial knowledge and internal question context, then leverages a large language model to produce accurate answers. It incorporates two distinct retrieval mechanisms: an internal retriever, which uses fine-tuned BERT-family models trained as binary classifiers to identify relevant supporting facts within the question context, and an external retriever, which uses a Dense Passage Retrieval (DPR)-FAISS structure to extract definitions from a financial terminology dictionary. Sentences were ranked by classifier logit score, and the top five and top three were selected for use by the generator.
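The internal retrieval step can be sketched as follows. Note that `score_relevance` is a hypothetical stand-in for the fine-tuned BERT classifier's positive-class logit; naive word overlap is used purely for illustration of the rank-and-select logic:

```python
def score_relevance(question: str, sentence: str) -> float:
    """Placeholder for the fine-tuned classifier's relevance logit.
    A real system would score (question, sentence) pairs with a
    BERT-family binary classifier; word overlap stands in here."""
    q_words = set(question.lower().split())
    s_words = set(sentence.lower().split())
    return len(q_words & s_words) / (len(s_words) or 1)

def retrieve_internal(question, context_sentences, top_k=5):
    """Rank the question's context sentences by relevance score and
    keep the top_k as supporting facts for the generator."""
    scored = [(score_relevance(question, s), s) for s in context_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_k]]
```

Swapping `top_k` between 5 and 3 reproduces the two selection sizes described above.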

Two generator options were explored: a prompt-based LLM generator, leveraging Gemini Pro models via an API, and a symbolic neural generator. The symbolic neural generator formulates program steps using operation, constant, and step memory tokens, mirroring techniques from prior work, and generates each step sequentially using LSTM decoders and attention mechanisms. This innovative approach achieved a greater than 7% improvement over the previous state-of-the-art model on the FinQA benchmark, demonstrating the effectiveness of combining domain-specific knowledge retrieval with advanced language modeling.
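To make the program formulation concrete, here is a minimal interpreter for programs in this style, assuming a FinQA-like token vocabulary (operation tokens such as `subtract`, question-number tokens `n0, n1, …`, constant tokens such as `const_100`, and step-memory tokens `#0, #1, …`). The LSTM decoder that actually generates the steps is not reproduced:

```python
# Arithmetic operations the symbolic generator can emit.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def resolve(token, numbers, memory):
    """Map an argument token to its numeric value."""
    if token.startswith("#"):            # step-memory reference
        return memory[int(token[1:])]
    if token.startswith("const_"):       # constant token
        return float(token[len("const_"):])
    if token.startswith("n"):            # number copied from the question
        return numbers[int(token[1:])]
    raise ValueError(f"unknown token: {token}")

def execute(program, numbers):
    """Run a sequence of (op, arg1, arg2) steps; each result is
    appended to step memory, and the last result is the answer."""
    memory = []
    for op, a, b in program:
        memory.append(OPS[op](resolve(a, numbers, memory),
                              resolve(b, numbers, memory)))
    return memory[-1]
```

For example, a percentage-change question over the numbers 80 and 100 would be answered by the three-step program `subtract(n1, n0)`, `divide(#0, n0)`, `multiply(#1, const_100)`, yielding 25.0.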

Financial Reasoning with Multi-Retrieval Generation

The team achieved a significant advance in financial numerical reasoning, developing a multi-retriever Retrieval Augmented Generation (RAG) system to address the lack of domain knowledge that hampers these tasks. The system retrieves relevant information from both the internal question context and external financial knowledge sources, then uses a state-of-the-art LLM to generate answers. Experiments revealed that incorporating a SecBERT encoder, trained specifically on financial data, substantially improved the performance of the neural symbolic model. Detailed analysis highlighted a trade-off between reducing "hallucinations" and effectively utilizing external knowledge, with larger models better able to leverage external facts. The internal retriever, a BERT-family model fine-tuned as a binary classifier, extracts the most relevant supporting facts from the question context, selecting the top 5 and top 3 sentences by logit score. The external retriever, built on the DPR-FAISS structure, efficiently extracts relevant definitions from a financial terminology dictionary, retrieving the top 3 related definitions for each query.
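The external retrieval step amounts to nearest-neighbour search over embedded dictionary definitions. The sketch below substitutes a toy deterministic bag-of-words embedding and a brute-force similarity search for the DPR encoder and FAISS index, purely to show the shape of the pipeline; all names here are illustrative:

```python
def embed(text, dim=256):
    """Toy deterministic embedding: hash each word into a bucket and
    L2-normalize. A stand-in for the DPR dense encoder."""
    v = [0.0] * dim
    for word in text.lower().split():
        v[sum(ord(c) for c in word) % dim] += 1.0
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm else v

def retrieve_definitions(query, definitions, top_k=3):
    """Score every dictionary definition against the query by inner
    product (what a FAISS flat index would do) and keep the top_k."""
    q = embed(query)
    scored = []
    for definition in definitions:
        e = embed(definition)
        scored.append((sum(a * b for a, b in zip(q, e)), definition))
    scored.sort(key=lambda pair: -pair[0])
    return [definition for _, definition in scored[:top_k]]
```

In the paper's setup the definitions would be embedded once offline and indexed with FAISS, so each query is a single index lookup rather than a full scan.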

Financial Reasoning Enhanced by Multi-Retrieval Generation

This research presents a novel multi-retriever retrieval-augmented generation (RAG) system designed to improve performance on financial numerical reasoning question answering tasks, an area where large language models often struggle. By integrating both external domain knowledge and internal question context, the team developed a model that surpasses previous state-of-the-art results on the FinQA benchmark, achieving an improvement of over 7 percent. This success is largely attributed to domain-specific training using the SecBERT encoder, demonstrating the benefits of tailoring models to particular fields of expertise. Further investigation revealed a trade-off between the potential for models to generate inaccurate information, known as hallucinations, and the benefits of incorporating external knowledge, particularly in smaller models. However, the team found that for larger language models, the gains from accessing external facts generally outweigh the risk of hallucinations. Analysis of errors indicates that challenges remain in areas such as numerical unit conversions and accurate number retrieval, especially when questions require multiple steps to solve.

👉 More information
🗞 Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs
🧠 ArXiv: https://arxiv.org/abs/2512.23848

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
