Businesses are increasingly dependent on internal data, such as reports and records, to inform crucial decisions, but large language models often struggle to access and interpret this information effectively. Chandana Cheerla from IIT Roorkee, alongside co-authors, addresses this challenge with a new framework for Retrieval-Augmented Generation, designed to unlock the potential of structured and semi-structured enterprise data. The team’s approach combines advanced retrieval techniques, including semantic chunking and metadata filtering, with efficient indexing to deliver more accurate and comprehensive responses. Experiments on real-world enterprise datasets demonstrate significant improvements in both precision and recall, alongside notably higher scores for faithfulness, completeness, and relevance, suggesting this framework represents a substantial step forward in building intelligent systems capable of leveraging the wealth of information within organisations.
LLMs Adapt to Dynamic and Structured Data
Large Language Models (LLMs) now demonstrate significant capabilities in understanding and generating natural language, excelling at tasks like question answering, summarisation, and knowledge retrieval. However, these models are limited by static pretraining and restricted context windows, hindering their adaptability to dynamic or proprietary enterprise data. Critical information in domains like corporate governance, human resources, and finance is often found in structured records and tabular formats that LLMs cannot naturally process. Retrieval-Augmented Generation (RAG) frameworks address this limitation by integrating retrieval mechanisms with LLMs, enhancing response relevance and accuracy by grounding generation in up-to-date, domain-specific data.
Handling Mixed Data with Contextual Integrity
Retrieval methods optimised for unstructured text struggle with enterprise datasets that mix structured, semi-structured, and tabular information. Standard chunking strategies break up meaningful context, particularly in complex documents, leaving a fragmented representation of the source. Tabular data is handled poorly because flattening tables into linear text destroys essential row-column relationships. Incomplete retrieval and the lack of relevance reranking further restrict a system's ability to balance semantic understanding with exact matching.
Enhanced Retrieval Improves Knowledge System Performance
The results show substantial improvements over baseline RAG systems: Precision@5 rose to 90%, Recall@5 to 87%, and Mean Reciprocal Rank (MRR) to 0.85. These results underscore the robustness and applicability of the framework in real-world enterprise contexts, delivering superior retrieval accuracy, comprehensive responses, and higher contextual relevance. The researchers envision extending the framework towards agentic RAG systems, in which intelligent agents autonomously select retrieval strategies, adaptively reformulate queries, and integrate multimodal data sources.
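For readers unfamiliar with these metrics, the sketch below shows how Precision@k, Recall@k, and MRR are commonly computed from a ranked list of retrieved document IDs and a set of relevant IDs. It uses the standard textbook definitions; the article does not say exactly how the authors computed their scores, and the function names and toy document IDs are illustrative assumptions.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked list, relevant set) pairs."""
    values = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(values) / len(values) if values else 0.0


# Toy query: the IDs are made up for illustration only.
ranked_ids = ["d3", "d1", "d7", "d2", "d9"]
relevant_ids = {"d1", "d2", "d4"}
print(precision_at_k(ranked_ids, relevant_ids))
print(recall_at_k(ranked_ids, relevant_ids))
print(reciprocal_rank(ranked_ids, relevant_ids))
```

On this reading, an MRR of 0.85 means that the first relevant result typically appears at or very near the top of the ranked list.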
Robust Enterprise Knowledge Augmentation with RAG
Experimental results demonstrate that the proposed advanced RAG framework significantly outperforms both naive RAG and direct LLM prompting approaches. The combination of hybrid retrieval, semantic and structure-aware chunking, cross-encoder reranking, and dynamic query refinement enables effective handling of heterogeneous enterprise data, including complex tabular formats. Consistent improvements across Precision@5, Recall@5, and MRR, alongside higher human ratings for faithfulness, completeness, and relevance, affirm the robustness of the pipeline for real-world enterprise knowledge augmentation tasks.
The framework works effectively with the wide variety of data formats commonly found in enterprises, including unstructured text, structured documents, and tabular data, which makes it practical for real-world scenarios. The hybrid retrieval approach combines dense embeddings with sparse keyword-based methods, balancing semantic understanding against exact keyword matching so that the system retrieves information that is both contextually relevant and factually precise. An additional layer of cross-encoder reranking further refines the results, prioritising the most relevant content.
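One common way to merge a dense ranking with a sparse ranking is Reciprocal Rank Fusion (RRF), sketched below. The article does not say which fusion method the authors use, so RRF is a stand-in here; the function name, the constant `k=60`, and the document IDs are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) in every list that contains it;
    documents ranked highly by several retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a dense (embedding) and a sparse (keyword) retriever.
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

In a pipeline like the one described, the fused candidates would then be passed to a cross-encoder reranker, which scores each query-document pair jointly; that step is omitted here because it needs a trained model.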
Another strength of the framework is its approach to tabular data. By implementing table-aware chunking and indexing each row individually, the system achieves a level of granularity that allows it to answer row-specific queries more effectively than standard text chunking. Additionally, the system includes dynamic query optimisation through LLM-based rewriting and expansion, enabling it to refine ambiguous or incomplete queries. For generation, a grounded prompting strategy ensures that LLM responses remain anchored in the retrieved evidence, with citations and summaries provided where necessary, enhancing credibility and mitigating hallucinations.
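The row-level indexing idea can be sketched as follows: each table row becomes its own chunk, with the column headers repeated so that the row-column relationships survive. This is a minimal illustration, not the authors' implementation; the chunk layout, the metadata fields, and the sample table are assumptions.

```python
import csv
import io
from typing import Dict, List


def table_to_row_chunks(csv_text: str, table_name: str) -> List[Dict]:
    """Turn each row of a CSV table into one self-describing chunk.

    Pairing every value with its column header keeps the row readable
    on its own, unlike naive flattening of the whole table into text.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    for row_index, row in enumerate(reader):
        pairs = "; ".join(f"{column}: {value}" for column, value in row.items())
        chunks.append(
            {
                "text": f"{table_name} | {pairs}",
                # Metadata lets a retriever filter by table or cite a row.
                "metadata": {"table": table_name, "row": row_index},
            }
        )
    return chunks


# Hypothetical HR table used purely for illustration.
sample = "employee,department,leave_days\nAsha,Finance,12\nRavi,HR,8\n"
for chunk in table_to_row_chunks(sample, "leave_policy"):
    print(chunk["text"])
```

A question about one employee's leave allowance can then match a single row chunk directly, rather than a text fragment that happens to contain part of the table.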
The framework has several limitations. It relies on static indexing, so any data update requires a full reindex; a more dynamic, incremental indexing mechanism is needed for responsiveness. Highly complex or nested tables also remain a challenge, because relationships in deeply structured tables are hard to preserve and some context can be lost. Finally, the current feedback mechanism depends on explicit user input, which limits the system's capacity to learn and adapt automatically. Incorporating passive signals, such as how users interact with retrieved information, could help the system improve over time without direct input.
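To make the indexing limitation concrete, the sketch below shows one simple way an incremental update could be planned: compare content hashes of the current documents against those recorded at the last indexing run, and re-embed only what changed. This is a hypothetical illustration of the general idea, not something the authors describe; the function name and data layout are assumptions.

```python
import hashlib
from typing import Dict, List, Tuple


def plan_index_update(
    indexed_hashes: Dict[str, str], current_docs: Dict[str, str]
) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Decide which documents to re-index and which to drop.

    indexed_hashes maps doc ID -> SHA-256 of the text at the last indexing run.
    current_docs maps doc ID -> current text.
    Returns (IDs to upsert, IDs to delete, updated hash map).
    """
    current_hashes = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in current_docs.items()
    }
    # New or modified documents need fresh chunks and embeddings.
    to_upsert = sorted(
        doc_id
        for doc_id, digest in current_hashes.items()
        if indexed_hashes.get(doc_id) != digest
    )
    # Documents that no longer exist should leave the index.
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_hashes)
    return to_upsert, to_delete, current_hashes


# Hypothetical document store: one policy edited, one retired, one added.
old_docs = {"policy": "v1", "handbook": "same", "memo": "old"}
_, _, old_hashes = plan_index_update({}, old_docs)
new_docs = {"policy": "v2", "handbook": "same", "report": "new"}
upsert, delete, _ = plan_index_update(old_hashes, new_docs)
print(upsert, delete)
```

Only the changed and new documents would be re-chunked and re-embedded under this scheme, while unchanged ones keep their existing index entries.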
Future work will focus on implementing dynamic indexing capabilities, integrating advanced table understanding models, and exploring methods for leveraging implicit feedback signals. Researchers also plan to investigate intent-aware or interactive clarification methods for query handling and to extend the system to support multimodal data, broadening its applicability across various enterprise contexts.
👉 More information
🗞 Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data
🧠 DOI: https://doi.org/10.48550/arXiv.2507.12425
