A new multi-modal data analytics system improves accuracy and efficiency when analysing diverse data lake content. Researchers developed an architecture utilising the Model Context Protocol (MCP) to facilitate collaboration between large language models (LLMs) and specialised agents. An AI-powered translator converts user requests into precise analytical operations, executed by foundation models optimised for specific data types. A data updating mechanism, employing machine unlearning, balances data freshness with computational cost, addressing limitations in current LLM-based analytics systems.
The increasing volume and diversity of data stored in data lakes – encompassing structured databases, semi-structured documents, and unstructured text and images – pose considerable analytical challenges. Current approaches utilising large language models (LLMs) often struggle with both the precision of interpreting user queries and the computational cost of processing multiple data types. Researchers at Renmin University of China – Chao Zhang, Shaolei Zhang, Quehuan Liu, Sibei Chen, Tong Li, and Ju Fan – detail a new system, ‘TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes’, which addresses these limitations through a collaborative architecture. Their work proposes leveraging the Model Context Protocol (MCP) to distribute analytical tasks across specialised foundation models, improving both accuracy and efficiency while incorporating mechanisms to maintain data freshness and relevance.
Advanced Data Analytics with Modular Large Language Models
Modern data analytics increasingly encounters challenges integrating diverse data types within complex data lake environments. Traditional methods and initial applications of Large Language Models (LLMs) often struggle with the heterogeneity and scale of these systems. TAIJI presents a functional system designed to address these limitations, integrating LLMs with varied data sources and employing the Model Context Protocol (MCP) to facilitate modularity and scalability.
Data lakes, repositories storing data in its native format, commonly contain structured (e.g., relational databases), semi-structured (e.g., JSON, XML), and unstructured data (e.g., text, images, video). Analysing such diverse data requires systems capable of accurately interpreting user requests, efficiently processing different data types, and maintaining data currency. TAIJI’s architecture prioritises specialisation: individual foundation models – large models pre-trained on vast datasets – are each optimised for a specific data modality. This contrasts with relying on a single, generalised LLM, improving both accuracy and efficiency.
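The specialisation idea can be pictured as a simple dispatch step: each modality is served by its own model, and a sub-query is routed to the matching one. The sketch below is purely illustrative, with invented handler names, and is not TAIJI's actual interface.

```python
# Minimal sketch of modality-based dispatch: each specialised foundation
# model is represented here by a handler keyed on the data type it serves.
# All names are illustrative assumptions, not TAIJI's real API.
from typing import Callable, Dict

# Hypothetical registry mapping a modality to its specialised model.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "structured":      lambda q: f"[SQL model] {q}",
    "semi_structured": lambda q: f"[JSON/XML model] {q}",
    "unstructured":    lambda q: f"[text/image model] {q}",
}

def dispatch(modality: str, query: str) -> str:
    """Route a sub-query to the foundation model for its modality."""
    handler = HANDLERS.get(modality)
    if handler is None:
        raise ValueError(f"no specialised model for modality: {modality}")
    return handler(query)
```

The design point is that a generalised model is never asked to do everything; accuracy comes from matching each sub-task to the model best suited to its data type.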
The system leverages the MCP to dynamically construct analytical pipelines. This protocol enables the selection and integration of specialised foundation models and external knowledge sources based on the specific analytical task. At its core is an AI-agent-powered translator that converts natural language queries into a semantic operator hierarchy. This moves beyond simple Natural Language to SQL conversions, capturing a more nuanced and comprehensive understanding of user analytical intent.
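A semantic operator hierarchy can be thought of as a typed tree of analytical steps rather than a single flat SQL string. The following is a small sketch under that reading; the operator names and tree shape are assumptions for illustration, not the paper's actual operator set.

```python
# Illustrative sketch of a semantic operator hierarchy: a natural-language
# request becomes a tree of typed operators, each tagged with the data
# modality it targets. Operator names here are invented for illustration.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Operator:
    name: str                      # e.g. "scan", "filter", "summarise"
    modality: str                  # data type the operator targets
    children: List["Operator"] = field(default_factory=list)

def flatten(op: Operator) -> List[str]:
    """Post-order walk: children run before parents, giving execution order."""
    order: List[str] = []
    for child in op.children:
        order.extend(flatten(child))
    order.append(f"{op.name}:{op.modality}")
    return order

# A hypothetical plan for "summarise complaints mentioning late delivery":
plan = Operator("summarise", "text", [
    Operator("filter", "text", [Operator("scan", "unstructured")]),
])
```

Flattening the tree yields `scan`, then `filter`, then `summarise`: the structure preserves intent (summarise a filtered subset) that a direct NL-to-SQL translation would lose for non-relational data.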
TAIJI actively manages data freshness by incorporating mechanisms to update both data lakes and LLM knowledge. This is achieved through research into machine unlearning techniques – methods for selectively removing information from a model – balancing the need for current information with inference efficiency. The system’s modular architecture supports high scalability, allowing for flexible integration of new tools and data sources.
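The freshness-versus-cost balance can be caricatured as a policy decision: stale knowledge is either left alone, selectively unlearned, or overridden at query time with retrieved facts. The toy function below sketches that trade-off; the thresholds and strategy names are invented and do not come from the paper.

```python
# A toy sketch of the freshness/cost trade-off: decide whether outdated
# model knowledge is handled by selective unlearning or deferred to a
# cheaper retrieval-time override. Thresholds here are invented.
def update_strategy(staleness: float, unlearn_cost: float,
                    budget: float) -> str:
    """Pick how to reconcile outdated model knowledge with new data.

    staleness     -- fraction of affected knowledge that is out of date
    unlearn_cost  -- estimated compute cost of selective unlearning
    budget        -- compute budget available for the update
    """
    if staleness < 0.1:
        return "keep"               # knowledge still fresh enough
    if unlearn_cost <= budget:
        return "unlearn"            # selectively remove stale facts
    return "retrieval_patch"        # override at query time instead
```

The point of the sketch is only that unlearning is one option among several, chosen when its cost fits the budget, which is the balance the paper attributes to its updating mechanism.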
A key feature of TAIJI is its iterative query planning and refinement process. The system translates natural language queries into executable plans and subsequently refines these plans based on initial results, demonstrably improving analytical outcomes. The AI-agent-powered translator effectively bridges the gap between user intent and analytical execution, utilising a defined semantic operator hierarchy tailored for multi-modal data. Demonstrated functionality includes accessing and processing data from multiple modalities within data lake environments. The system also mitigates issues arising from incomplete or outdated data by incorporating external, open-domain knowledge, yielding more timely and relevant analytical results. By prioritising both accuracy and scalability, TAIJI offers a pathway towards more accurate, efficient, and timely insights from complex data lakes.
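The refinement process described above follows a familiar loop: execute a plan, score the result, and revise until quality is acceptable or a round limit is hit. The sketch below assumes stand-in executor and scorer callables; the revision step is a placeholder where an LLM-guided rewrite would sit, not TAIJI's actual component.

```python
# Minimal sketch of iterative plan refinement: run a plan, score the
# result, and refine until the score clears a threshold. The executor,
# scorer, and "+refined" revision marker are stand-ins for illustration.
from typing import Callable, Tuple

def refine_loop(plan: str,
                execute: Callable[[str], str],
                score: Callable[[str], float],
                max_rounds: int = 3,
                threshold: float = 0.8) -> Tuple[str, float]:
    """Return the final plan and the quality score of its last result."""
    quality = 0.0
    for _ in range(max_rounds):
        result = execute(plan)
        quality = score(result)
        if quality >= threshold:
            break                    # plan is good enough; stop refining
        plan = plan + " +refined"    # stand-in for LLM-guided revision
    return plan, quality
```

Bounding the rounds matters: each refinement costs another round of model inference, so the loop trades answer quality against the same efficiency concerns the rest of the system is built around.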
👉 More information
🗞 TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes
🧠 DOI: https://doi.org/10.48550/arXiv.2505.11270
