CGPT Achieves Enhanced Table Retrieval Using LLM Supervision and K-Means Clustering

Researchers are tackling the persistent challenge of accurately retrieving information from tables, a task where conventional text-based embedding models often fall short due to the unique structure of tabular data. Tsung-Hsiang Chou, Chen-Jui Yu, and Shui-Hsiang Hsu, from National Chung Hsing University’s SMARTer centre, alongside Yao-Chung Fan, present CGPT, a novel framework that significantly improves table retrieval performance by combining clustering techniques with the power of large language models. CGPT constructs diverse partial tables and then utilises an LLM to generate synthetic queries, using this data as supervision to refine the embedding model itself, a step previously overlooked by many approaches. CGPT consistently outperforms existing methods, with a 16.54% average improvement in R@1 across four key benchmarks, and offers a scalable and effective solution for large-scale table understanding and retrieval, even with smaller LLMs.

LLMs and K-means for table embedding learning

Scientists have achieved a significant breakthrough in table retrieval, addressing the limitations of current embedding models when dealing with highly structured data. Researchers introduced CGPT, a novel training framework that leverages large language models (LLMs) to enhance table representation learning through supervised learning. The core innovation lies in constructing semantically diverse partial tables by employing K-means clustering to group table instances, followed by strategic sampling across these clusters to maximise semantic coverage. An LLM then generates synthetic queries specifically tailored to these partial tables, which are subsequently used in a hard-negative contrastive fine-tuning process to refine the underlying embedding model.
Further investigation revealed CGPT’s strong cross-domain generalisation capabilities within a unified multi-domain corpus. Notably, the research demonstrates that CGPT remains remarkably effective even when utilising smaller LLMs for synthetic query generation, suggesting a scalable and cost-efficient solution for large-scale table retrieval applications. This adaptability is crucial for real-world deployment, as it reduces the computational demands associated with utilising extremely large language models. The work establishes that semantically guided partial-table construction, combined with contrastive training driven by LLM-generated supervision, represents a powerful and scalable paradigm for advancing the field of table retrieval.

Specifically, the CGPT framework consists of four key stages: clustering-based partial table generation, synthetic query generation, hard negative sampling, and contrastive fine-tuning. By adaptively determining the number of clusters based on table size, calculated as k = min(⌈√m⌉, k_max) where m is the number of rows, the system ensures broad semantic coverage. This process, coupled with the strategic use of LLM-generated queries for hard-negative contrastive learning, allows the embedding model to discern subtle differences between tables and retrieve the most relevant results with greater precision. The team has made their code publicly available at https://github.com/yumeow0122/CGPT, fostering further research and development in this rapidly evolving area.
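The adaptive cluster count can be computed in a few lines. The sketch below assumes the reading k = min(⌈√m⌉, k_max) of the paper’s formula, with m the number of rows; the cap value k_max = 10 is an illustrative assumption, not a figure from the paper:

```python
import math

def adaptive_num_clusters(num_rows: int, k_max: int = 10) -> int:
    """Grow the number of K-means clusters with the square root of the
    table size, capped at k_max so very large tables stay tractable.
    The default cap of 10 is illustrative, not taken from the paper."""
    return min(math.ceil(math.sqrt(num_rows)), k_max)

print(adaptive_num_clusters(50))      # ceil(sqrt(50)) = 8
print(adaptive_num_clusters(10_000))  # capped at k_max = 10
```

Tying k to √m keeps small tables from being over-partitioned while still spreading samples across distinct semantic regions of large ones.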

Clustering and LLM-guided Partial Table Generation

Scientists developed CGPT, a novel training framework to enhance table retrieval performance through large language model (LLM) supervision. The study addressed limitations in existing methods by focusing on semantically diverse partial table construction and leveraging synthetic queries for direct embedding model refinement. Researchers initially employed K-means clustering to group table instances, effectively partitioning rows into semantically coherent subsets; this approach broadened the coverage of table attributes and instances during partial table creation. Following clustering, the team sampled from each cluster to construct these partial tables, ensuring a more representative selection than methods relying solely on initial rows, as the sketch below illustrates.
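As a concrete illustration of the cluster-then-sample step, here is a minimal Python sketch. It assumes rows are serialised to strings and embedded with an off-the-shelf sentence encoder (all-MiniLM-L6-v2 is a stand-in; the paper’s actual encoder may differ), and it samples the row nearest each centroid, which is one plausible sampling strategy among several:

```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def build_partial_table(rows, k, model):
    """Cluster serialised table rows by embedding, then keep the row
    nearest each centroid so the partial table spans all k semantic
    groups instead of just the first few rows."""
    embeddings = model.encode(rows)                          # shape (m, d)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    partial = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        partial.append(rows[members[np.argmin(dists)]])      # centroid-nearest row
    return partial

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder
rows = [
    "year: 2019 | city: Berlin | population: 3.6M",
    "year: 2019 | city: Paris | population: 2.1M",
    "year: 2020 | city: Tokyo | population: 13.9M",
    "year: 2021 | city: Lagos | population: 15.4M",
]
print(build_partial_table(rows, k=2, model=model))
```

Sampling one representative per cluster guarantees the partial table touches every semantic group, whereas taking the first k rows can miss entire regions of the table.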

An LLM then generated synthetic queries specifically tailored to these partial tables, moving beyond heuristic selection strategies. These synthetic queries were not merely used as enhanced table representations but, crucially, served as direct supervision in a hard-negative contrastive fine-tuning process. This process refined the embedding model, enabling it to better retrieve tables even when relevant information was sparsely distributed across rows. Experiments were conducted across four public benchmarks, MimoTable, OTTQA, FetaQA, and E2E-WTQ, to rigorously evaluate CGPT’s performance against existing baselines, including QGpT.
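The hard-negative contrastive step can be written as a standard InfoNCE-style objective. The sketch below is an assumption about the loss form: the paper specifies hard-negative contrastive fine-tuning, but the exact formulation and temperature here are illustrative:

```python
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(q, t_pos, t_neg, temperature=0.05):
    """InfoNCE-style objective: pull each synthetic query toward the
    partial table it was generated from (index 0 in the logits) and
    push it away from its hard-negative tables.

    q:     (B, d)    query embeddings
    t_pos: (B, d)    embeddings of the matching (positive) tables
    t_neg: (B, N, d) embeddings of N hard-negative tables per query
    """
    q, t_pos, t_neg = (F.normalize(x, dim=-1) for x in (q, t_pos, t_neg))
    pos = (q * t_pos).sum(dim=-1, keepdim=True)        # (B, 1) positive scores
    neg = torch.einsum("bd,bnd->bn", q, t_neg)         # (B, N) negative scores
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy check: batch of 4 queries, 3 hard negatives each, 384-dim embeddings.
loss = hard_negative_contrastive_loss(
    torch.randn(4, 384), torch.randn(4, 384), torch.randn(4, 3, 384)
)
print(loss.item())
```

Hard negatives, tables that are similar to the positive but do not answer the query, sharpen the decision boundary far more than random in-batch negatives would.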

CGPT boosts table retrieval accuracy

Further tests demonstrate that CGPT maintains strong cross-domain generalisation capabilities in a unified multi-domain corpus setting. Remarkably, the framework remains effective even when utilising smaller LLMs for synthetic query generation, indicating its scalability and cost-efficiency. The experiments confirm that semantically guided partial-table construction, combined with contrastive training from LLM-generated supervision, provides a robust and scalable paradigm for large-scale table retrieval. The researchers observed that the K-means clustering process effectively captures a table’s full semantic space, enabling the creation of representative partial tables.

The adaptive determination of the number of clusters (k) based on table size further optimises the process. Results indicate that this method surpasses previous approaches that relied on selecting only the first few rows of a table, which often failed to represent the complete information contained within. The breakthrough offers a promising pathway towards more effective information access in domains heavily reliant on structured data, such as finance, science, and logistics.

CGPT boosts table retrieval via LLM fine-tuning

Furthermore, CGPT exhibits strong performance in a unified, multi-domain corpus setting and maintains effectiveness even when utilising smaller LLMs for synthetic query generation. The authors acknowledge a limitation regarding cross-lingual robustness, noting that one selection method performed less effectively on English data, highlighting the importance of preserving semantic variation for reliable instance selection. Future research could explore methods to further enhance cross-lingual performance and investigate the application of CGPT to other structured data formats.

👉 More information
🗞 CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval
🧠 ArXiv: https://arxiv.org/abs/2601.15849

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
