The challenge of synthesizing knowledge from the rapidly expanding scientific literature is increasingly critical, yet valuable experimental data often remains locked within difficult-to-analyse formats. Kausik Hira, Mohd Zaki, and Mausam, along with their colleagues at the Indian Institute of Technology Delhi, address this problem with MatSKRAFT, a new computational framework that automatically extracts and integrates materials science knowledge from tabular data at an unprecedented scale. The team’s approach converts tables into graph-based representations, then uses these graphs to build models that incorporate fundamental scientific principles, achieving significantly higher accuracy and speed than existing methods. Applying MatSKRAFT to nearly 69,000 tables from over 47,000 publications generates a comprehensive database containing over 535,000 entries, including 104,000 compositions that expand existing materials knowledge and promise to accelerate data-driven materials discovery.

Synthesizing knowledge across vast literature remains challenging, as most experimental data resides in semi-structured formats that resist systematic extraction and analysis. This work presents MatSKRAFT, a computational framework that automatically extracts and integrates materials science knowledge from tabular data at an unprecedented scale. The approach transforms tables into graph-based representations, processed by constraint-driven graph neural networks that encode scientific principles directly into the model architecture. MatSKRAFT significantly outperforms state-of-the-art large language models, achieving F1 scores of 88.68 for property extraction and 71.35 for composition extraction, while processing data 19 to 496 times faster.
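As a rough illustration of what a graph-based table representation can look like, the sketch below turns a small table into nodes and edges. The scheme used here (one node per cell, an edge between any two cells that share a row or a column) and the name `table_to_graph` are assumptions made for illustration; the article does not spell out the exact construction MatSKRAFT uses.

```python
from itertools import combinations


def table_to_graph(table):
    """Turn a rectangular table (a list of rows) into nodes and edges.

    Assumption: one node per cell, keyed by (row, column), and an
    undirected edge between any two cells in the same row or column.
    """
    nodes = {}
    for r, row in enumerate(table):
        for c, text in enumerate(row):
            nodes[(r, c)] = text

    n_rows = len(table)
    n_cols = len(table[0]) if table else 0
    edges = set()
    # Connect every pair of cells that share a row.
    for r in range(n_rows):
        for c1, c2 in combinations(range(n_cols), 2):
            edges.add(((r, c1), (r, c2)))
    # Connect every pair of cells that share a column.
    for c in range(n_cols):
        for r1, r2 in combinations(range(n_rows), 2):
            edges.add(((r1, c), (r2, c)))
    return nodes, edges


# Made-up example table, not taken from the paper.
table = [
    ["Glass", "SiO2", "Na2O"],
    ["G1", "70", "30"],
    ["G2", "60", "40"],
]
nodes, edges = table_to_graph(table)
```

In a design like this, a graph neural network can pass information between a header cell such as "SiO2" and the numeric cells beneath it, which is what lets the model label a column as a whole rather than cell by cell.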

Scientific Table Data Extraction and Refinement

Scientists designed a system, DiSCoMaT, for extracting materials composition data from scientific tables, focusing on sophisticated post-processing and data augmentation techniques to improve accuracy and robustness. The system employs a modular pipeline that corrects errors, suppresses noise, and enforces physical plausibility in extracted data. Semantic validation disqualifies incorrect labels, resolves ambiguities, and corrects overloads, while direct matching uses a dictionary to map phrases to correct labels. Pattern and contextual reasoning disambiguates ambiguous headers, and value-aware correction rescales values based on header patterns.
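The direct-matching step can be pictured as a dictionary lookup on a normalized header. The sketch below is a minimal version under stated assumptions: the dictionary entries, the label names, and the helpers `normalize` and `match_header` are all hypothetical and are not taken from the system described.

```python
import re

# Hypothetical phrase-to-label dictionary; a real one would be far larger.
HEADER_TO_PROPERTY = {
    "tg": "glass_transition_temperature",
    "glass transition temperature": "glass_transition_temperature",
    "density": "density",
    "refractive index": "refractive_index",
}


def normalize(header):
    """Drop bracketed units, collapse whitespace, and lowercase."""
    header = re.sub(r"[\(\[].*?[\)\]]", " ", header)
    return re.sub(r"\s+", " ", header).strip().lower()


def match_header(header):
    """Return the property label for a header, or None if unknown."""
    return HEADER_TO_PROPERTY.get(normalize(header))
```

Under these assumptions, `match_header("Tg (°C)")` resolves to the glass transition label, while an unrelated header such as "Sample" returns `None` and is left for the pattern-based and contextual steps.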

Physical range validation removes values outside acceptable physical ranges, and the system removes invalid property-unit combinations. Ablation studies demonstrate that this post-processing improves the F1 score by 9.38 points, significantly reducing false positives.
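Both filters can be expressed with a single lookup table keyed by property and unit. The following is a sketch only: the bounds, the unit strings, and the names `ALLOWED`, `is_plausible`, and `filter_records` are illustrative assumptions rather than values used by the authors.

```python
# Hypothetical (property, unit) -> (low, high) bounds, for illustration only.
ALLOWED = {
    ("density", "g/cm3"): (0.1, 25.0),
    ("glass_transition_temperature", "K"): (0.0, 3000.0),
    ("refractive_index", None): (1.0, 5.0),
}


def is_plausible(prop, unit, value):
    """Reject unknown property-unit pairs and out-of-range values."""
    bounds = ALLOWED.get((prop, unit))
    if bounds is None:
        return False  # invalid property-unit combination
    low, high = bounds
    return low <= value <= high


def filter_records(records):
    """Keep only records that pass both checks."""
    return [
        r for r in records
        if is_plausible(r["property"], r["unit"], r["value"])
    ]


# Made-up extracted records.
records = [
    {"property": "density", "unit": "g/cm3", "value": 2.5},
    {"property": "density", "unit": "g/cm3", "value": 250.0},  # out of range
    {"property": "density", "unit": "K", "value": 2.5},        # bad unit
    {"property": "refractive_index", "unit": None, "value": 1.52},
]
kept = filter_records(records)
```

Filters of this kind can only discard predictions, which fits the reported effect of reducing false positives.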

To address limitations of distant supervision, the team enhanced training data by re-annotating tables initially labeled as non-composition. The system determines table orientation, identifying whether data is arranged by column or row, and then identifies composition columns or rows based on header patterns, molecular weight units, and median value thresholds. Completeness validation checks if the sum of composition values in each row or column is close to 1 (or 100), indicating complete information, and the system promotes constituent identifiers in unmarked rows or columns. An edge list is generated for graph neural network training by pairing constituents with compositions.
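The completeness check and the edge list can be sketched as follows for a table whose header row names the constituents. The 2% tolerance, the cell-coordinate edge format, and the function names are assumptions chosen for illustration; the article does not give the thresholds or data structures actually used.

```python
import math


def is_complete(values, rel_tol=0.02):
    """True if the values sum to roughly 1 (fractions) or 100 (percent)."""
    total = sum(values)
    return any(
        math.isclose(total, target, rel_tol=rel_tol)
        for target in (1.0, 100.0)
    )


def constituent_edges(header, rows):
    """Pair each constituent header cell with its composition cells.

    Cells are addressed as (row, column); row 0 is the header.
    Rows that fail the completeness check contribute no edges.
    """
    edges = []
    for r, values in enumerate(rows, start=1):
        if not is_complete(values):
            continue
        for c in range(len(header)):
            edges.append(((0, c), (r, c)))
    return edges


# Made-up example: the third row sums to 70 and is skipped.
header = ["SiO2", "Na2O"]
rows = [[70.0, 30.0], [0.6, 0.4], [50.0, 20.0]]
edges = constituent_edges(header, rows)
```

A row that sums to 70 may still hold real data with an unreported constituent, so a fuller version would route such rows to the partial-information handling rather than drop them.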

This relabeling strategy improves composition extraction by over 10.5 F1 points. In essence, the system goes beyond simple prediction by incorporating domain-specific knowledge, symbolic reasoning, and data augmentation to achieve high accuracy and robustness in materials composition data extraction.

Materials Knowledge Extracted From Vast Tabular Data

Scientists developed MatSKRAFT, a computational framework that automatically extracts and integrates materials science knowledge from tabular data at an unprecedented scale, addressing a critical bottleneck in accessing and synthesizing scientific findings. The work demonstrates a systematic approach to processing vast quantities of data, constructing a comprehensive database containing over 535,000 entries from nearly 69,000 tables sourced from more than 47,000 research publications. This database includes 104,000 compositions, expanding coverage beyond existing databases and offering a richer resource for materials discovery. The framework employs specialized graph neural networks that encode scientific principles directly into the model architecture, enhancing accuracy and efficiency.

Property extraction achieved an F1 score of 88.68, while composition extraction reached 71.35, significantly outperforming existing methods.
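For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall, reported here on a 0 to 100 scale. The short sketch below computes it from raw counts; the counts in the example are invented and do not come from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Return F1 on a 0-100 scale from raw counts."""
    if true_pos == 0:
        return 0.0  # no correct extractions, so F1 is zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 100 * 2 * precision * recall / (precision + recall)


# Invented counts: 8 correct, 2 spurious, 2 missed extractions.
example = f1_score(8, 2, 2)
```

Because the harmonic mean is pulled toward the lower of the two inputs, a system cannot score well on F1 by extracting everything (high recall, low precision) or by extracting very little (the reverse).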

MatSKRAFT utilizes two distinct graph neural network architectures for composition extraction, one for single-cell composition tables and another for multiple-cell and partial-information tables, adapting to the diverse formats found in scientific literature. The system’s knowledge base integration component links extracted compositions and properties through both intra-table and inter-table connections, creating coherent relationships from fragmented data. This integration process leverages orientation-based connections within tables and identifier-based association across tables, ensuring data consistency and accuracy. The resulting database enables diverse applications, including materials selection charts, multi-property screening, temporal analysis of research trends, and accelerated identification of rare materials with exceptional property combinations, establishing a new paradigm for systematic materials discovery and development.
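Identifier-based association can be thought of as a join on a shared key. The sketch below links composition and property records that carry the same paper and material identifier; the record fields, the key choice, and the name `link_records` are hypothetical, since the article does not describe the actual schema.

```python
from collections import defaultdict


def link_records(compositions, properties):
    """Join compositions and properties on (paper, material_id).

    Only materials with both a composition and at least one
    property are returned.
    """
    merged = defaultdict(lambda: {"composition": None, "properties": {}})
    for comp in compositions:
        key = (comp["paper"], comp["material_id"])
        merged[key]["composition"] = comp["composition"]
    for prop in properties:
        key = (prop["paper"], prop["material_id"])
        merged[key]["properties"][prop["name"]] = prop["value"]
    return {
        key: entry for key, entry in merged.items()
        if entry["composition"] is not None and entry["properties"]
    }


# Made-up records, as if from two different tables of one paper.
compositions = [
    {"paper": "P1", "material_id": "G1",
     "composition": {"SiO2": 70, "Na2O": 30}},
    {"paper": "P1", "material_id": "G2",
     "composition": {"SiO2": 60, "Na2O": 40}},
]
properties = [
    {"paper": "P1", "material_id": "G1", "name": "density", "value": 2.5},
    {"paper": "P1", "material_id": "G3", "name": "density", "value": 2.7},
]
linked = link_records(compositions, properties)
```

Here only G1 is linked: G2 has no reported property and G3 has no composition, which mirrors how fragmented reporting across tables limits what can be joined.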

MatSKRAFT Builds Extensive Materials Knowledge Base

This research presents MatSKRAFT, a computational framework that automatically extracts and integrates materials science knowledge from tabular data at an unprecedented scale, significantly advancing the field of knowledge base construction. By transforming tables into graph-based representations, and employing constraint-driven graph neural networks, the system achieves superior performance in both property and composition extraction, exceeding the accuracy of existing methods while requiring modest computational resources. Application of MatSKRAFT to a large corpus of research publications resulted in a comprehensive database containing over 535,000 entries, including the identification of over 104,000 material compositions not currently found in existing databases.

The resulting knowledge base reveals previously overlooked materials and compositional relationships, offering a powerful tool for data-driven materials discovery and potentially accelerating innovation in areas such as energy storage and sustainable materials development. While acknowledging limitations in handling inconsistent reporting conventions and complex table structures, the team highlights the utility of the extracted data for applications including multi-property materials screening. Future work will focus on incorporating text-based composition extraction, extending the framework to encompass synthesis and characterization methods, and integrating it with predictive modeling tools, further expanding the potential for automated materials discovery.

👉 More information
🗞 MatSKRAFT: A framework for large-scale materials knowledge extraction from scientific tables
🧠 ArXiv: https://arxiv.org/abs/2509.10448

The Neuron

MatSKRAFT Framework Achieves 88.68% F1 Score for Large-Scale Materials Knowledge Extraction from Tables
