MolecularIQ Enables Fine-Grained Evaluation of Reasoning over Molecular Graphs

A molecule’s properties are fundamentally determined by its composition and structure, encoded in its molecular graph. Consequently, reasoning about these properties demands the ability to parse and understand this underlying graph structure. Researchers Christoph Bartmann, Johannes Schimunek, and Mykyta Ielanskyi, alongside colleagues at the ELLIS Unit Linz and LIT AI Lab of Johannes Kepler University, Austria, present a new benchmark, MolecularIQ, designed to rigorously assess chemical reasoning capabilities. Unlike existing evaluations, which often rely on potentially biased data or simplified question formats, MolecularIQ focuses exclusively on tasks with symbolically verifiable solutions, enabling a fine-grained analysis of how well current Large Language Models (LLMs) truly understand molecular structures, and where they fall short. This work offers actionable insights for developing LLMs that can reason faithfully over molecular structure, paving the way for more reliable and accurate chemical predictions and discoveries.

The team achieved this by creating a benchmark focused exclusively on tasks with symbolically verifiable answers, eliminating the risks of data leakage or bias present in many existing chemistry benchmarks. MolecularIQ enables a fine-grained evaluation of reasoning capabilities, pinpointing specific tasks and molecular structures where current LLMs falter, and providing actionable insights for future model development.

The study reveals that current chemistry benchmarks often measure factual recall or rely on datasets potentially present in LLM training data, hindering true assessment of structural reasoning. Consequently, the researchers constructed a benchmark where every answer can be computed directly from the molecular graph, ensuring test labels are algorithmically generated and free from memorisation or exploitation of dataset correlations. This approach establishes a clear baseline for structural competence, allowing for a more accurate evaluation of whether LLMs genuinely understand molecular structure rather than simply recognising patterns. The work opens new avenues for developing models that reason faithfully over molecular structure, crucial for advancements in areas like drug discovery and materials science.
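To make "computed directly from the molecular graph" concrete, here is a minimal sketch of generating such labels with RDKit; the specific features shown (ring count, aromatic atoms, hydrogen-bond donors) are illustrative choices, not necessarily the benchmark's exact feature set.

```python
# Minimal sketch: deriving symbolically verifiable ground-truth labels
# directly from a molecular graph with RDKit. The features below are
# illustrative; MolecularIQ's actual feature set may differ.
from rdkit import Chem
from rdkit.Chem import Lipinski, rdMolDescriptors

def ground_truth_features(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparseable SMILES: {smiles}")
    return {
        "num_rings": rdMolDescriptors.CalcNumRings(mol),
        "num_aromatic_atoms": sum(a.GetIsAromatic() for a in mol.GetAtoms()),
        "num_h_donors": Lipinski.NumHDonors(mol),
    }

# Phenol: one ring, six aromatic atoms, one hydrogen-bond donor
print(ground_truth_features("c1ccccc1O"))
```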
MolecularIQ comprises three task types (feature counting, feature indexing, and constrained generation) and spans three complexity dimensions: multitask load, molecule complexity, and molecule representation. This multi-dimensional design allows for controlled, fine-grained diagnosis of model failures, complementing existing benchmarks with a targeted evaluation of internalised structural competence. Experiments show that the benchmark effectively localises failures to specific tasks and molecular structures, providing a detailed understanding of the strengths and limitations of current chemistry LLMs. The researchers provide a leaderboard and code repository to facilitate further research and collaboration within the field, fostering the development of more robust and reliable molecular reasoning models.
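As a rough illustration of this design, a single benchmark instance might be represented along the following lines; the field names here are hypothetical, inferred from the description above rather than taken from the released dataset.

```python
# Hypothetical schema for a MolecularIQ instance, inferred from the text;
# the released dataset's actual fields and names may differ.
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    COUNTING = "feature_counting"          # e.g. "How many rings are present?"
    INDEXING = "feature_indexing"          # e.g. "Which atom indices are aromatic?"
    GENERATION = "constrained_generation"  # e.g. "Write a SMILES with two rings"

@dataclass
class BenchmarkItem:
    task_type: TaskType
    smiles: str                # molecule representation shown to the model
    multitask_load: int        # number of simultaneous requests (1, 2, 3, or 5)
    questions: list[str]       # one entry per requested sub-task
    ground_truth: list[str]    # precomputed, symbolically verifiable answers
```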

This research establishes a critical foundation for building LLMs capable of acting as decision-making engines in scientific workflows, proposing experiments, designing molecules, and interpreting results. By focusing on symbolically verifiable tasks, the study ensures that improvements in LLM performance reflect genuine progress in molecular understanding, rather than simply improved pattern recognition. The team’s approach promises to accelerate the development of LLMs that can contribute meaningfully to complex chemical challenges, lowering barriers for non-experts and enabling end-to-end automation of the scientific process.

Symbolic Verification of LLM Molecular Reasoning

Scientists developed MolecularIQ, a novel benchmark designed to rigorously evaluate molecular structure reasoning in large language models (LLMs). This work pioneers a symbolically verifiable approach, moving beyond general chemical knowledge assessments and addressing limitations found in existing benchmarks reliant on potentially biased literature or simplified multiple-choice questions. The core of MolecularIQ lies in its focus on tasks where answers can be definitively verified, enabling fine-grained evaluation of reasoning capabilities over molecular graphs and pinpointing specific failure points. Researchers integrated MolecularIQ into the lm-evaluation-harness framework, facilitating standardised evaluation of both locally run and API-accessed models.
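Recent versions of lm-evaluation-harness expose a Python entry point, so a run could look roughly like the sketch below; the task name molecular_iq and the model checkpoint are placeholders, so consult the released configuration for the actual identifiers.

```python
# Sketch of an evaluation run through lm-evaluation-harness (v0.4+).
# "molecular_iq" is a placeholder task name; the released benchmark
# configuration defines the real identifier.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # local Hugging Face backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",
    tasks=["molecular_iq"],
    batch_size=8,
)
print(results["results"])
```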

Each of the 38 open-weight LLMs, encompassing 27 general-purpose and 11 chemistry-specialized models, underwent evaluation using tailored configurations, including model-specific sampling parameters, preprocessing functions, and extraction methods to ensure fair comparison and optimise performance. To address concerns about artificially inflated scores due to format compliance, the team engineered a hierarchical extraction procedure that accommodates diverse output formats. This system is coupled with key-specific matching and canonical normalisation, decoupling format from chemical accuracy and preserving property-level granularity for detailed analysis. Scoring was performed using a binary symbolic verifier, calculating per-instance accuracy as the mean across three independent stochastic generation rollouts; per-instance scores therefore range from 0 to 1, and aggregate accuracies are reported as percentages.
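A minimal sketch of that scoring logic, assuming a generation-style answer verified by comparing RDKit-canonicalised SMILES (counting and indexing answers would be compared numerically instead):

```python
# Minimal sketch: binary symbolic verification with canonical normalisation,
# and per-instance accuracy as the mean over independent rollouts.
from rdkit import Chem

def canonical(smiles: str) -> str | None:
    """Normalise a SMILES string to RDKit's canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def verify(predicted: str, reference: str) -> bool:
    """Binary verifier: correct iff prediction and reference denote the same molecule."""
    pred = canonical(predicted)
    return pred is not None and pred == canonical(reference)

def instance_accuracy(rollouts: list[str], reference: str) -> float:
    """Per-instance accuracy: mean correctness over stochastic rollouts."""
    return sum(verify(r, reference) for r in rollouts) / len(rollouts)

# Two of three rollouts are phenol in different notations -> accuracy 2/3
print(instance_accuracy(["OC1=CC=CC=C1", "c1ccccc1O", "CCO"], "c1ccc(O)cc1"))
```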

For a subset of models, eight rollouts were performed, revealing negligible differences compared to the standard three and demonstrating the robustness of the measurement approach. The study reports both overall accuracy and granular breakdowns by reasoning task, multitask load, molecular complexity, molecular-feature category, and SMILES format, enabling systematic assessment of model strengths and weaknesses. A binary success metric was also employed, requiring at least two out of three rollouts to be correct for an instance to count as successful. Furthermore, the team established a living leaderboard hosted on Hugging Face, accepting submissions of new chemistry LLMs with provided lm-evaluation-harness configurations and optional preprocessing/extraction functions. Results demonstrate that large, state-of-the-art LLMs with substantial reasoning budgets lead in molecular structure reasoning: GPT-OSS 120B (High) achieved an overall accuracy of 47.5 ±0.6%, and nine of the top ten models exhibited similarly high performance. This methodology provides actionable insights into the capabilities and limitations of current chemistry LLMs, guiding the development of models that reason faithfully over molecular structure.
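The two-of-three success criterion mentioned above amounts to a majority vote over the per-rollout verifier outcomes; a minimal sketch:

```python
def binary_success(rollout_correct: list[bool], threshold: int = 2) -> bool:
    """An instance counts as successful if at least `threshold`
    rollouts are verified correct (two of three in this study)."""
    return sum(rollout_correct) >= threshold

print(binary_success([True, True, False]))   # True
print(binary_success([True, False, False]))  # False
```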

MolecularIQ Benchmark Assesses LLM Molecular Reasoning

Scientists have developed MolecularIQ, a novel benchmark designed to rigorously evaluate molecular structure reasoning in large language models (LLMs). The research introduces a symbolically verifiable task suite, enabling fine-grained assessment of reasoning capabilities over molecular graphs and pinpointing failures related to specific tasks and molecular structures. The benchmark comprises 849 unique molecules drawn from a hard test pool, encompassing a total of 5,111 questions distributed across varying levels of complexity. These questions were meticulously constructed with precomputed ground-truth values for molecular features, ensuring accurate evaluation and preventing data leakage.

The team measured performance across four multitask regimes: single-task settings and scenarios with 2, 3, or 5 simultaneous requests, systematically varying the computational load. Molecule pools were created by retrieving single-fragment molecules containing at least one carbon atom from PubChem, resulting in a training pool of 1.3 million molecules and 1.0 million molecules each in the easy and hard test sets. Clustering was performed using MinHashLSH over ECFP fingerprints (radius 2, 512 bits) with a 0.7 Tanimoto similarity threshold to ensure diversity and prevent redundancy. Results demonstrate that the benchmark effectively isolates the ability of models to manage multiple chemical sub-tasks, disentangling inherent task difficulty from broader reasoning limitations.
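One way to implement this deduplication step is sketched below, pairing RDKit Morgan fingerprints (the standard ECFP implementation) with the datasketch library's MinHashLSH; this mirrors the described setup but is not necessarily the authors' exact pipeline.

```python
# Sketch: near-duplicate filtering with MinHashLSH over Morgan/ECFP
# fingerprints (radius 2, 512 bits) at a 0.7 similarity threshold.
# MinHash estimates Jaccard similarity, which equals Tanimoto on bit sets.
from datasketch import MinHash, MinHashLSH
from rdkit import Chem
from rdkit.Chem import AllChem

def fingerprint_minhash(smiles: str, num_perm: int = 128) -> MinHash:
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=512)
    mh = MinHash(num_perm=num_perm)
    for bit in fp.GetOnBits():        # treat each set bit as a set element
        mh.update(str(bit).encode())
    return mh

lsh = MinHashLSH(threshold=0.7, num_perm=128)
kept = []
for i, smi in enumerate(["CCO", "OCC", "c1ccccc1"]):  # "OCC" duplicates "CCO"
    mh = fingerprint_minhash(smi)
    if not lsh.query(mh):             # keep only molecules with no near-duplicate so far
        lsh.insert(f"mol-{i}", mh)
        kept.append(smi)
print(kept)                           # ['CCO', 'c1ccccc1']
```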

Tests show that TxGemma-27B achieved an overall accuracy of 5.0 ±0.2% on the MolecularIQ benchmark. Ether0, a chemistry-specialized LLM, recorded an overall accuracy of 6.5 ±0.3%, with performance broken down as 3.2 ±0.3% for counting, 0.1 ±0.0% for indexing, and 17.5 ±0.8% for generation tasks. ChemDFM-R-14B, another specialized model, attained an overall accuracy of 8.7 ±0.4%, demonstrating 12.9 ±0.8% accuracy in counting, 2.8 ±0.4% in indexing, and 10.5 ±0.8% in generation. Among these models, the general-purpose Qwen-3 Next 80B achieved the highest overall accuracy of 32.3 ±0.6%, with 30.7 ±1.0% for counting, 26.9 ±0.9% for indexing, and 40.0 ±1.1% for generation. The scoring system employs a binary symbolic verifier, calculating per-instance accuracy as the mean over three independent rollouts, ensuring robust and reliable evaluation. The dataset construction also provides a framework for dynamically adjusting and scaling the benchmark, supporting the creation of validation sets and the expansion of MolecularIQ to address evolving model capabilities and specific subfields such as natural products.

👉 More information
🗞 MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs
🧠 ArXiv: https://arxiv.org/abs/2601.15279

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
