Researchers have developed a formal verification framework that uses the Lean 4 theorem prover to check the reasoning steps produced by large language models. This retrospective, step-aware approach states each mathematical claim formally and attempts to prove it, improving reasoning performance and offering machine-checkable evidence against hallucinated steps in model outputs.
The propensity of large language models (LLMs) to ‘hallucinate’ – generating plausible but incorrect responses – remains a significant obstacle to their reliable application, particularly in domains that demand precision, such as mathematics. Researchers are now exploring methods that go beyond simply scoring LLM outputs, focusing instead on verifying the logical validity of each reasoning step. A team led by Chengwu Liu (Peking University), with collaborators from Huawei Noah’s Ark Lab and The Hong Kong University of Science and Technology, details a novel framework in the paper Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification. The approach uses the formal proof language Lean 4 to state and verify mathematical claims at each stage of an LLM’s reasoning process, offering a transparent, checkable audit trail and a new benchmark for assessing step-by-step correctness.
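To make the idea concrete, here is a minimal Lean 4 sketch (assuming a project with Mathlib available) of the kind of statement such a pipeline could produce: a natural-language step such as “since x > 2, it follows that x² > 4” becomes a theorem that either compiles with a machine-checked proof or fails. The theorem name, variables, and tactic choice are illustrative assumptions, not taken from the paper.

```lean
import Mathlib

-- Hypothetical formalisation of a single reasoning step:
-- "since x > 2, it follows that x² > 4".
-- If the claim were false, no proof would type-check and the step could be flagged.
theorem step_claim (x : ℝ) (hx : x > 2) : x ^ 2 > 4 := by
  nlinarith
```

A step that does not hold would simply fail to produce any compiling proof, which is what gives the audit trail its verifiable character.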
New Dataset Enables Granular Evaluation of Automated Theorem Proving
A new dataset, FormalStep, comprising 43,652 individual reasoning steps formalised in the Lean 4 theorem prover, offers a more granular way to evaluate automated theorem proving (ATP) systems. Existing benchmarks typically assess the ability to prove complete theorems, overlooking the need to check the correctness of each individual reasoning step.
FormalStep addresses this limitation by isolating single-step reasoning, giving researchers insight into how ATP systems reach conclusions and enabling targeted improvements. The dataset was constructed by translating natural-language statements into formal expressions in Lean 4, a theorem prover based on dependent type theory in which every proof is machine-checkable. The project achieved a 72.2% formalisation rate, demonstrating the feasibility of converting human-like reasoning into precise mathematical statements.
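The paper’s exact record layout is not reproduced here, but a FormalStep-style entry can be sketched as below; the field names and the toy records are assumptions for illustration, and the real schema may differ.

```python
from dataclasses import dataclass


@dataclass
class FormalStepRecord:
    """Hypothetical shape of one FormalStep entry; the real schema may differ."""
    nl_step: str          # the natural-language reasoning step
    lean_statement: str   # its Lean 4 formalisation
    compiles: bool        # whether the formal statement type-checks in Lean 4
    aligned: bool         # whether the formalisation preserves the step's meaning


records = [
    FormalStepRecord("Since x > 2, x^2 > 4.",
                     "theorem step (x : ℝ) (hx : x > 2) : x ^ 2 > 4",
                     compiles=True, aligned=True),
    FormalStepRecord("The limit exists because the terms get small.",
                     "-- translation failed",
                     compiles=False, aligned=False),
]

# Formalisation rate: fraction of steps whose Lean statements type-check,
# the quantity reported as 72.2% over the 43,652 steps in the paper.
rate = sum(r.compiles for r in records) / len(records)
print(f"formalisation rate: {rate:.1%}")
```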
To ensure the translation process did not alter the original meaning, semantic alignment assessments were conducted using GPT-4o-mini. These assessments confirmed that 80.9% of the formalised theorems accurately reflect the intent of the original natural language steps. An LSTM (Long Short-Term Memory) aggregator was employed to mitigate the impact of minor errors or inconsistencies during formalisation, enhancing the robustness of the evaluation.
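The paper reports that an LSTM aggregator smooths over noisy step-level signals; the sketch below shows one plausible form such an aggregator could take in PyTorch. The feature layout, dimensions, and sigmoid scoring head are assumptions, not the authors’ architecture.

```python
import torch
import torch.nn as nn


class StepAggregator(nn.Module):
    """Aggregates per-step verification signals into one solution-level score.

    Each reasoning step is represented by a small feature vector, e.g.
    [proved_flag, semantic_alignment_score, formalisation_success_flag].
    Feature choice and sizes are illustrative assumptions.
    """

    def __init__(self, feature_dim: int = 3, hidden_dim: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, step_features: torch.Tensor) -> torch.Tensor:
        # step_features: (batch, num_steps, feature_dim)
        _, (h_n, _) = self.lstm(step_features)
        # h_n[-1] is the final hidden state summarising the whole step sequence
        return torch.sigmoid(self.head(h_n[-1])).squeeze(-1)


# Example: score a four-step solution where the third step failed verification.
model = StepAggregator()
steps = torch.tensor([[[1.0, 0.9, 1.0],
                       [1.0, 0.8, 1.0],
                       [0.0, 0.4, 0.0],
                       [1.0, 0.7, 1.0]]])
print(model(steps))  # untrained score in (0, 1)
```

The point of such an aggregator is that a single mis-formalised step need not sink an otherwise sound solution: the sequence model can learn how much weight to give each noisy signal.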
Experiments on the dataset with DeepSeek-Prover-V1.5 and COPRA (with GPT-4o) revealed that both systems can successfully prove single-step statements given sufficient computational resources. DeepSeek-Prover-V1.5 performed better when computational resources were limited.
The development of FormalStep provides a novel dataset and a comprehensive evaluation framework covering auto-formalisation accuracy, semantic alignment, and theorem-proving performance. The researchers highlight the importance of efficient algorithms and hardware acceleration for tackling complex reasoning tasks. Advances in ATP systems have implications for areas including software verification and artificial intelligence.
The research team intends to expand the FormalStep dataset with new reasoning patterns and improve the evaluation framework. Future work will also investigate the use of machine learning to automate the translation of natural language into formal logic and assess the reasoning capabilities of different large language models.
👉 More information
🗞 Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification
🧠 DOI: https://doi.org/10.48550/arXiv.2506.04592
