Large language models exhibit limited capacity for complex mathematical reasoning requiring multi-step first-order logic deductions. Researchers present DREAM, a self-adaptive system that improves theorem-proving accuracy from 0.6% to 6.4% by diversifying proof strategies and providing feedback on sub-propositions, utilising a new dataset of 447 Lean 4 theorems.
The capacity of large language models (LLMs) to perform complex logical reasoning remains a significant challenge, particularly when applied to multi-step mathematical proofs. Current LLMs, despite achieving reasonable results on standard benchmarks, often falter when required to navigate the intricacies of formal deduction. Researchers from Hong Kong University of Science and Technology, Peking University, and Zhejiang University, led by Chuxue Cao, Mengze Li, Juntao Dai, and colleagues, address this limitation in their work, “Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving”. They present DREAM, a novel framework designed to improve the diversity and reliability of LLMs’ reasoning processes through axiom-driven strategy diversification and a feedback mechanism for correcting errors within proofs, and introduce a new dataset of 447 mathematical theorems expressed in the Lean 4 formal system to facilitate evaluation of these approaches.
Large language models (LLMs) increasingly demonstrate competence in first-order logic (FOL) reasoning, finding application across diverse fields, yet their capacity for complex mathematical reasoning, specifically tasks requiring multi-step FOL deductions, remains comparatively underexplored. Current LLMs, while achieving competitive results on established mathematical benchmarks, exhibit limitations when confronted with problems demanding extended chains of logical inference, as evidenced by Deepseek-Prover-V2-7B’s 4.2% accuracy on a newly proposed theorem proving dataset. This performance bottleneck arises from a restricted exploration of potential proof strategies and the susceptibility of LLMs to early errors that propagate throughout an entire proof attempt.
Researchers address these limitations with DREAM, a self-adaptive system designed to enhance both the diversity and reasonableness of LLM-generated proof strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism, actively promoting a wider range of strategic approaches during the proof search, while simultaneously, a Sub-Proposition Error Feedback component enables LLMs to critically evaluate intermediate steps and rectify errors before they compromise the overall proof. This dual approach mitigates the risk of early mistakes and encourages more robust reasoning, ultimately improving the reliability of complex mathematical derivations.
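The interplay of the two components can be pictured as a search loop: sample proof attempts under deliberately varied strategies, verify each sub-proposition, and feed the first failing step back to the model as a repair hint. The sketch below is a toy illustration of that control flow only; all function names, the strategy table, and the stand-in "LLM" and "verifier" are hypothetical, not the paper's actual interfaces.

```python
# Toy sketch of DREAM's two mechanisms (hypothetical names throughout):
# axiom-driven strategy diversification + sub-proposition error feedback.

AXIOM_STRATEGIES = {                 # each axiom family suggests distinct
    "induction": ["base+step", "strong"],   # tactics, widening the search
    "contradiction": ["assume-neg"],
}

def propose_proof(theorem, strategy, hint=None):
    """Stand-in for an LLM call: returns a list of sub-propositions."""
    steps = [f"{theorem}:{strategy}:step{i}" for i in range(3)]
    if hint is not None:             # a hint lets the model repair the bad step
        steps[hint] += ":fixed"
    return steps

def check_step(step):
    """Stand-in for a Lean-style check of one sub-proposition."""
    return "step1" not in step or step.endswith(":fixed")

def dream_prove(theorem, max_rounds=3):
    for axiom, strategies in AXIOM_STRATEGIES.items():
        for strategy in strategies:          # strategy diversification
            hint = None
            for _ in range(max_rounds):      # sub-proposition error feedback
                steps = propose_proof(theorem, strategy, hint)
                bad = next((i for i, s in enumerate(steps)
                            if not check_step(s)), None)
                if bad is None:
                    return steps             # every sub-proposition verified
                hint = bad                   # feed the failing index back
    return None                              # all strategies exhausted

print(dream_prove("thm") is not None)
```

In this toy run the deliberately failing second step is caught by the checker, repaired via the feedback hint, and the proof succeeds on the next round, which is the behaviour the feedback mechanism is meant to produce before early errors can poison the whole attempt.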
DREAM delivers a measurable improvement, raising accuracy from 0.6% to 6.4% on the challenging new theorem-proving dataset. First-order logic (FOL), a formal system used in mathematics, philosophy, and computer science, represents statements and arguments using quantifiers (such as ‘for all’ or ‘there exists’), variables, and logical connectives (like ‘and’, ‘or’, ‘not’, and ‘implies’). The system allows for precise and unambiguous expression of mathematical concepts and facilitates automated reasoning.
The research also contributes a valuable resource to the community, providing a dataset of 447 mathematical theorems formally expressed in Lean 4. Lean 4 functions as a functional programming language and theorem prover, allowing for rigorous verification of mathematical proofs, ensuring their logical soundness and correctness.
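To make the connection between FOL and Lean 4 concrete, here is a minimal illustrative theorem (not drawn from the paper's dataset): a universally quantified statement with an existential conclusion, stated and machine-checked in Lean 4 using only the standard `Nat.lt_succ_self` lemma.

```lean
-- "For every natural number n, there exists some m greater than n."
-- The parameter (n : Nat) is the universal quantifier; ∃ m, n < m is
-- the existential; the proof supplies the witness n + 1.
theorem exists_gt (n : Nat) : ∃ m, n < m :=
  ⟨n + 1, Nat.lt_succ_self n⟩
```

If the witness or the inequality proof were wrong, Lean's kernel would reject the proof outright, which is exactly the kind of rigorous verification the dataset relies on.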
Future work focuses on several key areas, including expanding the dataset to encompass a wider range of mathematical domains and theorem complexities, and investigating methods to improve the efficiency of the Sub-Proposition Error Feedback mechanism, potentially through reinforcement learning. Additionally, researchers plan to explore the integration of DREAM with other reasoning frameworks, such as symbolic computation systems, to unlock synergistic benefits, and investigate techniques for automatically generating diverse and challenging theorem proving datasets to drive continued progress in this field.
👉 More information
🗞 Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving
🧠 DOI: https://doi.org/10.48550/arXiv.2506.17104
