Researchers developed DeepTheorem, a framework and dataset of 121,000 informal mathematical theorems, to enhance large language model reasoning. A reinforcement learning strategy, RL-Zero, utilising verified theorem variants, demonstrably improves performance, achieving state-of-the-art accuracy and reasoning quality in informal theorem proving.
The capacity for artificial intelligence to engage in rigorous, multi-step logical deduction remains a significant challenge. Researchers are now focusing on informal theorem proving – the process of constructing mathematical arguments in natural language – as a means of assessing and enhancing the reasoning capabilities of large language models (LLMs). A collaborative team from Tencent and Shanghai Jiao Tong University, led by Ziyin Zhang, Jiahao Xu, and Zhiwei He, alongside Tian Liang, Qiuzhi Liu, Yansi Li, Linfeng Song, Zhengwen Liang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Dong Yu, and Haitao Mi, present their work in the article ‘DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning’. They detail a new framework, DeepTheorem, incorporating a substantial dataset of 121,000 informally stated mathematical theorems and a reinforcement learning strategy designed to improve the robustness and accuracy of LLM-driven mathematical inference.
DeepTheorem: A Framework to Enhance LLM Capabilities in Informal Mathematical Proof
The introduction of DeepTheorem marks a notable development in automated theorem proving: a new framework designed to improve the performance of large language models (LLMs) on informal mathematical proofs. Traditional automated theorem proving (ATP) systems typically rely on formal systems – rigorously defined logical languages and inference rules – and therefore fail to exploit the strengths of LLMs, which are trained on vast quantities of natural language text.
DeepTheorem centres on a newly constructed benchmark dataset comprising 121,000 high-quality theorems and proofs at the level of the International Mathematical Olympiad (IMO). Each theorem and proof is meticulously annotated with information regarding its correctness, difficulty, and mathematical topic. Crucially, the dataset also includes systematically generated verifiable variants of each theorem, allowing for more robust evaluation.
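To make the dataset design concrete, a single entry might look something like the sketch below. The field names and values are illustrative assumptions, not the dataset's actual schema; they simply mirror the annotations described above (correctness, difficulty, topic, and systematically generated verifiable variants).

```python
# Hypothetical sketch of one DeepTheorem-style record; field names are
# illustrative, not taken from the released dataset.
record = {
    "theorem": "For all positive reals a, b: a/b + b/a >= 2.",
    "proof": "By AM-GM, a/b + b/a >= 2 * sqrt((a/b) * (b/a)) = 2.",
    "correct": True,          # correctness annotation
    "difficulty": 3,          # e.g. an ordinal difficulty rating
    "topic": "inequalities",  # mathematical topic label
    "variants": [
        # systematically generated variants whose truth value is known
        # by construction, enabling verifiable evaluation
        {"statement": "For all positive reals a, b: a/b + b/a >= 3.",
         "provable": False},
        {"statement": "For all positive reals a, b: a/b + b/a > 1.",
         "provable": True},
    ],
}
```

Because each variant carries a known truth value, a model's verdict on it can be checked mechanically, without a human grader.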
A core innovation within DeepTheorem is RL-Zero, a reinforcement learning strategy specifically designed for informal theorem proving. Reinforcement learning is a type of machine learning in which an ‘agent’ learns to make decisions within an environment so as to maximise a reward. Unlike standard reinforcement learning approaches, RL-Zero utilises the systematically generated theorem variants to actively encourage sound mathematical inference within the LLM. This moves beyond simply verifying a given proof: the framework incentivises the model to develop robust reasoning processes.
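One way to picture how verified variants can drive a reward signal: each variant's ground-truth status is known from construction, so the model can be rewarded only when its prove/disprove verdict matches it. The function below is a minimal sketch of that idea, not the paper's actual RL-Zero implementation.

```python
def variant_reward(model_verdict: bool, variant_is_provable: bool) -> float:
    """Sketch of a verifiable binary reward: the model attempts to prove
    or disprove a theorem variant whose truth value is known by
    construction, and earns reward only when its verdict is correct.
    (Illustrative only; the paper's reward design may differ.)"""
    return 1.0 if model_verdict == variant_is_provable else 0.0

# The model claims a false variant is provable -> no reward.
print(variant_reward(True, False))   # 0.0
# The model correctly disproves a false variant -> full reward.
print(variant_reward(False, False))  # 1.0
```

The appeal of such a signal is that it needs no human grading or formal proof checker: correctness of the verdict is decidable from how the variant was generated.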
Researchers also introduce a suite of comprehensive evaluation metrics. These metrics assess not only the correctness of generated proofs but also the quality of individual reasoning steps, moving beyond simple pass/fail criteria to provide a nuanced understanding of the model’s mathematical reasoning.
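A metric of this kind might combine an outcome-level signal with a process-level one, as in the hypothetical sketch below; the function name and weighting are assumptions for illustration, not the paper's actual metrics.

```python
def proof_quality(step_scores: list[float], passed: bool) -> dict:
    """Illustrative combination of outcome- and process-level signals:
    a binary pass/fail for the final proof plus the mean score of the
    individual reasoning steps. Not the paper's actual metric."""
    step_quality = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return {"pass": passed, "step_quality": step_quality}

# A proof that reaches the right conclusion but has one shaky step:
print(proof_quality([1.0, 0.5, 1.0], passed=True))
```

Separating the two signals lets an evaluation distinguish a lucky correct answer from a proof whose every step is sound.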
Extensive experimentation demonstrates that DeepTheorem significantly improves LLM performance on theorem proving tasks compared to existing datasets and supervised fine-tuning methods. The framework achieves state-of-the-art accuracy and exhibits a marked improvement in the quality of reasoning displayed by the models. These findings suggest that DeepTheorem has the potential to substantially advance automated informal theorem proving and facilitate mathematical exploration.
👉 More information
🗞 DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning
🧠 DOI: https://doi.org/10.48550/arXiv.2505.23754
