Large language models are increasingly relied upon for complex problem-solving, but their inherent limitations in planning and reasoning remain a significant challenge. Researchers Bernd Bohnet, Pierre-Alexandre Kamienny, and Hanie Sedghi, all from Google DeepMind, alongside Dilan Gorur, Pranjal Awasthi, and Aaron Parisi, have developed a novel approach to address this issue, focusing on enabling models to critique their own outputs. Their work demonstrates substantial performance gains on established planning benchmarks, including Blocksworld, Logistics, and Mini-grid datasets, even surpassing strong existing baselines. This research is particularly significant as it achieves these improvements through ‘intrinsic self-critique’, a process of internal evaluation, without relying on external verification tools, offering a pathway to more robust and self-improving artificial intelligence systems. The team’s findings establish new state-of-the-art results for LLM checkpoints from October 2024 and highlight the potential for further advancements when applied to more sophisticated models and search techniques.
Prior work has cast doubt on whether Large Language Models (LLMs) can benefit from self-critique methods. This research demonstrates significant performance gains on planning datasets within the Blocksworld domain through intrinsic self-critique, achieved without reliance on external sources such as a verifier. Similar improvements were observed on the Logistics and Mini-grid datasets, exceeding the accuracies of strong baseline models. The researchers start from a few-shot learning technique and progressively extend it to a many-shot approach as a base method, demonstrating that substantial improvement is possible through an iterative process of correction and refinement that significantly boosts planning performance.
LLM Self-Critique for Enhanced Planning Performance
The study pioneered a novel approach to enhance Large Language Model (LLM) planning capabilities through intrinsic self-critique, enabling models to evaluate and refine their own generated plans. Researchers engineered an iterative process in which LLMs generate plans and then critique them, leveraging in-context learning to incorporate self-generated feedback. This method distinguishes itself by operating without external verification sources, addressing limitations identified in earlier work that relied on external oracles for critique. The team demonstrated performance gains across multiple planning datasets, including Blocksworld, Logistics, and Mini-grid, surpassing strong baseline accuracies.
Initially, the work employed a few-shot learning technique, providing the LLM with limited examples of plan generation and self-critique. This foundation was then progressively extended to a many-shot approach, significantly improving performance through iterative correction and refinement. Experiments utilized LLM model checkpoints from October 2024 as the basis for evaluation, establishing new state-of-the-art results. The researchers focused on demonstrating the intrinsic self-improvement capabilities of the method, highlighting its applicability across different LLM versions and potential for further gains with more complex search techniques.
The experimental setup involved testing the LLMs on planning problems of varying complexity, including Blocksworld scenarios with 3-5 and 3-7 blocks, alongside the Logistics and Mini-grid datasets. The self-critique mechanism was designed to provide a correctness assessment and justification for each generated plan, forming a feedback loop that drove iterative improvement. By aggregating previous plans and their critiques, the system built up contextual material for subsequent plan-generation cycles, effectively learning from its own mistakes. This approach overcomes previous challenges with LLM self-evaluation, achieving lower false-positive rates and improved error detection.
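The paper's exact prompts and model interface are not reproduced here; the sketch below is only an illustration of a generate-critique-refine loop of this kind, assuming a hypothetical `llm` callable that returns model text, with the ten-iteration cap mentioned later in the article.

```python
from typing import Callable, List, Tuple

def self_critique_planning(
    problem: str,
    few_shot_examples: str,
    llm: Callable[[str], str],  # hypothetical stand-in for the actual model API
    max_iters: int = 10,        # the article notes iterations were capped at ten
) -> str:
    """Generate a plan, self-critique it, and refine using accumulated feedback."""
    history: List[Tuple[str, str]] = []  # (plan, critique) pairs fed back as context
    plan = llm(f"{few_shot_examples}\n\nProblem:\n{problem}\n\nPlan:")
    for _ in range(max_iters):
        critique = llm(
            f"{few_shot_examples}\n\nProblem:\n{problem}\n\n"
            f"Candidate plan:\n{plan}\n\n"
            "Is this plan correct? Answer 'correct' or 'incorrect' and justify."
        )
        if critique.strip().lower().startswith("correct"):
            return plan  # the model judges its own plan valid; no external verifier used
        history.append((plan, critique))
        feedback = "\n\n".join(f"Previous plan:\n{p}\nCritique:\n{c}" for p, c in history)
        plan = llm(
            f"{few_shot_examples}\n\nProblem:\n{problem}\n\n{feedback}\n\n"
            "Write a corrected plan:"
        )
    return plan  # fall back to the last attempt if none was judged correct
```

Here `few_shot_examples` stands in for the few-shot (later many-shot) exemplars of plans and critiques described above; every piece of feedback the loop consumes is generated by the model itself.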
The study’s success is particularly notable given earlier findings that questioned the effectiveness of LLM self-critique methods. Scientists harnessed the power of iterative refinement to demonstrate substantial performance gains, even in domains traditionally dominated by algorithmic planning. The research suggests that while a gap remains between LLMs and classic planners for highly complex tasks, this method significantly enhances LLM capabilities in areas like natural language planning, where classic planners often struggle. The team believes that applying this self-critique method to more capable models will unlock even greater performance improvements.
LLM Self-Critique Boosts Planning Performance Significantly
Scientists have demonstrated a novel approach to enhance the performance of Large Language Models (LLMs) through intrinsic self-critique, achieving significant improvements on established planning benchmarks. The research team successfully enabled LLMs to critique their own answers, leading to substantial gains in the Blocksworld domain without relying on external verification sources. The self-critique method initially employed a few-shot learning technique and was then extended to a many-shot approach, further boosting performance on complex planning tasks. The study’s results demonstrate new state-of-the-art performance using LLM model checkpoints from October 2024, showcasing the method’s applicability across various model versions.
Data shows consistent improvements were also achieved on the Logistics and Mini-grid datasets, exceeding strong baseline accuracies previously recorded in these areas. The team measured a marked increase in planning success rates through iterative correction and refinement, highlighting the power of self-critique to significantly boost performance. This iterative process involved the LLM generating plans and then evaluating their correctness, leveraging domain knowledge to identify and rectify errors. Measurements confirm that the developed method is not limited to simplified problems; while earlier LLM tests often focused on Blocksworld scenarios with 3-5 or 3-7 blocks, this work demonstrates effectiveness on more challenging tasks.
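To make the critique step concrete, a self-critique prompt for a small Blocksworld instance might look roughly like the following; the wording is illustrative and not taken from the paper.

```python
# Illustrative Blocksworld-style self-critique prompt (not the paper's actual wording).
# The model is asked to judge its own candidate plan step by step, with no external verifier.

CRITIQUE_TEMPLATE = """\
Domain: Blocksworld.
Initial state: {initial_state}
Goal: {goal}

Candidate plan:
{plan}

Check the plan step by step: is each action applicable in the state it is applied to,
and does the final state satisfy the goal?
Answer 'correct' or 'incorrect', then explain which step (if any) fails.
"""

prompt = CRITIQUE_TEMPLATE.format(
    initial_state="A on table, B on A, C on table; hand empty",
    goal="C on B",
    plan="1. pick up C\n2. stack C on B",
)
```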
The breakthrough delivers an intrinsic self-improvement capability, allowing LLMs to refine their responses without external feedback or further training. Tests prove that by incorporating previous failures as contextual material in subsequent plan generation cycles, the LLM iteratively improves its outputs, leading to more accurate and effective plans. Researchers illustrate that this approach is particularly valuable for natural-language tasks like holiday planning or meeting scheduling, where classical planners often struggle due to the less structured nature of the input. The work presents a promising pathway for bridging the gap between LLM planning capabilities and those of traditional algorithmic planners, even in complex problem spaces. Scientists believe that applying this method to more sophisticated search techniques and more capable models will unlock even greater performance gains in the future.
Self-Critique Boosts LLM Planning Performance Significantly
This work demonstrates that large language models can significantly enhance their performance on standard planning benchmarks through intrinsic self-critique. Substantial gains were achieved across multiple datasets (Blocksworld, Logistics, and Mini-grid), with a new state-of-the-art result of an 89.3% success rate on Blocksworld 3-5 when employing self-critique alongside self-consistency. Notably, this research represents the first demonstration of LLMs solving Mystery Blocksworld problems with 22% accuracy, improving to 37.8% with the implemented self-improvement techniques. The findings establish the viability of self-critique as a method for improving planning accuracy within language models, bridging a gap between symbolic planning and LLM capabilities. While the study focused on model checkpoints from October 2024, the authors acknowledge a limitation in the context length required for iterative critique, addressed by limiting iterations to ten steps. Future research could explore integrating this self-critique method with more sophisticated planning techniques, such as Chain-of-Thought or Monte-Carlo Tree Search, potentially unlocking even greater performance gains and enabling LLMs to tackle increasingly complex real-world planning scenarios.
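The 89.3% Blocksworld 3-5 result combines self-critique with self-consistency; one plausible way to wire the two together (not necessarily the authors' exact procedure) is to sample several independent plans, refine each with the self-critique loop, and return the most frequent refined answer:

```python
from collections import Counter
from typing import Callable, List

def self_consistent_plan(
    problem: str,
    sample_plan: Callable[[str], str],                     # draws one independent plan from the model
    refine_with_self_critique: Callable[[str, str], str],  # e.g. the loop sketched earlier
    num_samples: int = 5,                                  # assumed sample count, not from the paper
) -> str:
    """Combine self-critique with self-consistency: majority vote over refined plans."""
    refined: List[str] = [
        refine_with_self_critique(problem, sample_plan(problem)) for _ in range(num_samples)
    ]
    # Self-consistency: keep the refined plan produced most often across samples.
    return Counter(refined).most_common(1)[0][0]
```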
👉 More information
🗞 Enhancing LLM Planning Capabilities through Intrinsic Self-Critique
🧠 ArXiv: https://arxiv.org/abs/2512.24103
