Scientists are tackling the limitations of large language models (LLMs) in reasoning tasks, addressing their tendency towards ‘blind self-thinking’ even when information is incomplete. Xin Chen from Nanjing University, Feng Jiang from Shenzhen University of Advanced Technology, and Yiqian Zhang from the Chinese Academy of Sciences, alongside colleagues, present a new paradigm called Proactive Interactive Reasoning (PIR) that transforms LLMs into proactive inquirers. This research is significant because PIR enables models to interleave reasoning with direct clarification from the user, unlike existing methods that rely on external knowledge sources. Through uncertainty-aware fine-tuning and user-simulator-based policy optimisation, the team demonstrates substantial improvements in mathematical reasoning, code generation and document editing, with up to 32.70% higher accuracy, alongside a reduction in computational cost and unnecessary interactions.
The research addresses a critical limitation of current reasoning LLMs, termed ‘blind self-thinking’, where models continue reasoning even when crucial information is missing or unclear, leading to inaccuracies and inefficiencies. Unlike existing approaches that rely on external knowledge sources, PIR directly engages with the user to resolve premise- and intent-level uncertainties, fundamentally altering how LLMs approach complex tasks. This innovative framework is implemented through two key components: an uncertainty-aware supervised fine-tuning procedure and a user-simulator-based policy optimization framework, both designed to equip models with interactive reasoning capabilities and align their behaviour with user intent.
The team achieved this breakthrough by first developing a mechanism to detect critical points within the reasoning process where the model’s confidence is low. At these junctures, the system injects clarification questions, effectively converting a standard reasoning trace into an interactive ‘think-and-ask’ format, with simulated user responses generated by instruction-following LLMs. Subsequently, a novel Group Relative Policy Optimization (US-GRPO) framework, driven by a dynamic user simulator, was introduced to further refine the model’s behaviour. This framework utilizes a composite reward system, balancing task success with helpfulness and efficiency, to encourage the model to prioritize clarifying ambiguities over continuing with potentially flawed reasoning.
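To make the data-construction step concrete, here is a minimal sketch of how a plain reasoning trace could be converted into an interleaved ‘think-and-ask’ trace. It is an illustration under assumptions, not the authors’ code: the confidence signal, question generator, and user simulator are stand-in callables (`step_confidence`, `make_question`, `simulate_user`), and the 0.5 threshold is arbitrary.

```python
# Illustrative sketch (not the paper's implementation): convert a plain reasoning
# trace into an interleaved "think-and-ask" trace. The confidence signal and the
# question/answer generators are placeholders for model calls.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    role: str   # "think", "ask", or "user"
    text: str

def build_think_and_ask_trace(
    reasoning_steps: List[str],
    step_confidence: Callable[[str], float],   # e.g. mean token probability of the step
    make_question: Callable[[str], str],       # drafts a clarification question for a step
    simulate_user: Callable[[str], str],       # instruction-following LLM standing in for the user
    threshold: float = 0.5,                    # assumed cut-off for "low confidence"
) -> List[Turn]:
    """At each low-confidence juncture, inject a clarification question and a
    simulated user reply before continuing with the original reasoning step."""
    trace: List[Turn] = []
    for step in reasoning_steps:
        if step_confidence(step) < threshold:
            question = make_question(step)
            trace.append(Turn("ask", question))
            trace.append(Turn("user", simulate_user(question)))
        trace.append(Turn("think", step))
    return trace
```

The resulting traces can then serve as supervision for the uncertainty-aware fine-tuning stage, teaching the model when a question is worth asking.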
The result is a system that not only solves problems more accurately but also minimizes unnecessary computational steps and conversational turns. Experiments conducted across mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baseline models. Specifically, the study reveals improvements of up to 32.70% in accuracy, 22.90% in pass rate, and 41.36 points in BLEU score, alongside a substantial reduction in reasoning computation (nearly halving the number of tokens processed) and in unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and scenarios involving missing premises confirm the strong generalization and robustness of PIR, indicating its potential for broader application beyond interactive settings.
This research establishes a new direction for reasoning LLMs, moving beyond passive problem-solving towards a more collaborative and efficient approach. The work opens possibilities for more intuitive and effective human-computer interaction, particularly in complex domains where ambiguity is common and accurate information is paramount. By proactively seeking clarification, PIR not only improves the reliability of LLM outputs but also reduces the burden on users, paving the way for more seamless and productive collaborations with artificial intelligence. The study pioneers a method where models actively seek information from users when facing ambiguity, rather than relying solely on internal reasoning or external knowledge retrieval. This work addresses the problem of ‘blind self-thinking’ in current LLMs, where models persist with reasoning despite lacking critical information. Researchers implemented PIR through two key components: uncertainty-aware supervised fine-tuning and a user-simulator-based policy optimization framework.
Initially, the team engineered an uncertainty-aware supervised fine-tuning procedure to equip the LLM with interactive reasoning capabilities. This involved augmenting training data with scenarios designed to elicit clarification requests, focusing on premise and intent-level uncertainties. The researchers then developed a reinforcement learning method, termed US-GRPO, incorporating a dynamic user simulator and a composite reward system. US-GRPO optimises interaction efficiency and aligns reasoning with user intent, utilising both extrinsic and intrinsic rewards to guide the model’s behaviour. The user simulator was designed to respond realistically to the model’s queries, providing feedback that mimics human interaction.
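The composite reward and group-relative update can be illustrated with a small sketch. The reward terms, weights, and token budget below (`w_task`, `w_help`, `w_cost`) are assumptions chosen for illustration; US-GRPO combines extrinsic task success with intrinsic helpfulness and efficiency signals, but its exact formulation may differ from this sketch.

```python
# Illustrative sketch (assumed reward shaping, not the paper's exact formula):
# a composite reward balancing task success, helpfulness of questions, and
# compute efficiency, followed by GRPO-style group-relative advantages over a
# group of rollouts sampled for the same prompt.
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    solved: bool        # extrinsic: did the final answer pass the task checker?
    helpful_asks: int   # intrinsic: clarification questions judged helpful
    total_asks: int     # all clarification questions issued
    tokens: int         # reasoning tokens spent

def composite_reward(r: Rollout,
                     w_task: float = 1.0,
                     w_help: float = 0.3,
                     w_cost: float = 0.1,
                     token_budget: int = 2000) -> float:
    task = w_task if r.solved else 0.0
    helpfulness = w_help * (r.helpful_asks / r.total_asks) if r.total_asks else 0.0
    cost = w_cost * min(r.tokens / token_budget, 1.0)   # penalize long, aimless reasoning
    return task + helpfulness - cost

def group_relative_advantages(group: List[Rollout]) -> List[float]:
    """GRPO-style update signal: normalize each rollout's reward against the group."""
    rewards = [composite_reward(r) for r in group]
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(x - mean) / std for x in rewards]
```

In this framing, a rollout that asks a well-targeted question and then solves the task earns more than one that burns tokens on unguided reasoning, which is the behaviour the optimisation stage is meant to encourage.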
Experiments employed mathematical reasoning, code generation, and document editing tasks to evaluate PIR’s performance. The system delivers 32.70% higher accuracy, a 22.90% higher pass rate, and a 41.36-point BLEU improvement compared with strong baseline models. Crucially, the approach reduces reasoning computation by nearly half and minimises unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirmed the strong generalisation and robustness of PIR. The study also demonstrated that PIR achieves strong performance on non-interactive benchmarks, suggesting the benefits of proactive interactive reasoning extend beyond interactive settings. The research addresses a limitation of current reasoning-oriented LLMs: their tendency towards “blind self-thinking” even when information is incomplete or ambiguous. Unlike existing approaches that rely on external knowledge sources, PIR focuses on directly interacting with the user to clarify uncertainties at the premise and intent levels. This is achieved through uncertainty-aware supervised fine-tuning and a user-simulator-based policy optimization framework.
Experiments conducted on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baseline models. Specifically, the team measured up to 32.70% higher accuracy on the MATH-Chat dataset, a 22.90% higher pass rate on BigCodeBench-Chat, and a 41.36-point BLEU improvement on DocEdit-Chat. Crucially, PIR also reduces computational load, decreasing reasoning computation by nearly half and minimizing unnecessary interaction turns. Data show that the model maintains low token usage, averaging between 1.3k and 1.7k tokens per task, and records the fewest turns to resolution, indicating efficient clarification.
Further reliability evaluations across factual knowledge, question answering, and scenarios with missing premises confirm the strong generalization and robustness of PIR. The study recorded improvements on the MMLU benchmark, achieving a score of 62.51 with 0.77k tokens, and on TriviaQA, reaching 52.87 with 1.32k tokens. Measurements confirm that PIR enhances performance on the MIP-GSM8K dataset, with an accuracy of 35.93 at 0.83k tokens, and on MIP-MATH, achieving an EM score of 17.35 at 0.80k tokens. The researchers implemented an interactive environment using Llama-3.1-8B-Instruct as a user simulator, with a maximum of five interaction turns. Evaluation metrics included accuracy, pass rate, BLEU score, average token count, turns to resolution, and a helpfulness-of-asking metric based on the H_LLM(r) reward. Results demonstrate that upgrading the user simulator to gpt-4o-mini improved accuracy from 32.70 to 34.00, while increasing token usage to 1.85k and slightly raising the turns to resolution to 2. Current reasoning models often engage in extensive internal reasoning even when crucial information is lacking or unclear, a phenomenon termed “blind self-thinking”. PIR transforms LLMs into proactive inquirers, enabling them to interleave reasoning with requests for clarification from the user. This approach utilises uncertainty-aware supervised fine-tuning and a user-simulator-based policy optimisation framework to equip LLMs with interactive reasoning capabilities.
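A minimal sketch of the interactive evaluation loop described above may help: the policy model alternates reasoning and asking, a separate LLM plays the user (the study uses Llama-3.1-8B-Instruct), and each episode is capped at five turns while tokens and turns to resolution are tracked. The function names and the whitespace token proxy are illustrative placeholders, not the benchmark’s actual harness.

```python
# Illustrative sketch of an interactive evaluation episode (not the actual harness):
# the policy either asks a clarification question or commits to an answer, a
# user-simulator LLM answers questions, and the episode ends after a fixed turn cap.
from typing import Callable, Tuple

def run_episode(
    problem: str,
    policy_step: Callable[[str], Tuple[str, str]],  # returns (kind, text), kind in {"ask", "answer"}
    user_reply: Callable[[str], str],               # user-simulator LLM
    is_correct: Callable[[str], bool],              # task-specific checker (accuracy / pass / BLEU threshold)
    max_turns: int = 5,                             # turn cap used in the study's setup
) -> dict:
    context, turns, tokens = problem, 0, 0
    while turns < max_turns:
        kind, text = policy_step(context)
        tokens += len(text.split())                 # crude token proxy for the sketch
        if kind == "answer":
            return {"correct": is_correct(text), "turns_to_resolution": turns, "tokens": tokens}
        reply = user_reply(text)                    # clarification answered by the simulator
        context += f"\n[ASK] {text}\n[USER] {reply}"
        turns += 1
    return {"correct": False, "turns_to_resolution": max_turns, "tokens": tokens}
```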
Experiments across mathematical reasoning, code generation, and document editing demonstrate that PIR consistently surpasses existing methods, achieving improvements of up to 32.70% in accuracy, 22.90% in pass rate, and 41.36 points in BLEU score. Importantly, PIR also reduces computational demands, decreasing reasoning length by approximately 2k tokens per task and halving unnecessary interaction turns. Reliability evaluations on factual knowledge, question answering, and scenarios with missing information confirm the robustness and generalisability of PIR. The authors acknowledge that the performance of the user simulator is a potential limitation, as real user responses may differ. Future research directions include exploring more sophisticated user simulators and investigating the application of PIR to more complex, real-world tasks. This work establishes a more efficient and user-aligned paradigm for reasoning LLMs, moving beyond passive problem-solving towards proactive inquiry and clarification.
👉 More information
🗞 Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers
🧠 ArXiv: https://arxiv.org/abs/2601.22139
