AI Can Now ‘Know’ When It’s About to Fail Before Answering Questions

Researchers are increasingly focused on optimising the computational cost of large language models (LLMs) during complex reasoning tasks. William Lugoloobi, Thomas Foster, and William Bankes, from the University of Oxford and University College London, together with Chris Russell and colleagues, demonstrate that LLMs internally encode information about their likely success before generating an answer. This research is significant because it reveals that pre-generation activations can be used to predict performance on mathematical and other tasks, exceeding the predictive power of traditional features such as question length. By analysing model performance on the E2H-AMC benchmark, the team shows that LLMs develop a distinct sense of difficulty, one that differs from human perception and that diverges further from human judgment as reasoning extends. Ultimately, they show that these internal signals enable intelligent query routing, cutting inference costs by up to 70% on the MATH dataset.

Pre-generation activations reveal internal self-assessment of problem-solving potential in large language models

Researchers have discovered that large language models internally assess their own likelihood of success before generating outputs. This breakthrough addresses a critical challenge in the field: efficiently allocating computational resources to problems that genuinely require them. The study demonstrates that these internal assessments, encoded within the models’ pre-generation activations, can be accurately predicted using linear probes.

These probes substantially outperform traditional features like question length or term frequency-inverse document frequency in predicting performance on both mathematical and coding tasks. Leveraging the E2H-AMC benchmark, which uniquely provides both human and model performance data on identical problems, the work reveals a crucial distinction.

Models encode a notion of difficulty that diverges from human judgment, and this divergence becomes more pronounced as the complexity of reasoning increases. This suggests that models are not simply mimicking human problem-solving strategies, but are developing their own internal metrics for assessing challenge.

The research team trained linear probes on these pre-generation activations to predict task success, achieving Spearman correlation coefficients of 0.83 and 0.87 for human difficulty, and 0.40 and 0.64 for model difficulty. Furthermore, the researchers demonstrate the practical implications of this discovery.
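For readers unfamiliar with the metric: Spearman's rank correlation is simply the Pearson correlation of the ranks, so it measures whether the probe orders questions by difficulty correctly, regardless of scale. A minimal sketch with made-up illustrative values (not the paper's data):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)  # rank of each element in a
    rb = np.argsort(np.argsort(b)).astype(float)  # rank of each element in b
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Toy example: probe predictions that roughly track difficulty
# should rank-correlate highly even if the raw values differ.
difficulty = np.array([0.10, 0.40, 0.35, 0.80, 0.90])
probe_pred = np.array([0.15, 0.30, 0.50, 0.70, 0.95])
rho = spearman(difficulty, probe_pred)
```

Because only ranks matter, a probe can be miscalibrated in absolute terms and still score well on this metric.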

By routing queries across a pool of models guided by these internal difficulty assessments, they achieved performance exceeding that of the best single model, while simultaneously reducing inference costs by up to 70% on the MATH dataset. This probe-guided routing also yielded similar gains on AIME and GSM8K, indicating the potential for significant efficiency improvements in real-world applications.

Binary classification of success, using fixed decoding policies, achieved an area under the receiver operating characteristic curve exceeding 0.7 for several models, demonstrating robust predictive power. The team’s code is publicly available, facilitating further research and development in this rapidly evolving field.

Quantifying model and human difficulty assessments using E2H-AMC data and linear probes

Linear probes trained on pre-generation activations were central to this work, enabling prediction of policy-specific success on both mathematics and coding tasks. These probes substantially outperformed traditional features such as question length and term frequency-inverse document frequency scores in assessing problem difficulty.

The research leveraged the E2H-AMC dataset, which uniquely provides both human Item Response Theory (IRT) scores and model performance metrics on identical problems, allowing for a direct comparison of difficulty assessments. This facilitated the demonstration that models encode a notion of difficulty distinct from human perception, a divergence that intensifies with extended reasoning.

To quantify model difficulty, two primary targets were defined. The expected success rate, s(π, q), was calculated as the average success across K Monte Carlo rollouts, using K = 50 samples drawn from the decoding policy π for each question q. This provided a continuous ranking of questions based on anticipated model performance.
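The expected-success target can be sketched in a few lines. Here `toy_rollout` is a hypothetical stand-in for sampling one answer from the decoding policy π and checking it, not the authors' implementation:

```python
import random

def expected_success_rate(question, rollout_fn, k=50):
    """Estimate s(pi, q): the mean success over k Monte Carlo rollouts
    drawn from the decoding policy for question q."""
    successes = [rollout_fn(question) for _ in range(k)]
    return sum(successes) / k

# Toy stand-in for a rollout: returns 1 on success, 0 on failure,
# simulating a policy that solves this question ~80% of the time.
def toy_rollout(question):
    return 1 if random.random() < 0.8 else 0

random.seed(0)
rate = expected_success_rate("q1", toy_rollout, k=50)
```

With K = 50 rollouts per question, the estimate is fine-grained enough to rank questions continuously rather than just labelling them solved or unsolved.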

Linear probes were trained on pre-generation activations to predict both human IRT difficulty and model difficulty, using the defined success metrics. The study systematically investigated how test-time scaling, through techniques like majority voting and extended reasoning, affected the accessibility of difficulty information within these pre-generation activations.

Findings revealed that probe quality diminished as the reasoning budget increased, with AUROC falling from 0.78 to 0.64 even as accuracy improved from 86.6% to 92.0%. This has implications for adaptive inference systems that rely on pre-generation difficulty estimates.

Model difficulty prediction diverges from human assessment with reasoning depth

Linear probes trained on pre-generation activations achieve an Area Under the Receiver Operating Characteristic curve (AUROC) exceeding 0.7 for several large language models when predicting success on reasoning tasks. This supervised approach demonstrates stronger discrimination compared to previous unsupervised methods, which typically achieve AUROC scores of approximately 0.6 to 0.7 on mathematical reasoning benchmarks.

The research establishes a clear distinction between human and model-specific notions of difficulty using the E2H-AMC dataset, revealing that this divergence intensifies with extended reasoning. Specifically, the study demonstrates that models encode a difficulty signal distinct from human assessments, particularly as reasoning complexity increases.

Linear probes successfully predict both human Item Response Theory (IRT) difficulty and model difficulty, though these represent separate signals. Expected success rates, estimated via 50 Monte Carlo rollouts, provide a continuous measure of model-specific difficulty, ranking questions by anticipated performance.

Binary success prediction, using a fixed decoding policy, further enables direct application for routing purposes. Routing queries across a pool of models, guided by these probes, surpasses the performance of any single, best-performing model while reducing inference costs by up to 70% on the MATH dataset.

Probe quality, however, degrades with increased reasoning budget, with AUROC falling from 0.78 to 0.64 even as accuracy rises from 86.6% to 92.0%. This finding highlights the importance of reliable difficulty estimates for adaptive inference systems. The probe-based routing approach achieves cost savings ranging from 17% to 70% without requiring additional generation at routing time, identifying probe reliability, rather than routing sophistication, as the primary limitation.

Predicting language model performance via internal activation analysis and efficient query routing

Scientists have developed a method to estimate a large language model's likelihood of success before it begins generating an answer. The estimate is obtained by training simple linear probes on the model's internal activations, enabling prediction of performance on mathematical and reasoning tasks.

The probes outperform traditional metrics like question length and term frequency-inverse document frequency in assessing difficulty. Investigations reveal that language models possess a distinct notion of difficulty that diverges from human perceptions, and this difference becomes more pronounced with increased reasoning complexity.

By utilising these probes to route queries across a diverse pool of models, researchers demonstrated substantial reductions in inference costs, up to 70% on the MATH dataset, while maintaining or exceeding the performance of any single model. The system effectively selects cost-efficient models based on estimated success, favouring cheaper options for easier tasks and more capable models for challenging ones.
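One simple way to realise such probe-guided routing is a cost-ordered threshold rule: try the cheapest model whose probe-predicted success clears a bar, and fall back to the most capable model otherwise. The model names, costs, and threshold below are illustrative assumptions, not the paper's routing policy:

```python
def route(query, models, probe_scores, threshold=0.7):
    """Pick the cheapest model predicted to succeed on this query.

    models: list of (name, cost_per_query) pairs.
    probe_scores: dict mapping model name -> probe-predicted success
    probability for this query (computed from pre-generation activations).
    """
    for name, cost in sorted(models, key=lambda m: m[1]):
        if probe_scores[name] >= threshold:
            return name
    # No model clears the bar: fall back to the most capable (costliest).
    return max(models, key=lambda m: m[1])[0]

# Hypothetical three-model pool with per-query costs.
models = [("small", 1.0), ("medium", 4.0), ("large", 10.0)]

# Probe scores for an easy and a hard query (illustrative values).
easy = {"small": 0.90, "medium": 0.95, "large": 0.99}
hard = {"small": 0.20, "medium": 0.50, "large": 0.80}

cheap_pick = route("easy question", models, easy)
strong_pick = route("hard question", models, hard)
```

Note that no generation is needed at routing time: the probe scores come from a single forward pass over the prompt, which is where the reported 17% to 70% cost savings originate.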

Limitations acknowledged by the researchers include the current reliance on linear probes at a single point in the processing sequence and the degradation of probe reliability as computational demands increase. Future work will focus on exploring non-linear probes, analysing activations at multiple stages of generation, and investigating the potential for transferring probes across different tasks and datasets.

Further refinement of routing policies, potentially incorporating adaptive strategies, may also improve performance and close the gap between current results and optimal oracle-level performance. These findings suggest that improving the accuracy of difficulty estimation is key to unlocking further efficiency gains in large language model inference.

👉 More information
🗞 LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
🧠 ArXiv: https://arxiv.org/abs/2602.09924

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
