Researchers are tackling the critical issue of overconfidence in increasingly autonomous artificial intelligence systems. Jiaxin Zhang, Caiming Xiong, and Chien-Sheng Wu, all from Salesforce AI Research, demonstrate that current calibration techniques fall short on complex, multi-step tasks, where errors accumulate and failures are often difficult to predict. Their work introduces a novel diagnostic framework, Holistic Trajectory Calibration (HTC), which assesses confidence across an entire task ‘trajectory’, from overall progress down to step-level stability, offering a significant step forward in ensuring AI reliability. By improving calibration and discrimination across multiple benchmarks and language models while also providing interpretability, transferability, and generalization, this research establishes a new, process-centric approach to building trustworthy autonomous agents.
HTC extracts rich, process-level features from an agent’s trajectory, ranging from macro dynamics to micro stability, providing a comprehensive understanding of its decision-making process.
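To make this concrete, here is a minimal sketch of what process-level feature extraction over a trajectory of per-step confidence scores might look like. The feature names and formulas are illustrative assumptions, not the paper’s exact definitions:

```python
# Illustrative sketch: summarize a per-step confidence trace into
# process-level features (names and formulas are assumptions).
import numpy as np

def extract_trajectory_features(step_conf: np.ndarray) -> dict:
    """Map a trajectory of per-step confidences to summary features."""
    n = len(step_conf)
    diffs = np.diff(step_conf)
    return {
        # Macro dynamics: overall drift of confidence across the task
        "confidence_gradient": float(np.polyfit(np.arange(n), step_conf, 1)[0]),
        "net_change": float(step_conf[-1] - step_conf[0]),
        # Micro stability: step-to-step volatility
        "volatility": float(diffs.std()) if diffs.size else 0.0,
        "max_drop": float(diffs.min()) if diffs.size else 0.0,
        # Position: the agent's state at the start and end of the run
        "early_mean": float(step_conf[: max(1, n // 4)].mean()),
        "final_confidence": float(step_conf[-1]),
    }
```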
Powered by a simple yet interpretable model, the framework surpasses strong baselines, offering not just improved accuracy but also crucial insights into why an agent succeeds or fails. This approach moves away from treating confidence as a static property of a single output, instead recognizing it as a compounding quantity that accumulates throughout a sequential process. The study highlights three advances that distinguish HTC from existing methods: interpretability, transferability, and generalization. HTC provides interpretability by exposing the signals behind failure, such as early-step entropy and confidence gradients, enabling transparent diagnosis and guiding agent design.
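The interpretable model itself could be as simple as a linear classifier over those features, whose signed coefficients expose the failure signals directly. The sketch below uses scikit-learn’s LogisticRegression as a stand-in; the paper’s actual model and feature set may differ:

```python
# Hypothetical calibrator: a linear model over trajectory features,
# so each coefficient reads as a success/failure signal.
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["confidence_gradient", "net_change", "volatility",
            "max_drop", "early_mean", "final_confidence"]

def fit_calibrator(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    """X: (n_trajectories, n_features); y: 1 = success, 0 = failure."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    for name, coef in zip(FEATURES, model.coef_[0]):
        print(f"{name:>20}: {coef:+.3f}")  # which signals drive the prediction
    return model

# Calibrated confidence for new runs:
# p_success = model.predict_proba(X_new)[:, 1]
```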
Furthermore, the framework enables transferability by applying across domains without retraining, reducing the need for costly, task-specific tuning. Experiments show that HTC effectively addresses the data scarcity inherent in agent training, where each trajectory represents an expensive execution involving LLM inference, tool interactions, and human evaluation. By focusing on sample-efficient and interpretable methods, the researchers have created a process-centric paradigm for confidence calibration. The work does not simply improve accuracy: both HTC variants substantially outperform inference-based baselines, with particularly large gains in Brier Score and AUROC.
On the most challenging tasks, HTC-Reduced achieved an ECE of 0.031 and a Brier Score of 0.09 on the HLE dataset, highlighting the benefit of sparsity in isolating universal uncertainty signals. Detailed analysis, presented in accompanying radar charts, provides a comprehensive overview of the framework’s performance across all eight datasets. Tests also show HTC’s robustness in small-data regimes: it consistently attains lower mean error and dramatically smaller variance across dataset sizes ranging from 100 to 400, where neural baselines often overfit or fluctuate heavily. The researchers recorded consistent and substantial improvements with HTC across six different LLMs on the SimpleQA dataset, ranging from GPT-4.1 to GPT-OSS-20B, even correcting the specific deficiencies of each model.
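For reference, the metrics quoted throughout (ECE, Brier Score, AUROC) can be computed from per-trajectory confidences and binary success labels as follows; this is a generic implementation, not the authors’ evaluation code:

```python
# Standard calibration and discrimination metrics (generic versions).
import numpy as np
from sklearn.metrics import roc_auc_score

def brier_score(conf: np.ndarray, correct: np.ndarray) -> float:
    """Mean squared gap between confidence and the 0/1 outcome."""
    return float(np.mean((conf - correct) ** 2))

def expected_calibration_error(conf, correct, n_bins: int = 10) -> float:
    """ECE: bin-weighted |accuracy - mean confidence| over confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf >= lo) & ((conf < hi) if hi < 1.0 else (conf <= hi))
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)

def discrimination_auroc(conf, correct) -> float:
    """AUROC: does confidence rank successful runs above failed ones?"""
    return float(roc_auc_score(correct, conf))
```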
The work confirms that HTC is model-agnostic and can serve as a plug-and-play module to enhance the reliability of various agentic systems, delivering significant gains on both lightweight smolagents and highly optimized OAgents architectures using GPT-4.1 on GPQA. Measurements confirm that the most predictive signals of failure are task-dependent: the relevant features and their relative importance shift with the cognitive demands of the task. For SimpleQA, a “search-then-synthesize” task, predictive power was balanced across Dynamics, Stability, and Position features, suggesting failure can occur at multiple stages. Conversely, for the complex reasoning task GPQA, feature importance was heavily concentrated in the Position category, indicating that the agent’s cognitive state at the beginning and end of the task is the most potent summary of the entire process. This diagnostic analysis reveals a general hierarchy of signals, providing deep insight into the nature of agentic failure and enabling a more interpretable approach to confidence calibration.
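One way to reproduce this kind of per-task diagnostic, assuming the linear calibrator sketched earlier, is to aggregate absolute coefficient mass by feature category; the category-to-feature mapping here is hypothetical:

```python
# Illustrative: share of total |coefficient| mass per feature category.
import numpy as np

CATEGORIES = {
    "Dynamics":  ["confidence_gradient", "net_change"],
    "Stability": ["volatility", "max_drop"],
    "Position":  ["early_mean", "final_confidence"],
}

def category_importance(model, feature_names: list) -> dict:
    """Fraction of absolute coefficient weight carried by each category."""
    weights = dict(zip(feature_names, np.abs(model.coef_[0])))
    total = sum(weights.values())
    return {cat: sum(weights[f] for f in feats) / total
            for cat, feats in CATEGORIES.items()}

# A SimpleQA calibrator might yield a balanced split, while a GPQA one
# concentrates mass in "Position", mirroring the analysis above.
```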
HTC reveals failure signals in LLM agents
The researchers demonstrate that advanced language models, as they transition into autonomous agents, exhibit overconfidence even when failing at tasks. HTC uniquely extracts process-level features, spanning macro dynamics to micro stability, across entire task trajectories to assess reliability. Beyond improved performance, HTC offers interpretability by revealing the signals indicative of failure, and enables transferability by functioning effectively across different domains without retraining. A General Calibrator (GAC) component achieves state-of-the-art calibration on the out-of-domain GAIA benchmark, demonstrating strong generalization capabilities.
Analysis of feature importance reveals that for tasks like SimpleQA, failure signals are distributed across dynamics, stability, and position, while for GPQA, a task demanding complex reasoning, they concentrate in the agent’s initial and final cognitive states. The study confirms that a hierarchical diagnostic approach is crucial, with positional features serving as primary failure indicators, complemented by stability and dynamics assessments throughout the process. While no single feature category is sufficient, combining them substantially improves performance, highlighting the value of integrating diverse diagnostic signals, as the ablation sketch below illustrates.
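Such a claim is typically tested with a category ablation: fit the calibrator on each feature subset and on their combinations, then compare discrimination. A minimal sketch, assuming the feature groupings above are given as column indices into the feature matrix:

```python
# Category ablation sketch: train on every subset of feature categories
# and compare AUROC; the full combination is expected to score highest.
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def ablate(X_tr, y_tr, X_te, y_te, col_groups: dict) -> dict:
    """col_groups maps category name -> list of column indices."""
    results = {}
    names = list(col_groups)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            cols = [c for name in combo for c in col_groups[name]]
            model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
            conf = model.predict_proba(X_te[:, cols])[:, 1]
            results["+".join(combo)] = roc_auc_score(y_te, conf)
    return results
```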
👉 More information
🗞 Agentic Confidence Calibration
🧠 ArXiv: https://arxiv.org/abs/2601.15778
