Distribution-Calibrated Inference-Time Compute Improves Thinking LLM-as-a-Judge Pairwise Preference Accuracy

The challenge of reliably evaluating large language models remains a significant hurdle in artificial intelligence, as even sophisticated models produce noisy judgments when assessing other systems. Hamid Dadkhahi, Firas Trabelsi, and Parker Riley, alongside colleagues from Google and DeepMind, address this problem by investigating how computational resources dedicated to evaluation, known as inference-time compute, affect the accuracy of these judgments. Their work demonstrates that the combination of carefully allocated compute and a novel, statistically grounded aggregation scheme transforms unreliable individual assessments into robust, dependable ratings. This distribution-calibrated approach not only improves performance across standard evaluation benchmarks, reducing errors and increasing accuracy, but also achieves results comparable to, and sometimes exceeding, those of human evaluators, representing a substantial advance in automated model assessment.

Calibrating LLM Outputs with Voting Distributions

This research details a method for improving the reliability of Large Language Model (LLM) outputs through a distribution calibration technique. The central idea is that LLMs, even with careful prompting, can exhibit biases in their output distributions, leading to inaccurate results. The method, built on the Bradley-Terry-Davidson (BTD) model, calibrates these distributions by analyzing multiple sampled responses and optimizing how they are aggregated, aligning the result with the true underlying answer distribution and specifically mitigating biases in the LLM's voting distribution. Experiments demonstrate that the BTD-based approach consistently improves accuracy across various tasks, including machine translation and reasoning problems, outperforming the standard Self-Consistency method.
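To make the contrast concrete, here is a minimal Python sketch comparing plain majority voting (the Self-Consistency baseline) with a distribution-aware rule that only declares a winner when the vote share clears a calibrated threshold. The function names and the fixed threshold are illustrative assumptions, not the paper's exact procedure.

from collections import Counter

def majority_vote(votes):
    """Self-consistency baseline: return the most frequent label.

    votes: labels in {"A", "B", "tie"} from independent judge samples.
    """
    return Counter(votes).most_common(1)[0][0]

def calibrated_decision(votes, threshold=0.6):
    """Illustrative distribution-aware rule (hypothetical threshold):
    declare a winner only if its empirical vote share clears the
    calibrated threshold; otherwise fall back to a tie verdict.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= threshold else "tie"

votes = ["A", "A", "tie", "B", "A", "tie", "A", "B"]
print(majority_vote(votes))        # -> "A"
print(calibrated_decision(votes))  # -> "tie" (4/8 = 0.5 < 0.6)

In the paper the decision boundary is fitted to data rather than fixed by hand; the threshold here merely stands in for that learned mapping.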

Optimal performance is achieved with a moderate sampling temperature, balancing diversity and coherence, and further gains are possible by balancing the order in which candidates are presented to the LLM, which reduces positional bias. The method works effectively with a moderate number of samples. (The authors note that LLMs were used to refine the paper's writing and generate visualizations, with careful review by the researchers.) The research is significant because it addresses the problem of unreliable LLM outputs and provides a practical method for improving their reliability. Distribution calibration is a powerful technique applicable to a wide range of LLM applications, offering robustness and generalizability across tasks. Key takeaways include the importance of considering the distribution of LLM outputs, the observation that simple averaging can be suboptimal, and the understanding that calibration complements prompt engineering as an orthogonal axis of improvement.
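The order-balancing trick can be sketched as follows: present the candidate pair in both orders across samples and remap the swapped verdicts before pooling. The judge_fn interface below is a hypothetical stand-in for the underlying LLM call.

def balanced_votes(judge_fn, response_a, response_b, n_samples=8):
    """Collect preference votes while neutralizing positional bias.

    judge_fn(first, second) -> "first", "second", or "tie" (assumed interface).
    Even-indexed samples see (a, b); odd-indexed samples see (b, a),
    and their verdicts are flipped back before aggregation.
    """
    votes = []
    for i in range(n_samples):
        if i % 2 == 0:
            verdict = judge_fn(response_a, response_b)
            votes.append({"first": "A", "second": "B", "tie": "tie"}[verdict])
        else:
            verdict = judge_fn(response_b, response_a)  # swapped order
            votes.append({"first": "B", "second": "A", "tie": "tie"}[verdict])
    return votes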

Three-Way Preference Modeling for LLM Evaluation

This study introduces a novel distribution-calibrated aggregation scheme to improve the reliability of large language models (LLMs) when used as evaluators. Researchers recognized that standard aggregation techniques, such as majority voting, struggle with tied votes and fail to fully utilize the information in multiple reasoning samples. To overcome these challenges, the team engineered a method that generates multiple independent ratings for each item, allowing a more nuanced assessment. The core of the work involves modeling three-way preferences (positive, negative, and tie votes) using a Bradley-Terry-Davidson formulation.
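Concretely, the Davidson extension of the Bradley-Terry model assigns explicit probabilities to wins, losses, and ties. Below is a minimal sketch with latent strengths pi_a, pi_b and a tie parameter nu; the paper's exact parameterization may differ.

import math

def btd_probs(pi_a, pi_b, nu):
    """Three-way outcome probabilities under Bradley-Terry-Davidson.

    pi_a, pi_b > 0 are latent strengths; nu >= 0 controls tie propensity.
    Returns (P(A wins), P(B wins), P(tie)), which sum to 1.
    """
    tie_term = nu * math.sqrt(pi_a * pi_b)
    denom = pi_a + pi_b + tie_term
    return pi_a / denom, pi_b / denom, tie_term / denom

p_win, p_loss, p_tie = btd_probs(2.0, 1.0, 0.5)  # A stronger; moderate tie rate

Setting nu = 0 recovers the standard two-way Bradley-Terry model, so the tie parameter directly encodes how decisive the judge's votes are.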

This statistical framework explicitly accounts for both the margin of preference and the overall likelihood of ties, providing a more accurate representation of the evaluation distribution. Scientists calibrated the model parameters using maximum likelihood estimation on a small calibration set, enabling the system to align with the evaluation metric and avoid discrepancies between loss functions and performance measures. Experiments across diverse benchmarks, including machine translation and reward model assessment, consistently demonstrated that this distribution-calibrated approach significantly reduces errors and increases accuracy compared to existing methods. Notably, the method achieves performance matching or exceeding that of individual human raters when evaluated against human-consensus meta-labels, revealing instances where LLM judges approach “super-human” evaluation quality. This breakthrough demonstrates that carefully allocating compute and aggregating with distribution-aware methods transforms noisy model judgments into reliable ratings for evaluation.
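A minimal sketch of that calibration step follows, assuming the judge's per-item vote shares are mapped to BTD strengths through a scale parameter s fitted jointly with the tie parameter nu by maximum likelihood. The parameterization, hook names, and toy data are illustrative assumptions, not the paper's implementation.

import math
from scipy.optimize import minimize

def btd_outcome_probs(f_a, f_b, s, nu):
    """Map a judge's vote shares (f_a, f_b) to three-way outcome
    probabilities via BTD with strengths exp(s*f_a) and exp(s*f_b).
    """
    pi_a, pi_b = math.exp(s * f_a), math.exp(s * f_b)
    tie = nu * math.sqrt(pi_a * pi_b)
    d = pi_a + pi_b + tie
    return (pi_a / d, pi_b / d, tie / d)

def neg_log_likelihood(params, calib):
    """calib: (f_a, f_b, gold) triples, gold in {0: A wins, 1: B wins, 2: tie}."""
    s, nu = params
    if nu < 0:  # the tie parameter must stay non-negative
        return float("inf")
    return -sum(math.log(max(btd_outcome_probs(f_a, f_b, s, nu)[gold], 1e-12))
                for f_a, f_b, gold in calib)

# Toy calibration set: (vote share for A, vote share for B, gold label).
calib = [(0.7, 0.2, 0), (0.1, 0.8, 1), (0.4, 0.4, 2), (0.6, 0.3, 0)]
fit = minimize(neg_log_likelihood, x0=[1.0, 0.5], args=(calib,),
               method="Nelder-Mead")
s_hat, nu_hat = fit.x  # calibrated parameters, reused at evaluation time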

Distribution-Calibrated Aggregation Improves LLM Evaluation Reliability

Scientists have developed a new distribution-calibrated aggregation scheme to improve the reliability of large language models (LLMs) when used as evaluators. The work addresses the inherent noise in single judgments by leveraging inference-time compute, generating multiple independent ratings for each item being evaluated. This research demonstrates that carefully allocating compute and aggregating results with a distribution-aware method transforms noisy individual judgments into reliable ratings. The team modeled three-way preferences (positive, negative, and tie votes) using a Bradley-Terry-Davidson formulation, which explicitly considers both the margin of preference and the likelihood of ties.

By estimating parameters on a small calibration set, the method aligns with the evaluation metric while leveraging a well-behaved probabilistic fit. Results show the approach considerably outperforms existing self-consistency methods, turning noisy model judgments into more reliable ratings. Notably, the new method matches or exceeds the performance of individual human raters when evaluated against human-consensus gold standards. In machine translation, the team adopted a consensus-based meta-evaluation to create higher-fidelity ground truths, revealing scenarios where LLM judges approach “super-human” evaluation quality. The breakthrough demonstrates that a principled aggregation approach is critical for effectively utilizing inference-time compute and achieving reliable evaluations with LLMs.
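The consensus-based meta-evaluation can be sketched simply: a gold label is kept only when a strict majority of human raters agree, yielding higher-fidelity ground truths against which the judge is scored. The majority rule below is a hypothetical stand-in for the paper's exact consensus protocol.

from collections import Counter

def consensus_label(human_labels):
    """Return the majority verdict across human raters, or None when
    no strict majority exists (such items would be filtered out).
    """
    label, count = Counter(human_labels).most_common(1)[0]
    return label if count > len(human_labels) / 2 else None

print(consensus_label(["A", "A", "tie"]))  # -> "A"
print(consensus_label(["A", "B", "tie"]))  # -> None (no consensus)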

Language Models as Reliable Preference Judges

This research demonstrates that large language models can serve as reliable judges for pairwise preferences, provided that inference-time compute is carefully allocated and aggregation methods are distribution-aware. The team developed a novel approach that combines information about both the strength and decisiveness of judgments, using a Bradley-Terry-Davidson formulation to improve accuracy and reduce errors compared with standard aggregation techniques. Results consistently show that this method achieves higher accuracy and matches or exceeds the performance of individual human raters when evaluated against human-consensus meta-labels. The study also investigated the impact of calibration set size, finding that performance plateaus after approximately 60 to 80 examples, suggesting an efficient range for training these judgment systems. Furthermore, the researchers identified distinct calibration regimes based on the interplay between the language model's voting patterns and the ground truth, revealing that some tasks transfer more readily to others. While the current work focuses on tasks with clear preferences, the team acknowledges that generalizing this framework to broader ordinal and multi-class outcomes is a promising direction for future research, and they plan to investigate methods for predicting task transferability and quantifying robustness under distribution shifts.
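The calibration-size finding suggests a simple experimental loop, sketched below under assumed fit and evaluate hooks: fit the BTD parameters on the first n items and score the remainder, expecting accuracy to flatten around n = 60-80.

def calibration_curve(items, fit_fn, eval_fn, sizes=(20, 40, 60, 80, 100)):
    """Sweep calibration-set size: fit on the first n items, evaluate on
    the rest. fit_fn and eval_fn are hypothetical hooks around the
    maximum-likelihood fit and the accuracy computation, respectively.
    """
    return {n: eval_fn(fit_fn(items[:n]), items[n:]) for n in sizes}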

👉 More information
🗞 Distribution-Calibrated Inference-Time Compute Improves Thinking LLM-as-a-Judge Pairwise Preference Accuracy
🧠 ArXiv: https://arxiv.org/abs/2512.03019

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Topology-aware Machine Learning Enables Better Graph Classification with 0.4 Gain

LLMs Enable Strategic Computation Allocation with ROI-Reasoning for Tasks under Strict Global Constraints

January 10, 2026
Lightweight Test-Time Adaptation Advances Long-Term EMG Gesture Control in Wearable Devices

January 10, 2026
Deep Learning Control Achieves Safe, Reliable Robotization for Heavy-Duty Machinery

Generalist Robots Validated with Situation Calculus and STL Falsification for Diverse Operations

January 10, 2026