Assessing the capabilities of large language models presents a significant challenge, particularly when evaluating complex reasoning and professional expertise, and current benchmarks often focus on narrow tasks. Zhilin Wang, Jaehun Jung, and Ximing Lu, along with colleagues at NVIDIA, address this limitation by introducing ProfBench, a new benchmark comprising over 7,000 response-criterion pairs judged by experts holding advanced degrees in fields such as physics, chemistry, finance, and consulting. This work establishes a robust and affordable method for evaluating LLMs on tasks requiring professional knowledge, dramatically reducing evaluation costs while mitigating inherent biases. It also reveals that even the most advanced models, such as GPT-5-high, struggle with these complex challenges, achieving only around 65.9% overall performance. The findings highlight performance differences between proprietary and open-weight models, demonstrate the importance of extended reasoning for professional-domain tasks, and offer valuable insight into the ongoing development of increasingly capable artificial intelligence.
Realistic LLM Evaluation With ProfBench
This work details the creation and analysis of ProfBench, a challenging benchmark designed to rigorously evaluate large language models (LLMs). The benchmark moves beyond simple accuracy metrics to assess complex tasks requiring reasoning and knowledge application, addressing limitations in existing evaluations and providing a more nuanced understanding of LLM capabilities. Researchers also explored methods to optimize evaluation costs and reduce variance in performance estimates. ProfBench includes a diverse range of tasks, aiming for broad coverage of LLM capabilities and simulating real-world information retrieval scenarios.
The dataset is substantial, with each task potentially involving multiple documents, and the authors meticulously tracked the costs of data collection, annotation, and LLM inference. Tasks were first annotated by human experts to establish ground truth; the authors then explored using LLMs as judges to reduce annotation costs, finding reasonable agreement between LLM judges and human graders. They also investigated strategies to minimize evaluation costs, including dynamically allocating LLM generations per task based on its difficulty and variance, which reduced the variance of performance estimates. Together, these results suggest that LLM judges can stand in for human graders and enable cost-effective evaluation.
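The summary above omits the allocation formula, but the idea can be sketched in a few lines of Python: under a fixed generation budget, give tasks with noisier score estimates more repeated generations. The proportional-to-standard-deviation rule and all names below are illustrative assumptions, not the authors' exact scheme.

```python
import math

def allocate_generations(task_variances, total_budget, min_per_task=1):
    """Distribute a fixed budget of LLM generations across tasks,
    giving noisier (higher-variance) tasks more repeated samples.

    task_variances: dict mapping task_id -> estimated score variance
                    (e.g., from a small number of pilot generations).
    total_budget:   total number of generations available.
    """
    # Proportional-to-standard-deviation allocation: tasks whose scores
    # fluctuate more get more repeated generations.
    stds = {t: math.sqrt(v) for t, v in task_variances.items()}
    total_std = sum(stds.values()) or 1.0

    allocation = {}
    for task_id, std in stds.items():
        share = std / total_std
        # Rounding and the per-task minimum mean the total may drift
        # slightly from the nominal budget.
        allocation[task_id] = max(min_per_task, round(share * total_budget))
    return allocation

# Example: a stable task gets fewer generations than a noisy one.
print(allocate_generations({"task_a": 0.01, "task_b": 0.25}, total_budget=12))
```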
The authors demonstrated how to optimize evaluation costs by strategically allocating resources, and dynamic allocation of LLM generations significantly reduced performance variance, leading to more reliable results. OpenAI models consistently performed well, often appearing on the Pareto frontier, while Gemini-2 and Qwen3 models also showed strong performance. Finance and Physics tasks proved more challenging for LLMs than Consulting and Chemistry tasks. ProfBench highlights the importance of creating challenging and realistic benchmarks for LLM evaluation, and demonstrates that cost-effective evaluation is possible through LLM-based judging and dynamic resource allocation. Reducing performance variance is crucial for obtaining reliable and meaningful results.
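For readers unfamiliar with the cost-accuracy framing, the following sketch shows one way to identify models on a Pareto frontier, i.e., models for which no alternative is simultaneously cheaper and stronger. The model names, costs, and scores are placeholders, not figures from ProfBench.

```python
def pareto_frontier(models):
    """Return models not dominated on (evaluation cost, score): a model is
    dominated if another is at least as cheap AND at least as accurate,
    and strictly better on one of the two."""
    frontier = []
    for name, cost, score in models:
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for n, c, s in models if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Placeholder values (cost per benchmark run, score in %) -- illustrative only.
candidates = [
    ("model_a", 40.0, 62.0),
    ("model_b", 10.0, 55.0),
    ("model_c", 45.0, 58.0),   # dominated by model_a: costlier and weaker
]
print(pareto_frontier(candidates))   # ['model_a', 'model_b']
```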
Professional Task Performance Benchmark for Large Models
This study introduces ProfBench, a benchmark of 7,347 response-criterion pairs, to rigorously evaluate large language models on professional-level tasks spanning Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA domains. The researchers recruited 38 annotators from eight countries, verifying both their domain expertise and their understanding of the annotation task; a significant proportion hold PhDs or MBAs with substantial post-graduation work experience. Annotators dedicated considerable time to each task, crafting prompts designed to challenge even state-of-the-art models and mirroring the kind of complex request a professional might delegate to a junior colleague, one that often results in a multi-page report. To ensure benchmark quality, the team implemented a multi-stage annotation process, beginning with prompt curation and progressing to rubric creation and response annotation.
Annotators developed between 15 and 60 criteria per task, each independently usable for scoring responses and collectively capturing response quality. Reviewers provided feedback and recommendations on each criterion, resulting in substantial improvements to the initial proposals. Responses were generated using three leading models, representing both proprietary and open-weight architectures, and subsequently scored against the established rubrics by the expert annotators. This rigorous methodology enables a nuanced assessment of LLM capabilities in complex, professional domains.
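To make the rubric mechanics concrete, here is a minimal sketch of how a set of independently checkable, weighted criteria could be combined into a single response score. The schema, weights, and example criteria are assumptions for illustration, not the paper's exact format.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # what the response must contain or get right
    weight: float      # relative importance within the rubric

def rubric_score(criteria, verdicts):
    """Combine per-criterion pass/fail verdicts into one score in [0, 1].

    criteria: list of Criterion objects for a single task (e.g., 15-60 items).
    verdicts: list of booleans, one per criterion, from a human expert
              or an LLM judge.
    """
    total_weight = sum(c.weight for c in criteria)
    earned = sum(c.weight for c, passed in zip(criteria, verdicts) if passed)
    return earned / total_weight if total_weight else 0.0

# Illustrative rubric fragment for a finance-style report task.
rubric = [
    Criterion("States the correct discount rate assumption", weight=3.0),
    Criterion("Cites figures from the provided documents", weight=2.0),
    Criterion("Report is organized into clearly labeled sections", weight=1.0),
]
print(rubric_score(rubric, [True, False, True]))  # 4.0 / 6.0 ≈ 0.67
```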
Expert Benchmark Reveals LLM Professional Task Limits
The research team introduces ProfBench, a new benchmark comprising over 7,000 response-criterion pairs, to rigorously evaluate large language models (LLMs) on professional-level tasks. This benchmark distinguishes itself by utilizing expert human annotators, holding PhDs in Chemistry and Physics, or MBAs in Finance and Consulting, to create both the tasks and the scoring rubrics, ensuring a high level of domain expertise. The work addresses a critical gap in LLM evaluation, moving beyond simple question-answering to assess performance on complex, multi-page report-style tasks mirroring real-world professional scenarios. Experiments reveal that even the most advanced LLMs, including a high-performing version of GPT-5, face significant challenges with ProfBench, achieving an overall performance of only 65.9%.
The team meticulously analyzed the performance of over 40 models, examining differences between open-weight and proprietary models, as well as the impact of model size and reasoning capabilities. Data shows a strong emphasis on reasoning within the rubrics, accounting for a significant proportion of the evaluation criteria, with information extraction also playing a key role. To ensure fair and accessible evaluation, the researchers developed LLM-Judges, designed to mitigate self-enhancement bias and dramatically reduce evaluation costs. These judges enable benchmark runs with minimal bias, at a significantly reduced cost compared to existing rubric-based evaluations. Annotators with substantial post-graduate experience were recruited from eight countries, ensuring a high standard of professional expertise in the creation of tasks and rubrics. The team deliberately curated tasks requiring multi-page reports, simulating the complexity of real-world professional assignments and pushing the boundaries of current LLM capabilities.
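A rubric-based LLM judge can be pictured as a loop that asks a grading model about one criterion at a time and parses a binary verdict. The sketch below assumes a generic `call_judge_model` callable standing in for whatever LLM API is used; the prompt wording is illustrative and not taken from the paper.

```python
def judge_criterion(call_judge_model, criterion, response):
    """Ask a judge LLM whether a response satisfies one rubric criterion.

    call_judge_model: a function taking a prompt string and returning the
                      judge model's text output (stand-in for any LLM API).
    Returns True if the judge answers YES, False otherwise.
    """
    prompt = (
        "You are grading a professional report against one rubric criterion.\n"
        f"Criterion: {criterion}\n\n"
        f"Response:\n{response}\n\n"
        "Does the response satisfy the criterion? Answer YES or NO."
    )
    verdict = call_judge_model(prompt).strip().upper()
    return verdict.startswith("YES")

# Trivial usage with a dummy "judge" for demonstration purposes.
print(judge_criterion(lambda prompt: "YES",
                      "Mentions the reaction mechanism", "..."))
```

To limit self-enhancement bias, the judge model would be chosen to be different from, or explicitly checked for bias toward, the model that produced the response being graded.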
LLM Expertise Evaluated With ProfBench Benchmark
The research team has introduced ProfBench, a new benchmark designed to rigorously evaluate large language models (LLMs) on tasks requiring professional-level expertise. Unlike existing benchmarks focused on exam-style questions with concise answers, ProfBench assesses LLMs’ ability to process complex information and generate comprehensive reports across fields such as physics, chemistry, finance, and consulting. The benchmark consists of over 7,000 response-criterion pairs, each evaluated by human experts possessing relevant professional knowledge. A key achievement of this work is the development of LLM-Judges, a system for evaluating responses that significantly reduces both bias and evaluation costs.
This approach lowers the barriers to assessing LLM performance, making it more accessible to a wider range of researchers and developers. Results demonstrate that even state-of-the-art models, like a high-performing version of GPT-5, achieve only 65.9% overall performance on ProfBench, highlighting the substantial challenges that remain in building LLMs capable of professional-level reasoning. The team also observed performance differences between proprietary and open-weight models and found that incorporating external search capabilities can improve results. The authors acknowledge that retrieving relevant documents to augment LLM responses remains an avenue for further work.
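The 65.9% figure is an aggregate over domains and criteria; one simple way to picture such an aggregate is a macro-average of per-domain rubric scores, as sketched below. Equal domain weighting and the example numbers are assumptions for illustration, not the paper's exact aggregation.

```python
def overall_performance(per_domain_scores):
    """Macro-average per-domain rubric scores into one overall number.

    per_domain_scores: dict mapping domain name -> mean rubric score (0-100).
    Averaging domains equally is an assumption; the paper may weight
    domains or criteria differently.
    """
    return sum(per_domain_scores.values()) / len(per_domain_scores)

# Illustrative (made-up) per-domain scores for a single model.
scores = {"Physics": 60.0, "Chemistry": 70.0, "Finance": 58.0, "Consulting": 72.0}
print(f"Overall: {overall_performance(scores):.1f}%")  # Overall: 65.0%
```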
👉 More information
🗞 ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
🧠 ArXiv: https://arxiv.org/abs/2510.18941
