Scientists are increasingly interested in assessing whether artificial intelligence can replicate complex, expert-level scientific tasks. Miles Wang, Robi Lin, and Kat Hu, all from OpenAI, alongside Joy Jiao, Neil Chowdhury, Ethan Chang, Tejal Patwardhan, and colleagues, present FrontierScience, a new benchmark designed to evaluate this capability. Unlike existing benchmarks often limited to multiple-choice questions or readily available data, FrontierScience challenges models with both international olympiad-level problems and open-ended, PhD-level research sub-tasks spanning physics and biology. This research is significant because it moves beyond simple knowledge recall to test a model’s ability to reason, problem-solve, and even contribute to the scientific process, utilising a detailed rubric-based evaluation framework to assess performance beyond final answers.
FrontierScience benchmark challenges advanced scientific reasoning capabilities
This comprehensive benchmark contains several hundred questions, including a publicly available gold set of 160, spanning diverse fields within physics, chemistry, and biology, from quantum electrodynamics to synthetic organic chemistry. The Olympiad problems were meticulously crafted to assess precise problem-solving skills in a constrained format, while the Research problems represent authentic research sub-tasks, authored and verified by PhD scientists (doctoral candidates, postdoctoral researchers, and professors), to reflect the complexities of ongoing scientific inquiry. The development of FrontierScience signifies a crucial step towards accurately measuring and forecasting the potential of AI to accelerate scientific discovery.
These expert authors, actively engaged in research at globally recognised institutions, ensured the originality and difficulty of each problem. Olympiad questions were designed to mimic the complex reasoning tasks found in international competitions, while Research problems were crafted to represent authentic sub-problems encountered during PhD-level research, typically requiring three to five hours for successful completion. This dual approach allows for a broader diagnosis of model strengths and weaknesses in expert-level scientific reasoning than previously available benchmarks. By pushing the boundaries of current language models, the benchmark also identifies areas where further advancement is needed before AI can truly assist researchers in tackling complex scientific challenges.
Olympiad and PhD-level problem construction in detail
To create the Research track, scientists authored PhD-level, open-ended problems mirroring sub-tasks encountered in active scientific research. Over 200 Research questions were written and verified by PhD scientists, encompassing diverse fields like quantum mechanics, biophysics, and organic chemistry. The rubric-based evaluation framework moves beyond assessing only final answers, instead evaluating model capabilities throughout the entire problem-solving process using objectively assessable criteria totalling 10 points per question. The study employed a four-stage task creation process (Creation, Review, Resolution, and Revision) to ensure quality and adherence to guidelines.
Independent experts reviewed each other’s tasks to verify originality, difficulty, and verifiability. Problems were designed to be novel, even if inspired by existing scientific concepts, and preliminary questions were rigorously tested against internal OpenAI models; questions answered correctly by these models were either discarded or significantly modified. The team calibrated difficulty, aiming for a successful solution requiring 7-8 points on the rubric. Verification involved a multi-tiered review process, with each problem undergoing assessment by at least one, and often two, peer domain experts.
Disagreements between writers and reviewers were resolved through consensus or question removal. From an initial pool of over 500 Olympiad and 200 Research questions, a meta-review process yielded an open-sourced gold set of 100 Olympiad and 60 Research questions, with the remaining questions reserved for contamination tracking. This rigorous methodology enables a more nuanced and comprehensive assessment of scientific reasoning capabilities in language models than previously possible.
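To make the rubric mechanics concrete, the sketch below shows one way a 10-point, objectively assessable rubric and its seven-point correctness threshold could be represented and scored. It is a minimal illustration, not the authors’ implementation; the class names and example criteria are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical representation of a rubric-graded question: each criterion is
# objectively checkable and worth a fixed number of points, summing to 10.
@dataclass
class Criterion:
    description: str
    points: int

@dataclass
class RubricQuestion:
    prompt: str
    criteria: list[Criterion]      # point values should total 10
    success_threshold: int = 7     # correctness = at least 7 of the 10 points

def score_response(question: RubricQuestion, satisfied: list[bool]) -> tuple[int, bool]:
    """Sum the points for satisfied criteria and check the success threshold."""
    earned = sum(c.points for c, ok in zip(question.criteria, satisfied) if ok)
    return earned, earned >= question.success_threshold

# Toy example: a 10-point rubric with three criteria, two of which are met.
q = RubricQuestion(
    prompt="Derive the scattering amplitude for ...",
    criteria=[
        Criterion("Sets up the correct governing equations", 3),
        Criterion("Carries the derivation through without algebraic errors", 4),
        Criterion("States the final result with correct units and limits", 3),
    ],
)
print(score_response(q, [True, True, False]))  # (7, True)
```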
Model performance on the FrontierScience benchmark
On the 160-question gold set, GPT-5 surprisingly outperformed GPT-5.1 on the Research track and achieved parity with GPT-5.2 overall. Tests employed a model-based judge, GPT-5 with “high” reasoning effort, to evaluate responses, averaging scores across 20 trials per Olympiad question and 30 trials per Research question, with a response counted as correct when it earned at least seven rubric points.
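The trial-averaging procedure can be sketched in a few lines. In the snippet below, `generate` and `judge` are placeholders standing in for the evaluated model and the GPT-5 judge; only the aggregation logic (repeat, grade against the rubric, count trials reaching seven points) reflects the setup described above.

```python
from statistics import mean
from typing import Callable

def pass_rate(
    question: dict,
    generate: Callable[[dict], str],    # placeholder: samples a model response
    judge: Callable[[dict, str], int],  # placeholder: returns rubric points (0-10)
    n_trials: int = 30,                 # 20 for Olympiad, 30 for Research questions
    threshold: int = 7,                 # correctness = at least 7 rubric points
) -> float:
    """Fraction of independent trials whose judged score meets the threshold."""
    outcomes = []
    for _ in range(n_trials):
        response = generate(question)
        points = judge(question, response)
        outcomes.append(points >= threshold)
    return mean(outcomes)
```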
Analysing the transcripts, the researchers found that models commonly struggled with reasoning errors, failures in understanding specialised concepts, calculation mistakes, and factual inaccuracies. Increasing test-time tokens improved GPT-5.2’s performance, raising its Olympiad score from 67.5% to 77.1% and its Research score from 18% to 25%. The benchmark thus delivers a more robust evaluation of scientific reasoning, paving the way for further advances in AI’s ability to contribute to complex scientific challenges and potentially accelerating scientific progress.
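For context, restating those gains as relative improvements shows that the Research track benefits proportionally more from extra test-time tokens even though its absolute gain is smaller; the quick calculation below uses only the figures quoted above.

```python
# Relative improvements implied by the reported token-scaling results.
scores = [
    ("Olympiad", 67.5, 77.1),  # score (%) at lower vs higher token budget
    ("Research", 18.0, 25.0),
]
for track, low, high in scores:
    print(f"{track}: +{high - low:.1f} points absolute, "
          f"{(high / low - 1) * 100:.0f}% relative gain")
# Olympiad: +9.6 points absolute, 14% relative gain
# Research: +7.0 points absolute, 39% relative gain
```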
Limitations and future directions for FrontierScience
This benchmark addresses limitations in existing evaluations by focusing on complex problems representative of both competitive scientific olympiads and authentic PhD-level research tasks. FrontierScience comprises two tracks: Olympiad, featuring challenging problems from physics, chemistry, and biology international olympiads, and Research, consisting of open-ended problems mirroring sub-tasks encountered in scientific research. Increasing the number of tokens used during testing improved performance on both tracks, particularly for the Olympiad set. However, the authors acknowledge several limitations of the benchmark.
The current format focuses on constrained problem-solving rather than the open-ended ideation crucial to scientific discovery. The rubric-based evaluation, while improved through verification, remains less objective than assessments based on definitive answers. Furthermore, the benchmark is limited to text-only problems, excluding the multimodal aspects often present in real scientific work. Future research should include human baselining to establish a comparative standard and explore the incorporation of diverse modalities, such as image and video analysis, to better reflect the complexities of scientific research.
👉 More information
🗞 FrontierScience: Evaluating AI’s Ability to Perform Expert-Level Scientific Tasks
🧠 ArXiv: https://arxiv.org/abs/2601.21165
