Anthropic Predicts 4 Ways AI Will Advance Scientific Discovery

Anthropic is reporting on four key advances in how artificial intelligence is aiding scientific discovery, beginning with a new benchmark designed to rigorously assess AI’s bioinformatics capabilities. Called BioMysteryBench, the evaluation tasks Claude with analyzing real-world datasets, pushing beyond standard question-answering formats to mirror the complex workflows of actual scientific research. This shift reflects a broader move away from benchmarks like MMLU-Pro and GPQA toward evaluations incorporating agentic tool use, reading papers, coding, and even designing experiments. “Science is challenging, and so is evaluating it,” notes a researcher on the discovery team, highlighting the difficulty of establishing standardized tests for scientific competence, even for human experts. The team found that the latest generations of Claude not only perform on par with a panel of human experts, but sometimes solve problems the experts could not, using strategies very different from the experts’ own.

In biology, there are many different “right” ways to do something. If there were only one right way to answer a research question, PhD students would earn their degrees in a matter of months, corporate R&D departments wouldn’t exist, and no science fair poster would need a “Methods” section.

Even within a chosen research direction, individual decisions can be highly subjective; one scientist may approve of a decision, while another researcher may have serious objections.

Just ask any frustrated author who’s gotten conflicting suggestions from a round of peer review! This difficulty is compounded by the fact that biological datasets are often noisy enough that small differences in research decisions can lead to entirely different conclusions about the data. The decade-long search for predictors of metformin response is a case in point: slight differences in study design have produced contradictory findings. A 2011 paper reported a variant predicting metformin response that replicated in two cohorts, with a plausible mechanism involving AMPK activation.

Brianna, a researcher focused on discovery, is currently spearheading efforts to evaluate Claude’s capabilities in bioinformatics using a novel benchmark called BioMysteryBench. This initiative arrives as assessment of large language models expands beyond traditional metrics like bar exam scores or Olympiad-level mathematics, instead focusing on specialized scientific domains. The development of BioMysteryBench signals a deliberate shift towards evaluating AI’s potential to tackle genuinely unsolved problems in biology, recognizing that the most impactful contributions may lie in areas where human expertise reaches its limits. There are many biological questions that humans cannot answer yet, and researchers are increasingly focused on identifying those very challenges as prime targets for artificial intelligence. Machine learning has already demonstrated success in areas where humans struggle, such as sequence prediction and protein modeling, largely by leveraging extensive experimental data rather than relying solely on expert intuition.

Benchmarks like ProteinGym and the long-running CASP competition exemplify this approach, grounding evaluations in experimental measurements that no human would attempt to replicate independently. However, these existing benchmarks often focus on narrow tasks and fail to capture the full scope of bioinformatics work. BioMysteryBench aims to address this gap by presenting models with messy, real-world data while maintaining rigorous evaluation standards. The benchmark tasks Claude with questions crafted by domain experts, each derived from a dataset with controlled, objective properties, rather than subjective scientific conclusions. This design allows for the creation of questions that, while verifiable, may not be readily solvable by humans. Claude is tasked with these questions within a container equipped with standard bioinformatics tools, the ability to install additional software, and access to essential databases like NCBI and Ensembl. A key feature of BioMysteryBench is its method-agnostic approach, granting Claude considerable freedom in selecting tools and strategies.

Evaluations are based solely on the final answer, rather than the analytical path taken, rewarding correct biological conclusions regardless of the method employed. The benchmark also includes a set of questions specifically designed to be difficult, or even impossible, for humans to solve. After rigorous quality control, 23 such questions remained, and current models solved many problems that a panel of human experts could not, sometimes using very different strategies.
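To make that scoring model concrete, here is a minimal sketch of what answer-only grading can look like. Everything in it is hypothetical: the Task structure, the grade and evaluate helpers, and the run_agent callable are illustrative stand-ins rather than BioMysteryBench’s actual code, but they capture the key property that only the final conclusion is checked, never the analytical path.

```python
# Minimal sketch of an answer-only grading loop. Names here (Task, grade,
# evaluate, run_agent) are hypothetical stand-ins, not BioMysteryBench's API.
from dataclasses import dataclass

@dataclass
class Task:
    question: str      # expert-written question about the dataset
    dataset_path: str  # messy real-world data mounted into the container
    answer: str        # objective ground truth from a validation notebook

def normalize(text: str) -> str:
    """Collapse case and whitespace so formatting differences don't fail a task."""
    return " ".join(text.lower().split())

def grade(task: Task, model_answer: str) -> bool:
    # Only the final answer is checked; the analytical path is ignored,
    # so any tool or strategy that reaches the right conclusion scores.
    return normalize(model_answer) == normalize(task.answer)

def evaluate(tasks: list[Task], run_agent) -> float:
    """run_agent(task) stands in for the agent loop inside the container."""
    solved = sum(grade(t, run_agent(t)) for t in tasks)
    return solved / len(tasks)
```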

Analyzing transcripts revealed two primary strategies employed by Claude: leveraging its vast knowledge base accumulated from hundreds of thousands of papers, and layering multiple methods when uncertain, combining different lines of evidence to reach a conclusion. “Often, this allowed Claude to solve human-unsolvable tasks!” the researchers noted, highlighting instances where Claude directly combined internal knowledge with live analysis to bypass the need for time-consuming meta-analyses or database stitching. While acknowledging the limitations of evaluating tasks that remain unsolved by both humans and models, the team emphasizes that the validation notebooks help ensure the signal exists within the data, even if discovering it proves exceptionally difficult. “So we ask both our models and our human benchmarkers not to be too frustrated if, a year from now, no one has solved the human-difficult set.”
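The “layering” strategy lends itself to a simple illustration. The sketch below uses entirely hypothetical stand-in functions (prior_from_literature, call_from_data) to show the pattern the transcripts describe: treat recalled knowledge and live analysis as independent lines of evidence, and only commit to a conclusion when they agree.

```python
# Illustrative sketch of layering two independent lines of evidence.
# Both evidence functions are hypothetical stand-ins for illustration only.
from typing import Optional

def prior_from_literature(gene: str) -> Optional[str]:
    """Stand-in for knowledge recalled from training (papers, databases)."""
    known = {"MT-CO1": "mitochondrial"}  # made-up lookup
    return known.get(gene)

def call_from_data(gene: str, coverage: float) -> Optional[str]:
    """Stand-in for a live analysis run over the dataset in the container."""
    return "mitochondrial" if coverage > 100.0 else None

def combined_call(gene: str, coverage: float) -> Optional[str]:
    a, b = prior_from_literature(gene), call_from_data(gene, coverage)
    # Commit only when both independent lines of evidence agree;
    # otherwise stay uncertain rather than guess.
    return a if a is not None and a == b else None

print(combined_call("MT-CO1", 250.0))  # 'mitochondrial'
print(combined_call("MT-CO1", 10.0))   # None (evidence disagrees)
```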

Almost as soon as large language models could hold a conversation, people started asking how they’d stack up against human experts.

[Figure: On the human-solvable set (left), all three models are strongly bimodal; problems are almost always solved either every time or never.]

Recent benchmarking with BioMysteryBench reveals a striking pattern in the performance of leading language models like Claude: when presented with problems humans can solve, these models exhibit strongly bimodal behavior. A given problem is typically solved either consistently across multiple attempts or not at all, suggesting a clear distinction between retained knowledge and guesswork. This contrasts sharply with performance on more challenging tasks, where success becomes far more erratic. Researchers discovered this dichotomy while assessing Claude’s capabilities in bioinformatics, a field demanding specialized knowledge beyond the scope of general benchmarks like bar exams or Olympiad math. On the “human-solvable” set of problems, current models perform on par with human experts, and the latest generations solved many problems that a panel of human experts could not, sometimes using very different strategies.
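For readers who want to see what “strongly bimodal” means operationally, the sketch below uses made-up trial data to show one way the pattern can be measured: run each problem several times, compute per-problem solve rates, and bucket the results. A strongly bimodal model leaves the in-between bucket nearly empty.

```python
# Sketch of measuring bimodality from repeated attempts; trial data is made up.
from collections import Counter

def solve_rates(results: dict[str, list[bool]]) -> dict[str, float]:
    """results maps problem id -> pass/fail outcomes across repeated attempts."""
    return {pid: sum(runs) / len(runs) for pid, runs in results.items()}

def bucket(rates: dict[str, float]) -> Counter:
    def label(r: float) -> str:
        if r == 0.0:
            return "never solved"
        if r == 1.0:
            return "always solved"
        return "sometimes solved"
    return Counter(label(r) for r in rates.values())

trials = {
    "q1": [True, True, True, True],      # solved every time
    "q2": [False, False, False, False],  # never solved
    "q3": [True, True, True, False],     # the rare in-between case
}
print(bucket(solve_rates(trials)))
# Counter({'always solved': 1, 'never solved': 1, 'sometimes solved': 1})
```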

This observation is not merely about an accuracy gap; it reveals a fundamental difference in how the models arrive at answers. Experts anticipate that this focus on reliability will become increasingly important as AI tools are integrated into real-world scientific workflows, where consistent, reproducible results are paramount. This pattern continues to be observed in newer generations of models. The team’s deeper dive into reliability, while described as “a little…boring,” underscores the importance of this metric in evaluating model performance. “It added some nuance to the performance analysis we showed above, but did not fundamentally tackle a new question,” the researcher noted. Despite this, the findings suggest that models are beginning to demonstrate a nascent “research taste,” hinting at the potential for more sophisticated scientific reasoning in the future.

Both are reasonable directions, and how you proceed will often just depend on expertise and resources.

The Neuron

With a keen intuition for emerging technologies, The Neuron brings over 5 years of deep expertise to the AI conversation. Coming from roots in software engineering, they've witnessed firsthand the transformation from traditional computing paradigms to today's ML-powered landscape. Their hands-on experience implementing neural networks and deep learning systems for Fortune 500 companies has provided unique insights that few tech writers possess. From developing recommendation engines that drive billions in revenue to optimizing computer vision systems for manufacturing giants, The Neuron doesn't just write about machine learning; they've shaped its real-world applications across industries. Having built real systems used by millions of users around the globe, that deep technological base informs their writing on technologies current and future, whether AI or quantum computing.
