LLMs Show 69% Cell Culture Success for Novices

Researchers investigated whether large language models (LLMs) can enhance novice performance in practical biology, a critical consideration given the models' increasing proficiency on biological benchmarks. The study was led by Shen Zhou Hong, Alex Kleinman, and Alyssa Mathiowetz of Active Site in Cambridge, MA, United States, working with collaborators including independent researcher Adam Howes. Their pre-registered, randomised controlled trial of 153 participants assessed whether LLM assistance improved completion rates in tasks modelling a viral reverse genetics workflow. The study found no significant overall improvement in workflow completion, but it did identify numerically higher success rates in several individual tasks, particularly cell culture, suggesting a modest performance benefit from LLM support. These findings highlight a discrepancy between LLM performance on simulated benchmarks and real-world application, underscoring the importance of robust physical-world validation when assessing the biosecurity implications of increasingly capable AI models.

Can readily available artificial intelligence actually help someone with no training perform complex laboratory work? According to a new trial, current language models do not markedly improve a novice's ability to complete such procedures. However, assistance appears to offer a small advantage on certain steps, suggesting a potential role for carefully designed AI support in practical biology.

Scientists are increasingly attentive to the capabilities of large language models (LLMs) in biological domains, particularly concerning their potential to aid in the acquisition of dual-use laboratory skills. A critical question remains: does demonstrated proficiency on digital benchmarks actually translate into improved performance within a functioning laboratory setting?

Recent advances in LLMs have shown promise in areas like protocol development and biological knowledge retrieval, prompting concerns about potential misuse by individuals with limited expertise. Existing evaluations often rely on structured, digital environments that fail to capture the complexities of hands-on experimentation and the tacit knowledge required for successful lab work.

A pre-registered, investigator-blinded, randomised controlled trial involving 153 novice participants has begun to address this gap. Conducted between June and August 2025, the study assessed whether access to LLMs could enhance performance in a series of tasks designed to replicate a viral reverse genetics workflow. Initial observations revealed no overall difference in workflow completion rates between those using LLMs and those relying on standard internet resources.

A closer examination of individual task success rates indicated a trend toward improved performance in the LLM group, particularly in the demanding area of cell culture. The research team employed Bayesian modelling to estimate the potential impact of LLM assistance on a “typical” reverse genetics task, suggesting an approximate 1.4-fold increase in success.

Ordinal regression modelling indicated that participants utilising LLMs were more likely to progress through intermediate steps across all tasks, demonstrating a potential benefit beyond simply achieving a final outcome. These findings highlight a disconnect between performance on artificial benchmarks and real-world application, suggesting that thorough physical-world validation is essential for assessing the true biosecurity implications of increasingly powerful AI tools.

The investigation moved to a functioning biosafety level-2 laboratory where participants, all with minimal prior experience, independently attempted a series of five tasks. For eight weeks, these novices worked through a reverse genetics workflow, a complex procedure for creating a virus from its genetic sequence. At the study’s outset, participants underwent safety training and familiarization with the available tools, before being randomly assigned to either a control group with access to the internet or an intervention group equipped with access to leading LLMs from companies like Anthropic, OpenAI, and Google DeepMind.

Researchers aimed to isolate the impact of LLM assistance on task performance by restricting access to other resources. The study design went beyond simply measuring completion rates. Researchers carefully tracked attempts and timestamps for each procedural step, alongside detailed engagement data from LLM chat logs and internet searches. Participants also completed surveys to assess their motivation, perceived usefulness of the LLMs, and baseline demographics.

This multi-faceted approach allowed for a nuanced evaluation of how LLMs were actually used and whether they contributed to improved performance at each stage of the workflow. The team also analysed the ability of participants to navigate intermediate steps, providing insights into the development of practical skills. Current biological benchmarks often prioritize factual knowledge and short-horizon tasks, and may not accurately reflect the challenges of real-world laboratory work.

Unlike these digital assessments, the present study required participants to perform hands-on procedures, demanding somatic tacit knowledge and the ability to adapt to unforeseen circumstances. The focus shifted from evaluating what an LLM knows to understanding how it can uplift a novice user's performance in a complex, long-term task. Data analysis revealed that while LLMs did not substantially increase overall workflow completion, they were associated with a modest but consistent performance benefit across multiple tasks.

For example, success rates in cell culture, a notoriously difficult technique, were numerically higher in the LLM group, reaching 68.8% compared to 55.3% in the internet-only group. Post-hoc Bayesian modelling estimated an approximate 1.4-fold increase in success for a typical reverse genetics task when LLM assistance was available. This suggests that LLMs can provide meaningful support to novice researchers, potentially reducing errors and accelerating learning. The study's findings underscore the need for rigorous, physical-world validation of AI biosecurity assessments as model capabilities and user proficiency continue to evolve.

Large language models show potential for incremental gains in complex biological workflows

Overall workflow completion was low in both arms: 5.2% for participants utilising large language models versus 6.6% for those relying on internet resources, a difference that was not statistically significant (P = 0.759). Although the primary endpoint showed no substantial benefit from LLM assistance, analysis of individual tasks revealed a trend towards improved performance in the LLM group.
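As a rough sanity check, the gap between two completion rates like these can be examined with a simple pooled two-proportion z-test. The group sizes below are hypothetical (assuming a roughly even split of the 153 participants); the article reports only percentages and P-values, and the study likely used a different test, so this sketch will not reproduce its exact statistics.

```python
from math import erf, sqrt

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled two-proportion z-test.

    Returns the z statistic and an approximate two-sided p-value
    computed from the standard normal CDF.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts consistent with ~5.2% (LLM) vs ~6.6% (internet):
z, p = two_proportion_z(4, 77, 5, 76)
print(f"z = {z:.2f}, p = {p:.2f}")  # a small, non-significant difference
```

With completion events this rare, the test has very little power, which is one reason a null result on the primary endpoint says little on its own.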

Specifically, cell culture success reached 68.8% with LLM support, compared to 55.3% using internet resources, approaching statistical significance (P = 0.059). Further investigation using post-hoc Bayesian modelling estimated approximately a 1.4-fold increase (95% credible interval 0.74 to 2.62) in success for a typical reverse genetics task when participants had access to LLMs, although this interval includes the possibility of no effect.
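The paper's Bayesian model is hierarchical and task-level; as an illustration only, a much simpler Beta-Binomial Monte Carlo sketch shows how a posterior fold-change in success rates can be estimated from two arms. The counts are hypothetical (chosen to roughly match the reported cell-culture percentages) and the flat Beta(1, 1) priors are an assumption, so the numbers will not match the paper's 1.4-fold estimate.

```python
import random

def posterior_fold_change(s_a, n_a, s_b, n_b, draws=20000, seed=0):
    """Monte Carlo posterior of the success-rate ratio p_a / p_b
    under independent Beta(1, 1) priors on each arm."""
    rng = random.Random(seed)
    ratios = sorted(
        rng.betavariate(1 + s_a, 1 + n_a - s_a)
        / rng.betavariate(1 + s_b, 1 + n_b - s_b)
        for _ in range(draws)
    )
    median = ratios[draws // 2]
    lo = ratios[int(0.025 * draws)]
    hi = ratios[int(0.975 * draws)]
    return median, lo, hi

# Hypothetical counts consistent with 68.8% vs 55.3% cell-culture success:
median, lo, hi = posterior_fold_change(53, 77, 42, 76)
print(f"fold change ~ {median:.2f} (95% CrI {lo:.2f} to {hi:.2f})")
```

Like the paper's interval, a credible interval from counts of this size typically straddles 1, which is why the authors describe the uplift as suggestive rather than conclusive.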

Ordinal regression modelling provided additional insight into the impact of LLM assistance. Participants in the LLM arm exhibited a greater likelihood of progressing through intermediate steps across all tasks, with a posterior probability ranging from 95% to 99%.

Limited transfer of AI reasoning skills hinders laboratory workflow improvements

Scientists have long recognised the potential for artificial intelligence to reshape scientific work, yet translating that potential into tangible gains in the laboratory has proven surprisingly difficult. This latest research, evaluating large language models (LLMs) as assistants for novice biologists, reveals a familiar pattern: impressive performance on simulated tasks does not readily transfer to real-world improvement.

While LLMs did not deliver a dramatic boost to completing a complex viral engineering workflow, modest gains in specific areas, such as cell culture, suggest a more nuanced picture than simple failure. The observed trend towards better performance on intermediate steps is a valuable finding, hinting at where these tools might offer the most immediate benefit.

The gap between in-silico prowess and practical application remains a central challenge for AI in the life sciences. For years, benchmarks have shown AI systems mastering biological reasoning, but this work demonstrates that reasoning skills alone are insufficient to overcome the inherent difficulties of hands-on experimentation. Laboratory work demands physical dexterity, careful observation, and the ability to troubleshoot unexpected problems, qualities that current LLMs simply do not possess.

Attention must turn to understanding how to best integrate these tools into existing workflows, perhaps as sophisticated guides rather than autonomous agents. The modest improvements observed deserve further investigation. Researchers may achieve more substantial gains by refining the prompts and training data used by these models. A broader effort is needed to develop more realistic and thorough benchmarks that accurately reflect the complexities of laboratory work.

The focus should shift from simply demonstrating AI’s ability to answer biological questions to assessing its ability to support scientists in doing biology. By acknowledging the limitations of current AI and focusing on targeted applications, we can begin to unlock its true potential for accelerating scientific discovery.

👉 More information
🗞 Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology
🧠 arXiv: https://arxiv.org/abs/2602.16703

Rohail T.


I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
