Anthropic’s Claude artificial intelligence has demonstrated a capacity for frontier theoretical physics, completing a complex research calculation in two weeks rather than the year it would typically take a human physicist. Harvard professor Matthew Schwartz, a principal investigator at the NSF Institute for Artificial Intelligence and Fundamental Interactions, guided Claude Opus 4.5 through the entire process using only text prompts, never directly editing a file. The project spanned more than 110 separate drafts, 36 million tokens, and over 40 hours of local CPU compute, and produced a technically rigorous paper. Schwartz believes the result demonstrates a new capability for large language models. “This may be the most important paper I’ve ever written—not for the physics, but for the method,” he said, arguing that it marks a fundamental shift in how theoretical research can be conducted, and that there is no going back.
AI-Guided Physics Research with Claude Opus 4.5
A new collaboration between human expertise and artificial intelligence is reshaping theoretical physics research by dramatically accelerating research timelines. Professor Matthew Schwartz of Harvard University recently completed a high-energy theoretical physics paper in two weeks instead of the usual year by supervising the AI model Claude Opus 4.5 through a complete research calculation without directly editing any files. As Schwartz describes it, the experiment wasn’t about replacing the physicist, but about augmenting the physicist’s capabilities with an AI assistant able to handle complex computations and code. The project involved guiding Claude through a problem at the level of a second-year graduate student: resumming the Sudakov shoulder in the C-parameter. Schwartz deliberately chose this well-defined problem because, in his view, LLMs can already handle graduate coursework; what needed testing was whether they could handle research.
He explains, “I picked this problem because it connects directly to the foundations of our understanding of quantum theory,” emphasizing the importance of a technically rigorous challenge. Schwartz kept the experiment strictly contained, adhering to firm rules: only text prompts were used, and no direct file editing or pasting in of pre-existing calculations was permitted. The key to success lay in organization and prompting strategy. Rather than conducting a single continuous dialogue, Schwartz instructed Claude to create a detailed plan broken into seven stages, each documented in separate markdown files; the work ultimately spanned more than 110 separate drafts. “This organization step was enormously helpful,” he notes, because it allowed Claude to retrieve information from the files rather than relying on its limited context window.
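As a rough illustration of the kind of file tree the article describes, the sketch below sets up a seven-stage plan with one markdown note per stage. All file and directory names here are hypothetical; the article does not specify them, only that the work was organized as a tree of markdown files.

```python
# Hypothetical sketch of the organization strategy described in the article:
# a top-level plan plus one markdown file per stage, so the model can look
# information up on disk instead of holding it all in context.
# Names ("project_notes", "plan.md", "stage_N.md") are illustrative only.
from pathlib import Path


def make_plan(root: str, stages: int = 7) -> list[Path]:
    """Create a tree of stage notes under `root` and return the paths."""
    base = Path(root)
    base.mkdir(parents=True, exist_ok=True)
    paths = [base / "plan.md"]  # top-level plan document
    paths += [base / f"stage_{i}.md" for i in range(1, stages + 1)]
    for p in paths:
        p.touch()  # each stage gets its own persistent markdown file
    return paths


files = make_plan("project_notes")
print(len(files))  # 8: one plan plus seven stage documents
```

The point of the design is persistence: each stage’s results live in a file that later prompts can re-read, sidestepping the context-window limit the article mentions.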
IAIFI & Schwartz’s Decade of Machine Learning in Physics
Despite the success, Schwartz cautions that AI is not yet capable of fully autonomous scientific discovery. While Claude demonstrated impressive capabilities, its output required careful evaluation by a domain expert to ensure accuracy. “AI is not doing end-to-end science yet,” he stated, “But this project proves that I could create a set of prompts that can get Claude to do frontier science.” “This may be the most important paper I’ve ever written, not for the physics, but for the method,” he added. “There is no going back.” According to Schwartz, the key to the achievement was meticulous organization: the project was structured into seven stages spanning more than 110 drafts, with Claude Code maintaining a tree of markdown files so the AI could retrieve information efficiently. He sees this approach as a crucial intermediate step toward fully autonomous research, suggesting that LLMs may need a period of “graduate school” before tackling the most creative and open-ended problems in theoretical physics.
I think we can distill what is missing in current LLMs to a single word: Taste.
Sakana AI & the 2024-2025 Wave of AI Scientists
The surge in artificial intelligence capable of assisting, and potentially leading, scientific research gained considerable momentum throughout 2024 and 2025, with companies like Sakana AI spearheading the development of autonomous research systems. Sakana AI’s “AI Scientist,” released in August 2024, aimed to automate the entire research lifecycle, a bold ambition quickly echoed by competitors. February 2025 saw Google unveil its AI co-scientist built on Gemini, promising to accelerate hypothesis generation and evaluation, while the Allen Institute for AI (Ai2) launched the open-source Asta ecosystem, featuring tools like CodeScientist and AutoDiscovery to identify patterns within complex datasets. This rapid proliferation of AI research assistants, including FutureHouse’s Kosmos, the Autoscience Institute’s Carl, and the Simons Foundation’s Denario project, signals a fundamental shift in how scientific inquiry is conducted. However, early successes often relied on brute-force methods, as noted by researchers observing these systems.
Many of these approaches involved running numerous trials and designating the most favorable outcome as noteworthy, a tactic that, while yielding results, doesn’t necessarily represent genuine scientific advancement. “Maybe LLMs need to go to graduate school before advancing straight to the Ph.D.,” Schwartz suggests, advocating a staged development of AI scientific capabilities. His own work, dating back to an early application of deep learning to particle physics in 2016 and a 2022 Nature Reviews Physics piece on AI and human evolution, has long focused on pushing AI toward more complex symbolic manipulation.
It feels a bit like Magnus Carlsen taking on five grandmasters in parallel.
LLM Capabilities: From Mathematics to Theoretical Physics
The accelerating progress in large language models (LLMs) is extending beyond mathematical problem-solving into the complex domain of theoretical physics, pointing to a potential paradigm shift in how research is conducted. While earlier applications focused on data analysis, recent experiments show LLMs beginning to contribute to genuinely novel theoretical work, though only under significant expert oversight. Schwartz’s approach deliberately mirrored the progression of a graduate student, beginning with well-defined problems designed to build confidence and technique. A key element of his success was meticulous organization and a prompting strategy that prioritized structured output. The result, achieved in two weeks rather than the usual year, was a technically rigorous paper; even so, Schwartz emphasizes that domain expertise remains crucial.
Sudakov Shoulder Resummation as a Grad Student-Level Problem
Many assume artificial intelligence will soon independently conduct groundbreaking scientific research, but the reality is more nuanced: current AI capabilities are best suited to well-defined problems that mirror the work of a beginning graduate student. Schwartz explains, “The physics is understood in principle; what’s missing is a careful, complete treatment.” The C-parameter describes the spray of debris from high-energy particle collisions, and accurately predicting its distribution is a crucial test of our understanding of fundamental physics. Standard approximations break down at a specific point, the Sudakov shoulder, necessitating a more refined treatment. The goal wasn’t to solve a paradigm-shifting problem, but to demonstrate AI’s ability to handle a technically demanding calculation with a known solution, allowing rigorous evaluation of its accuracy. He structured the project into more than 110 separate drafts, using a hierarchical system of markdown files to keep Claude organized.
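The article does not define the observable itself, so for context: in the standard QCD event-shape literature, the C-parameter of a collision’s final-state momenta $\vec p_i$ is conventionally written as

```latex
C = \frac{3}{2}\,
    \frac{\sum_{i,j} |\vec p_i|\,|\vec p_j|\,\sin^2\theta_{ij}}
         {\left(\sum_i |\vec p_i|\right)^2},
```

where $\theta_{ij}$ is the angle between particles $i$ and $j$. The Sudakov shoulder referred to in the article is the point $C = 3/4$, the endpoint of the symmetric three-jet configuration, where fixed-order perturbation theory develops large logarithms that must be resummed. This definition is supplied here as standard background, not taken from the article.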
My conclusion is that current LLMs are at the G2 level.
40+ Hours Compute: Evaluating Claude’s Accuracy & Limitations
Schwartz deliberately selected a project mirroring the workload of a second-year graduate student, reasoning that “LLMs can already do all the coursework, so they are past the G1 stage.” He hypothesized that if an AI couldn’t reliably handle problems at this level, progress toward truly autonomous research would stall. The process again ran to more than 110 separate drafts, managed through a carefully structured prompting strategy. The experiment revealed Claude to be impressively capable, yet not flawless: while the model worked quickly and tirelessly, Schwartz found that domain expertise remained essential for verifying the accuracy of its calculations.
We do not give enough credit to taste.
