Researchers are tackling the computational expense of enhancing Large Language Model (LLM) agents for software engineering (SWE) tasks via test-time scaling. Yifeng Ding, Lingming Zhang, and colleagues from the Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign present a novel approach called SWE-Replay to address this challenge. Unlike existing methods that rely on potentially unreliable value estimations, SWE-Replay scales efficiently by recycling successful trajectories from previous attempts, intelligently switching between exploring new solutions and exploiting archived experience at key decision points. Evaluated on benchmarks including SWE-Bench Verified, SWE-Bench Pro, and SWE-Bench Multilingual, this technique demonstrates significant cost reductions of up to 17.4% alongside performance improvements of up to 3.8%, marking a substantial advance in robust and efficient test-time scaling for modern software engineering agents.
The standard method of repeatedly generating solutions from scratch is computationally expensive, and recent attempts to reduce this cost using value agents have struggled with accuracy and generalisation to agents that create custom bash scripts. SWE-Replay addresses these limitations by efficiently recycling previously sampled trajectories, dynamically choosing between exploring entirely new solutions and exploiting existing experience by branching at critical points within prior attempts. This selection process prioritises the potential for repository exploration and the reasoning behind each step, rather than relying on potentially unreliable external evaluations from other LLMs.
The research establishes SWE-Replay as the first generalisable test-time scaling technique for modern agents, eliminating the need for potentially noisy value estimates. SWE-Replay optimises the scaling process by maintaining an archive of past trajectories and intelligently deciding whether to begin a new search or resume a previous one at a strategically chosen intermediate step. This branching mechanism, illustrated in Figure 1, allows the agent to bypass redundant computations and focus on promising areas of the codebase. By restoring the environment state and sampling a new step to continue exploration, SWE-Replay streamlines the process and ensures scalability for complex software engineering challenges.
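The branching mechanism described above can be sketched in a few lines. The code below is a minimal, hypothetical illustration: the `Step`, `Trajectory`, `choose_branch_point`, and `replay_or_explore` names, the per-step `score`, and the `explore_prob` parameter are all assumptions for illustration, not the authors' actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Step:
    action: str      # e.g. a bash command or file edit
    reasoning: str   # the agent's stated rationale for this step
    score: float     # hypothetical exploration-potential score

@dataclass
class Trajectory:
    steps: list

def choose_branch_point(trajectory):
    """Pick the intermediate step with the highest exploration-potential
    score as the point to resume from (an assumed heuristic)."""
    return max(range(len(trajectory.steps)),
               key=lambda i: trajectory.steps[i].score)

def replay_or_explore(archive, explore_prob=0.5, rng=random):
    """Decide between sampling a fresh trajectory ('explore') and
    branching off an archived one at a chosen step ('exploit')."""
    if not archive or rng.random() < explore_prob:
        return ("explore", None, 0)   # start a new trajectory from scratch
    traj = rng.choice(archive)
    k = choose_branch_point(traj)
    return ("exploit", traj, k)       # resume traj at step k
```

On an "exploit" decision, the agent would restore the environment snapshot associated with step `k` and sample a fresh continuation from there, skipping the redundant prefix.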
Experiments show that, on the SWE-Bench Verified benchmark, SWE-Replay consistently outperforms naive scaling, reducing computational costs by up to 17.4% while simultaneously maintaining or improving performance by up to 3.8%. The team achieved these results across three different LLM backends and two agentic scaffolds, demonstrating the robustness of the approach. Analysis reveals that SWE-Replay effectively directs exploration towards less-visited areas of the repository, as visualised in Figure 6, indicating a more thorough search of the codebase. The work opens new avenues for optimising LLM agents in software engineering, offering a streamlined and efficient method for test-time scaling. The study addressed the computational expense of repeatedly sampling trajectories from scratch, a standard practice in test-time scaling, by pioneering a method that recycles prior trial data. Researchers implemented a dynamic branching system, enabling the agent to either explore new trajectories or exploit archived experience at critical intermediate steps within a task. This innovative approach circumvents the need for potentially inaccurate value estimates, a limitation of previous methods employing specialized value agents.
The core of SWE-Replay lies in its trajectory archive, which stores previously sampled trajectories for potential reuse. During operation, the system dynamically selects whether to initiate a new trajectory through stochastic sampling or to resume an existing one from a specific point within the archive. This selection process prioritises repository exploration potential and reasoning significance, rather than relying on external LLM-based quality assessments. Experiments employed the SWE-Bench Verified dataset to rigorously evaluate SWE-Replay’s performance against naive scaling methods, demonstrating a reduction in computational costs of up to 17.4% while maintaining or improving performance by up to 3.8%. The experimental setup involved running the SWE-agent framework, equipping LLMs with tools such as terminals, editors, and search engines, to solve complex SWE tasks. Scientists meticulously tracked the number of trajectories sampled and the resulting performance metrics to quantify the efficiency gains achieved by SWE-Replay. The results reveal that by intelligently reusing prior experience, SWE-Replay establishes a robust foundation for efficient test-time scaling, particularly for modern agents capable of synthesising custom bash scripts.
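A trajectory archive supporting mid-trajectory resumption might look like the following. This is a hypothetical sketch: the `TrajectoryArchive` class and the idea of storing one environment snapshot per step are assumptions introduced here to make the resume operation concrete, not details confirmed by the paper.

```python
class TrajectoryArchive:
    """Stores sampled trajectories together with per-step environment
    snapshots so a run can be resumed at an intermediate step."""

    def __init__(self):
        self._trajectories = []

    def add(self, steps, snapshots):
        # One snapshot per step, captured after the step executed.
        assert len(steps) == len(snapshots)
        self._trajectories.append({"steps": steps, "snapshots": snapshots})

    def resume_state(self, traj_idx, step_idx):
        """Return the snapshot needed to branch after step_idx, plus the
        prefix of steps already taken up to and including it."""
        t = self._trajectories[traj_idx]
        return t["snapshots"][step_idx], t["steps"][: step_idx + 1]

    def __len__(self):
        return len(self._trajectories)
```

Restoring a stored snapshot rather than replaying the prefix from scratch is what lets the agent skip redundant computation.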
The study pioneered a unique approach to trajectory selection, focusing on the potential for further repository exploration at each intermediate step. Researchers assessed this potential by analysing the reasoning significance of each step, effectively identifying points where resuming a prior trajectory could lead to more fruitful outcomes. This method contrasts with existing techniques that rely on external LLM-based quality estimates, which can be prone to miscalibration and incompatibility with modern agentic frameworks. The research introduces the first efficient and generalizable test-time scaling method for modern agents, circumventing reliance on potentially inaccurate value estimates. SWE-Replay optimises scaling by recycling trajectories from previous trials, strategically choosing to either explore from scratch or exploit archived experience by branching at crucial intermediate steps. This selection process is guided by the potential and reasoning significance of repository exploration, rather than relying on external LLM-based quality assessments.
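One plausible way to operationalise "exploration potential" and "reasoning significance" is sketched below. Everything here is an assumption for illustration: the file-novelty proxy, the reasoning-length proxy, and the `alpha` mixing weight are invented stand-ins for whatever scoring the authors actually use.

```python
def exploration_potential(step_files, visited_files):
    """Fraction of files touched at this step that the search has not
    visited before (a hypothetical proxy for exploration potential)."""
    if not step_files:
        return 0.0
    new = sum(1 for f in step_files if f not in visited_files)
    return new / len(step_files)

def reasoning_significance(reasoning_text):
    """Crude proxy: longer reasoning traces are treated as more
    deliberate decision points (purely illustrative)."""
    return len(reasoning_text.split())

def step_priority(step_files, visited_files, reasoning_text, alpha=0.5):
    """Blend the two signals; alpha is an assumed mixing weight."""
    pot = exploration_potential(step_files, visited_files)
    sig = min(reasoning_significance(reasoning_text) / 100.0, 1.0)
    return alpha * pot + (1 - alpha) * sig
```

A scorer of this shape needs no extra LLM calls, which is the property that distinguishes it from LLM-as-a-Judge value estimation.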
Experiments on SWE-Bench Verified demonstrate that SWE-Replay consistently outperforms standard scaling methods, achieving cost reductions of up to 17.4% while maintaining or improving performance by up to 3.8%. The team measured cost reduction as a percentage decrease in trajectory sampling, directly correlating to computational savings. Performance was assessed using the resolve rate, which SWE-Replay improved by as much as 3.8% across various LLM backends and agentic scaffolds. Specifically, the study recorded consistent performance gains across different software issues, establishing the robustness of the technique. Analysis reveals that SWE-Replay effectively directs exploration towards the long-tail of repository files, indicating a more comprehensive search of the codebase. Visualisations, such as Figure 6 in the research, demonstrate this shift in exploration patterns. The core of SWE-Replay lies in its ability to reuse previously sampled trajectories by resuming exploration at carefully selected intermediate steps.
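The two reported metrics are straightforward to state precisely. The helper below is a trivial sketch; the specific numbers in the usage note are made up to show the arithmetic, not figures from the paper's experiments.

```python
def cost_reduction(baseline_trajectories, replay_trajectories):
    """Percentage decrease in trajectories sampled vs. naive scaling."""
    return 100.0 * (baseline_trajectories - replay_trajectories) / baseline_trajectories

def resolve_rate(resolved, total):
    """Fraction of benchmark issues resolved, as a percentage."""
    return 100.0 * resolved / total
```

For example, if naive scaling sampled 1000 trajectories and SWE-Replay needed only 826 for the same tasks (hypothetical counts), the cost reduction would be 17.4%.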
The algorithm maintains an archive of trajectories, updating it with new data, and iteratively decides whether to generate a new trajectory or exploit existing ones. Scientists measured the reasoning intensity of each step to identify critical points for branching, enabling the agent to revisit valuable regions of the search space. This approach bypasses the need for potentially inaccurate LLM-as-a-Judge evaluations and employs a streamlined select-and-replay mechanism to ensure scalability. The work presents a theoretical foundation linking replay optimisation to the quality of exploration within SWE-Replay.
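The iterative select-and-replay loop described above can be summarised as follows. This is a structural sketch only: `sample_new`, `resume_from`, and `select_branch` are hypothetical callables standing in for the agent rollout, the snapshot-restore-and-continue step, and the priority-based selection respectively.

```python
def swe_replay_loop(sample_new, resume_from, select_branch, budget):
    """Sketch of an iterative select-and-replay loop.

    sample_new()           -> a trajectory sampled from scratch
    resume_from(traj, k)   -> a trajectory branched from traj at step k
    select_branch(archive) -> (traj, k) to exploit, or None to explore
    """
    archive = []
    results = []
    for _ in range(budget):
        choice = select_branch(archive)
        traj = sample_new() if choice is None else resume_from(*choice)
        archive.append(traj)   # every attempt becomes reusable experience
        results.append(traj)
    return results
```

Because the archive grows with every attempt, later iterations have progressively more branch points to choose from, which is where the cost savings accumulate.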
SWE-Replay boosts efficiency and solution rates by streamlining trajectory reuse
Researchers have developed SWE-Replay, a new framework for efficient test-time scaling of software engineering agents. This technique addresses the computational expense of repeatedly sampling trajectories by recycling previous attempts, dynamically choosing between exploring new solutions and exploiting existing experience. The system branches at critical points in the process, guided by the potential and reasoning significance of repository exploration, rather than relying on potentially unreliable external evaluations. On SWE-Bench Verified, this reduces computational costs by up to 17.4% while simultaneously improving solution rates by up to 3.8%. Furthermore, SWE-Replay exhibits consistent generalisability across diverse software engineering problems and agent architectures. The authors acknowledge that existing methods often struggle with miscalibration or are limited to specific agent designs, whereas SWE-Replay avoids these issues through its streamlined approach. Future work could explore extending the replay mechanism to even more complex scenarios or integrating it with other optimisation techniques.
👉 More information
🗞 SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents
🧠 ArXiv: https://arxiv.org/abs/2601.22129
