Researchers have long sought to replicate the iterative process of human report writing with Deep Research Agents (DRAs), but a crucial question has remained largely unaddressed: can these agents reliably revise reports based on feedback? Bingsen Chen, Boyan Li, and Ping Nie, together with Yuyu Zhang, Xi Ye, Chen Zhao, and colleagues, now demonstrate a significant weakness in current DRAs: their unreliability in multi-turn report revision. Their new evaluation suite, Mr Dre, shows that while agents can respond to direct feedback, they frequently regress on previously accurate content and citation quality, disrupting even well-established sections of the report. This finding challenges the notion of DRAs as truly iterative research partners and suggests limits to their ability to maintain coherence over extended writing processes.
This work introduces Mr Dre, a novel evaluation suite designed to assess multi-turn report revision, a crucial aspect of research report writing that existing DRA benchmarks do not cover. The research establishes that current DRAs struggle to reliably improve reports through iterative feedback, a process mirroring how human researchers draft and refine their work. The researchers developed a unified long-form report evaluation protocol, encompassing comprehensiveness, factuality, and presentation, alongside a human-verified feedback simulation pipeline to facilitate multi-turn revision testing.
Experiments conducted with five diverse DRAs reveal a concerning trend: agents regress on 16-27% of previously covered content and exhibit declining citation quality even while successfully incorporating user-provided edits. The study shows that even the best-performing agents fail to consistently preserve earlier revisions or avoid disrupting content unrelated to the feedback, leaving substantial room for improvement over multiple revision cycles. This regression is not easily rectified through simple inference-time adjustments, such as prompt engineering or employing a dedicated sub-agent specifically for report revision. The Mr Dre evaluation suite unifies disparate evaluation metrics into a streamlined protocol, assessing reports across three key dimensions: comprehensiveness, factuality, and presentation.
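The paper names the three evaluation dimensions but the summary gives no concrete scoring schema, so the sketch below is only an illustration of how per-report scores along those dimensions might be recorded and combined. The field names, the 0-1 scale, and the equal-weight aggregation are assumptions for this example, not details taken from Mr Dre.

```python
from dataclasses import dataclass

@dataclass
class ReportScore:
    """Illustrative per-report scores along the three Mr Dre dimensions (0-1 scale assumed)."""
    comprehensiveness: float  # coverage of the topics the query calls for
    factuality: float         # claims supported by the cited sources
    presentation: float       # structure, formatting, and readability

    def overall(self, weights=(1.0, 1.0, 1.0)) -> float:
        """Weighted average of the three dimensions; equal weights are an assumption."""
        w_c, w_f, w_p = weights
        total = w_c + w_f + w_p
        return (w_c * self.comprehensiveness
                + w_f * self.factuality
                + w_p * self.presentation) / total
```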
To enable realistic multi-turn evaluation, the team constructed a human-verified feedback simulation pipeline that generates plausible user feedback on both content and formatting. Analysis using Mr Dre demonstrates that, despite addressing over 90% of requested edits, DRAs introduce errors and inconsistencies that undermine the overall quality of the report. The research highlights a significant gap in current DRA capabilities, suggesting that substantial advances in training methodologies and architectural design are needed to achieve reliable multi-turn report revision. The work establishes multi-turn report revision as a vital new evaluation axis for DRAs, moving beyond the limitations of single-shot report generation benchmarks, and underscores that current agents struggle to balance incorporating new feedback with preserving the integrity of existing content. At its core, Mr Dre pairs the unified evaluation protocol, which assesses long-form reports across comprehensiveness, factuality, and presentation quality, with the feedback simulation pipeline, enabling realistic multi-turn revision scenarios and rigorous comparative analysis.
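The feedback simulation is described only at a high level here (model-generated feedback on content and formatting, verified by humans), so the following sketch is a guess at its general shape rather than the paper's actual pipeline. The prompt template, the caller-supplied `call_llm` wrapper, and the JSON fields are all hypothetical.

```python
import json
from typing import Callable

FEEDBACK_PROMPT = """You are simulating a reader reviewing a research report.
Propose ONE piece of feedback of type "{kind}" (content or formatting).
Report:
{report}

Respond as JSON: {{"type": "...", "target_section": "...", "request": "..."}}"""

def simulate_feedback(report: str, kind: str,
                      call_llm: Callable[[str], str]) -> dict:
    """Draft a single simulated feedback item. `call_llm` is any text-in/text-out
    model wrapper supplied by the caller, not an API from the paper."""
    raw = call_llm(FEEDBACK_PROMPT.format(kind=kind, report=report))
    item = json.loads(raw)
    # In the paper's pipeline, simulated feedback is human-verified before use;
    # here that step is only marked, not implemented.
    item["human_verified"] = False
    return item
```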
To establish a baseline, the team analysed five diverse DRAs, subjecting them to the Mr Dre protocol and systematically evaluating their performance across multiple revision cycles. Experiments employed a controlled setup in which agents received user feedback and were tasked with revising their initial reports, with subsequent iterations assessed for improvements and regressions. The study revealed a significant limitation: despite addressing most user feedback effectively, agents regressed on 16-27% of previously covered content and declined in citation quality, a critical flaw for the reliability of generated reports. This regression was observed even in the best-performing agents, indicating a systemic challenge in maintaining consistency throughout the revision process.
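A minimal sketch of such a revision loop with per-turn regression tracking is shown below. The paper's code is not reproduced here: the `revise` and `covers` callables stand in for the agent and the coverage judge, and the regression formula (the fraction of previously covered checklist items lost after a turn) is an assumption made for illustration.

```python
from typing import Callable, Iterable

def run_revision_turns(initial_report: str,
                       feedback_items: Iterable[dict],
                       revise: Callable[[str, dict], str],
                       covers: Callable[[str, str], bool],
                       checklist: list[str]) -> list[dict]:
    """Run a multi-turn revision loop and track regression on previously covered
    checklist items. All function names here are illustrative stand-ins."""
    report = initial_report
    results = []
    covered_before = {item for item in checklist if covers(report, item)}
    for turn, feedback in enumerate(feedback_items, start=1):
        report = revise(report, feedback)                    # agent applies the requested edit
        covered_now = {item for item in checklist if covers(report, item)}
        lost = covered_before - covered_now                  # previously covered, now missing
        regression = len(lost) / max(len(covered_before), 1)
        results.append({"turn": turn,
                        "regression_rate": regression,
                        "lost_items": sorted(lost)})
        covered_before = covered_now
    return results
```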
Further investigation demonstrated that these issues were not easily resolved through simple inference-time fixes, such as employing a dedicated sub-agent solely for report revision. The research team meticulously tracked content disruption outside the feedback scope and failures to preserve earlier edits, highlighting the difficulty of maintaining coherence over multiple turns (a sketch of one such check appears after this paragraph). Mr Dre pairs its unified evaluation protocol for long-form reports, covering comprehensiveness, factuality, and presentation, with the human-verified feedback simulation pipeline for multi-turn revision scenarios. Applied to five diverse DRAs, it confirms the central finding: while agents can generally address direct user feedback, they regress on 16-27% of previously covered content and show declines in citation quality.
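One way to quantify disruption outside the feedback scope is to compare, between consecutive report versions, the sections the feedback did not target. The snippet below is only an illustrative stand-in for whatever judge Mr Dre actually uses; the section-keyed representation and the `difflib` similarity ratio are chosen purely for demonstration.

```python
import difflib

def out_of_scope_changes(prev_sections: dict[str, str],
                         new_sections: dict[str, str],
                         feedback_target: str) -> dict[str, float]:
    """For every section the feedback did NOT target, measure how much the
    revision changed it (1.0 = untouched, lower = more disruption)."""
    changes = {}
    for name, old_text in prev_sections.items():
        if name == feedback_target:
            continue  # changes here were requested, so they do not count as disruption
        new_text = new_sections.get(name, "")
        similarity = difflib.SequenceMatcher(None, old_text, new_text).ratio()
        changes[name] = similarity
    return changes
```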
Even the best-performing agents demonstrated considerable room for improvement over multiple revision turns, continuing to disrupt content unrelated to the feedback and failing to consistently preserve earlier edits. The authors acknowledge limitations, including an incomplete understanding of the causes behind this unreliability and the lack of investigation into the effect of model scaling on revision performance. They suggest that future research should systematically analyse the underlying causes, explore the impact of larger models, enhance the feedback simulation pipeline to account for varying checklist quality, and incorporate length-aware evaluation metrics.
👉 More information
🗞 Beyond Single-shot Writing: Deep Research Agents are Unreliable at Multi-turn Report Revision
🧠 ArXiv: https://arxiv.org/abs/2601.13217
