Maintaining consistent and relevant information during extended interactions is a significant hurdle in developing goal-oriented artificial intelligence systems. Ruozhen Yang, Yucheng Jiang, and Yueqi Jiang, all from the University of Illinois Urbana-Champaign, alongside Priyanka Kargupta, Yunyi Zhang, and Jiawei Han, address this problem with a novel agentic memory system called STITCH (Structured Intent Tracking in Contextual History). Their research introduces a method for indexing information not just by content but by ‘contextual intent’ (the underlying goal, action, and key entities), allowing the system to retrieve more accurate and relevant memories. The approach improves performance on complex, dynamic tasks, surpassing the strongest baseline by 35.6% across the newly introduced CAME-Bench benchmark and LongMemEval, and represents a substantial step towards robust long-horizon reasoning in artificial intelligence. By filtering out information that does not fit the current context, STITCH reduces noise and improves the reliability of memory retrieval.
Structured Intent Tracking for Long-Term Memory
Scientists demonstrate a significant advancement in agentic memory systems, tackling the challenges of long-horizon, goal-oriented interactions for large language models. Their system, STITCH, indexes each step of an interaction with a structured cue called contextual intent. This intent comprises the current latent goal, the action type being performed, and the salient entity types, enabling more accurate retrieval of relevant historical information. The research establishes a method for disambiguating repeated mentions and reducing interference in complex, dynamic environments where similar entities and facts recur under varying constraints.
The core innovation of STITCH lies in its ability to model the underlying intent of each step, grounding memory retrieval in a structured understanding of the task at hand. Researchers instantiate ‘contextual intent’ as a three-part cue: a thematic scope identifying the current goal, an event type describing the action, and key entity types defining relevant attributes. During operation, STITCH filters and prioritizes memory snippets based on compatibility with this intent, effectively suppressing semantically similar but contextually inappropriate historical data. This approach addresses limitations in existing agentic memory systems that struggle with linking non-adjacent segments and disambiguating repeated entity mentions.
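To make the three-part cue concrete, here is a minimal Python sketch of intent-based filtering. It is an illustration under stated assumptions, not the authors' implementation: the class names, the `compatibility` function, its weights, and the `min_score` threshold are all hypothetical, chosen only to show how a snippet with matching entities but a different goal can be suppressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextualIntent:
    """Hypothetical container for the three-part cue described in the article."""
    thematic_scope: str       # current latent goal, e.g. "plan Tokyo trip"
    event_type: str           # action being performed, e.g. "compare_options"
    entity_types: frozenset   # salient entity types, e.g. {"hotel", "price"}


@dataclass(frozen=True)
class MemorySnippet:
    text: str
    intent: ContextualIntent


def compatibility(query: ContextualIntent, stored: ContextualIntent) -> float:
    """Toy score in [0, 1]. The 0.5 / 0.2 / 0.3 weights are illustrative
    assumptions, not values reported by the paper."""
    scope = 1.0 if query.thematic_scope == stored.thematic_scope else 0.0
    event = 1.0 if query.event_type == stored.event_type else 0.0
    union = query.entity_types | stored.entity_types
    overlap = len(query.entity_types & stored.entity_types) / len(union) if union else 0.0
    return 0.5 * scope + 0.2 * event + 0.3 * overlap


def retrieve(query, memory, min_score=0.6, top_k=3):
    """Drop snippets below min_score, then return the top_k most compatible."""
    scored = [(compatibility(query, m.intent), m) for m in memory]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in kept[:top_k]]


if __name__ == "__main__":
    entities = frozenset({"hotel", "price"})
    query = ContextualIntent("plan Tokyo trip", "compare_options", entities)
    memory = [
        MemorySnippet("Paris hotel quote", ContextualIntent("plan Paris trip", "compare_options", entities)),
        MemorySnippet("Tokyo hotel quote", ContextualIntent("plan Tokyo trip", "book", entities)),
    ]
    for snippet in retrieve(query, memory):
        print(snippet.text)
```

In this toy setup both snippets mention the same entity types, yet only the one recorded under the same thematic scope passes the threshold, which mirrors the kind of suppression the article describes.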
To evaluate STITCH, the scientists introduce CAME-Bench, a new benchmark designed for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Experiments on both CAME-Bench and LongMemEval show that STITCH achieves state-of-the-art performance, surpassing the strongest baseline by 35.6%. The gains are most pronounced as trajectory length increases, demonstrating the system’s effectiveness in long-horizon reasoning scenarios. Analysis of the results shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning. This improvement matters for applications that require agents to track interleaved goals, resolve references, and coordinate actions over extended periods, such as complex human-agent dialogues, deep research workflows, and autonomous tool-augmented environments.
The system indexes each step of a trajectory with a structured retrieval cue, the contextual intent, and retrieves historical data by matching stored cues against the current step’s intent. Contextual intent comprises three signals: the current latent goal defining a thematic segment, the action type being performed, and the salient entity types relevant to the task. To construct it, the team first developed a method for inducing thematic scope, which represents the overarching goal context.
A sliding window approach, coupled with a large language model (LLM) predictor, analyses each step alongside a recent history buffer and the previous scope to detect shifts in the latent goal. This scope remains consistent until a boundary event, indicating a change in goal-state, is identified by the LLM, at which point a new label is induced. To prevent information overload, a compressed summary of the current scope is maintained and used for subsequent predictions. Beyond thematic scope, the research also details a taxonomic event labeling technique. The system induces event types directly from the trajectory, avoiding reliance on a pre-defined ontology and allowing adaptation to diverse task domains.
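The thematic-scope detection described at the start of the preceding paragraph can be sketched as a loop over steps with a bounded history buffer. This is a rough illustration, not the paper's code: the `track_scopes` function, the predictor signature, the window size, and the string-based summary are assumptions, and `keyword_predictor` is a trivial stand-in for the LLM call.

```python
from collections import deque
from typing import Callable, List, Optional, Tuple

# Hypothetical predictor signature: (step, recent_history, scope_summary)
# returns a new scope label at a boundary event, or None if the goal is unchanged.
ScopePredictor = Callable[[str, List[str], Optional[str]], Optional[str]]


def track_scopes(steps: List[str], predictor: ScopePredictor,
                 window: int = 3) -> List[Tuple[str, str]]:
    """Assign a thematic scope to every step of a trajectory."""
    history = deque(maxlen=window)   # sliding window of recent steps
    scope: Optional[str] = None
    summary: Optional[str] = None    # compressed view of the current scope
    labelled = []
    for step in steps:
        new_label = predictor(step, list(history), summary)
        if new_label is not None:
            scope = new_label        # boundary event: induce a new label
        elif scope is None:
            scope = "unspecified"    # no goal has been identified yet
        # Assumed compression scheme: scope label plus a truncated latest step.
        summary = f"{scope} | latest: {step[:60]}"
        labelled.append((step, scope))
        history.append(step)
    return labelled


def keyword_predictor(step, history, summary):
    """Stand-in for the LLM: a step starting with 'Goal:' marks a boundary."""
    if step.lower().startswith("goal:"):
        return step.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    trajectory = ["Goal: book flight", "search SFO to NRT", "compare fares",
                  "Goal: reserve hotel", "filter by price"]
    for step, scope in track_scopes(trajectory, keyword_predictor):
        print(f"{scope:15} <- {step}")
```

Passing the compressed summary rather than the full scope history keeps each predictor call small, which is the motivation the article gives for maintaining it.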
A dynamic label-space evolution strategy begins with a seed vocabulary generated from the initial trajectory steps. For each new step, it retrieves semantically similar labels and prompts an LLM to select the best fit. Unfamiliar actions trigger the addition of new labels, while periodic refinement consolidates overlapping terms, keeping the taxonomy compact and distinct.
To validate the approach, the team ran experiments on CAME-Bench, a multi-domain benchmark for context-aware retrieval in long, goal-oriented trajectories, and on LongMemEval. STITCH consistently outperformed strong baselines, surpassing the strongest one by 35.6% overall, with the largest gains at longer trajectory lengths. The experiments indicate that indexing trajectory steps with contextual intent significantly reduces retrieval noise and supports robust long-horizon reasoning. STITCH disambiguates repeated information by leveraging the three components of contextual intent: the current latent goal, the action type, and the salient entity types.
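The label-space evolution strategy described above (seed vocabulary, retrieval of similar labels, selection or addition of a label, periodic consolidation) can be sketched as follows. Everything here is a hypothetical illustration: `LabelSpace`, `threshold_selector`, and the thresholds are invented names and values, and character-level similarity from Python's `difflib` stands in for the semantic similarity and the LLM selection that the article mentions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level ratio; an assumed stand-in for semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class LabelSpace:
    """Hypothetical evolving vocabulary of event-type labels."""

    def __init__(self, seed_labels, top_k=3):
        self.labels = list(dict.fromkeys(seed_labels))  # de-duplicate, keep order
        self.top_k = top_k

    def candidates(self, description):
        """Return the top_k existing labels most similar to the description."""
        ranked = sorted(self.labels,
                        key=lambda lab: similarity(description, lab),
                        reverse=True)
        return ranked[: self.top_k]

    def assign(self, description, selector):
        """Ask the selector to pick a candidate; add a new label if it declines."""
        choice = selector(description, self.candidates(description))
        if choice is None:
            choice = description  # assumption: reuse the description as the label
            if choice not in self.labels:
                self.labels.append(choice)
        return choice

    def consolidate(self, threshold=0.8):
        """Merge near-duplicate labels; return a mapping from old to kept label."""
        kept, mapping = [], {}
        for lab in self.labels:
            match = next((k for k in kept if similarity(lab, k) >= threshold), None)
            mapping[lab] = match if match is not None else lab
            if match is None:
                kept.append(lab)
        self.labels = kept
        return mapping


def threshold_selector(description, candidates, min_sim=0.6):
    """Stand-in for the LLM choice: accept the best candidate if close enough."""
    if candidates and similarity(description, candidates[0]) >= min_sim:
        return candidates[0]
    return None


if __name__ == "__main__":
    space = LabelSpace(["search_flights", "compare_prices"])
    print(space.assign("search_flight", threshold_selector))
    print(space.assign("send_email", threshold_selector))
    print(space.labels)
```

A real system would replace both stand-ins with an embedding lookup and a model call, but the control flow (reuse when a close label exists, extend otherwise, consolidate periodically) follows the description in the text.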
STITCH achieves a Macro-F1 score of 0.844 on the CAME-Bench Medium subset and 0.682 on the Large subset, along with 0.860 on LongMemEval. These measurements indicate that STITCH surpasses existing methods, particularly as trajectory length increases, mitigating the “lost-in-the-middle” phenomenon often observed in long-context models. To ensure a fair comparison, the team enforced a shared retrieval budget of 4,096 tokens for all methods. A detailed ablation study then examined the role of each contextual intent component.
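A shared retrieval budget such as the 4,096-token limit mentioned above is commonly enforced by packing ranked snippets until the limit is reached. The sketch below shows one way to do that and makes no claim about how the authors implemented it: `pack_within_budget` is a hypothetical helper, whitespace splitting is an assumed approximation of tokenisation, and skipping oversized snippets (rather than stopping at the first one) is a design choice made here for illustration.

```python
def pack_within_budget(ranked_snippets, budget=4096, count_tokens=None):
    """Greedily keep snippets, in rank order, while the token total fits.

    count_tokens defaults to whitespace splitting, which is only a rough
    approximation of a real tokenizer.
    """
    count = count_tokens or (lambda text: len(text.split()))
    packed, used = [], 0
    for snippet in ranked_snippets:
        cost = count(snippet)
        if used + cost > budget:
            continue  # skip this snippet, a later and shorter one may still fit
        packed.append(snippet)
        used += cost
    return packed, used


if __name__ == "__main__":
    ranked = ["a b c", "d e f g", "h"]
    print(pack_within_budget(ranked, budget=4))
```

Applying the same packing routine and the same token counter to every method is what makes a fixed budget a fair basis for comparison.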
The ablation results show that the thematic scope component contributes most to the performance gains, reducing context noise by segmenting trajectories into behaviour episodes. Removing the thematic scope reduced the CAME-Bench Macro-F1 score from 0.844 to 0.463, highlighting its importance. Taxonomic event labeling and coreference resolution also improved performance, with the coreference module grounding entities and preventing the retrieval of ambiguous context snippets. The study’s findings are presented in Table 1, which compares STITCH against various baseline methods across different subsets of CAME-Bench and LongMemEval. STITCH consistently outperforms the alternatives, and a paired t-test against the strongest baseline within each subset yields a p-value below 0.05. The result is a robust approach to intent-aware memory that lets agents maintain context and reason effectively over extended interactions, with potential applications in areas such as personal assistants, robotics, and complex task management.
Contextual Intent Improves Agentic Long-Term Memory
This research presents STITCH, an agentic memory system designed to improve performance in long-horizon, goal-oriented interactions. The system indexes each step within a trajectory with a structured retrieval cue, termed ‘contextual intent’, encompassing the latent goal, action type, and salient entity types. By matching current intent with historical data, STITCH filters and prioritizes relevant memory snippets, reducing interference from semantically similar but contextually mismatched information. Evaluations on the newly introduced CAME-Bench benchmark, alongside LongMemEval, show that STITCH achieves state-of-the-art results, surpassing the strongest baseline by 35.6%, with the margin widening as trajectory length increases.
This improvement suggests that intent-aware memory indexing significantly enhances robustness in long-term reasoning tasks. The authors acknowledge limitations in ingestion speed, which stem from the multiple language model calls required to construct contextual intent tuples, and note that the buffered update strategy used for label evolution introduces minor latency. Future work could explore hierarchical schema induction and lightweight structural predictors to further optimise the balance between efficiency and performance. The research also highlights the importance of carefully constructed benchmarks, such as CAME-Bench, for accurately assessing the capabilities of agentic memory systems in realistic, dynamic environments. The system’s design prioritises precision and structural coherence, offering a promising approach to managing complex, long-term interactions for artificial agents.
👉 More information
🗞 Grounding Agent Memory in Contextual Intent
🧠 ArXiv: https://arxiv.org/abs/2601.10702
