Scientists are increasingly focused on optimising the performance of large language models (LLMs) as they become integral to search, assistance and agentic workflows. Asmit Kumar Singh, Haozhe Wang, Laxmi Naga Santosh Attaluri, Tak Chiam and Weihua Zhu, all engineers at Apple, demonstrate a novel approach to semantic caching that addresses the limitations of current tiered static-dynamic designs. Their research introduces Krites, an asynchronous caching policy that leverages LLM-based judgement to expand static cache coverage without impacting serving latency. This work is significant because it overcomes the traditional trade-off between cache conservatism and accuracy, increasing the proportion of requests served with curated static answers by up to 3.9 times in conversational and search scenarios, as shown through trace-driven simulations.
As LLMs become integral to search, assistance, and automated processes, reducing the computational cost and latency of these systems is paramount. Current semantic caching techniques rely on embedding similarity to retrieve answers, but struggle with a fundamental trade-off: overly strict similarity thresholds miss opportunities for reuse, while lenient thresholds risk delivering inaccurate responses.
Krites addresses this challenge by introducing an asynchronous verification step using another LLM to judge the suitability of cached answers. The core innovation lies in its ability to operate like a standard caching system on the primary processing path. When a potential answer is found in the static cache but falls just below the similarity threshold, Krites initiates a background check with a separate LLM judge.
This judge determines whether the cached response is appropriate for the new prompt; if approved, the answer is seamlessly added to the dynamic cache for future use. Through trace-driven simulations using both conversational and search-based workloads, this approach increases the proportion of requests fulfilled with curated static answers by up to 3.9 times compared to existing methods.
This improvement is achieved without any increase in critical path latency, meaning users experience faster response times while benefiting from more reliable, pre-vetted information. Initial evaluations reveal that Krites increases the fraction of requests served with curated static answers by up to 136.5% for conversational traffic derived from the SemCacheLMArena dataset, relative to tuned baseline policies.
For search-style queries using the SemCacheSearchQueries dataset, Krites achieves an even more substantial gain of 290.3% in static-origin serves. These improvements are measured as increases in the proportion of requests fulfilled using pre-computed, curated static answers, encompassing both direct static cache hits and those promoted to the dynamic cache via verification.
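The percentage gains and the multipliers quoted in the paper are two views of the same numbers: a p% increase over a baseline is a (1 + p/100)× multiplier. A one-line conversion (the function name is ours) makes the correspondence explicit.

```python
def increase_to_multiplier(pct_increase):
    # A gain of p% over the baseline equals a (1 + p/100)x multiplier.
    return 1.0 + pct_increase / 100.0

conversational = increase_to_multiplier(136.5)  # ~2.4x (SemCacheLMArena)
search = increase_to_multiplier(290.3)          # ~3.9x (SemCacheSearchQueries)
```

This is why the 136.5% conversational gain is described as "more than doubling" static-origin serves, and why the 290.3% search gain corresponds to the headline "up to 3.9 times" figure.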
Analysis of the SemCacheLMArena dataset shows Krites more than doubles the fraction of traffic served with curated static answers, starting from a cold dynamic cache. Furthermore, on the SemCacheSearchQueries dataset, Krites increases static-origin serves by over 290% relative to the baseline, despite utilising thresholds already identified as Pareto-optimal in prior vCache analysis.
Krites effectively expands static cache coverage over time by populating the dynamic tier with verified links to static answers through auxiliary overwrites. The static tier was constructed from the history prefix of each benchmark dataset using a 60% coverage-based head selection strategy: identifying the smallest set of equivalence classes, groupings of semantically similar prompts, whose cumulative frequency accounted for 60% of requests within the history data.
One canonical representative prompt, the shortest within each class, was then chosen deterministically from the history prefix to populate the static tier, with its answer pre-computed ahead of serving. The requests following the history prefix, 80% of the trace, constituted the dynamic traffic stream, beginning with an empty dynamic tier that would be populated online as requests arrived.
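The head selection procedure can be sketched concretely. The greedy descending-frequency ordering below is our assumption about how the "smallest set" of classes is found; the function name and the (prompt, class_id) input shape are likewise illustrative.

```python
from collections import Counter


def build_static_tier(history, coverage=0.60):
    """Sketch of coverage-based head selection. `history` is a list of
    (prompt, class_id) pairs from the history prefix, where class_id
    labels the benchmark equivalence class."""
    counts = Counter(class_id for _, class_id in history)
    target = coverage * len(history)
    selected, covered = set(), 0
    # Taking classes in descending frequency yields the smallest set of
    # equivalence classes whose cumulative frequency reaches the target.
    for class_id, freq in counts.most_common():
        if covered >= target:
            break
        selected.add(class_id)
        covered += freq
    # Deterministic canonical representative: the shortest prompt in
    # the class, with ties broken lexicographically.
    return {
        class_id: min((p for p, c in history if c == class_id),
                      key=lambda p: (len(p), p))
        for class_id in selected
    }
```

Answers for the chosen representatives would then be pre-computed once, before the dynamic traffic stream is replayed.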
Krites operates on the principle of verifying potentially reusable static responses. When a prompt’s nearest static neighbour falls below the static similarity threshold, Krites asynchronously invokes a judge to assess the acceptability of the static response for the new prompt. This judgement, instantiated directly from the benchmark equivalence relation, determines whether the static response accurately addresses the incoming query.
Approved matches are then promoted into the dynamic cache, expanding the reach of curated static answers over time and enabling reuse for future, similar requests. Evaluations were conducted using trace-driven simulations on two open benchmarks, SemCacheLMArena and SemCacheSearchQueries, representing conversational and search-style workloads respectively.
To ensure model-agnosticism and consistency with prior work, a binary decision function, J(q, h, a), was used to approve promotions, leveraging the existing benchmark labels rather than running a live LLM judge during simulation. The evaluation focused solely on the held-out evaluation stream, preventing leakage from the static tier construction into the reported results and providing a robust assessment of online performance.
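Because the simulator replays labelled benchmark traces rather than calling a live model, J(q, h, a) can be instantiated directly from the equivalence labels. A minimal sketch, assuming a prompt-to-class mapping is available (the factory name is ours):

```python
def make_label_judge(equiv_class):
    """Build the binary decision function J(q, h, a) from benchmark
    labels: approve reuse iff the new query q and the cached prompt h
    fall in the same equivalence class. The answer a is unused here,
    since the labels already encode answer suitability."""
    def J(q, h, a):
        cq, ch = equiv_class.get(q), equiv_class.get(h)
        return cq is not None and cq == ch
    return J
```

This keeps the simulation model-agnostic: swapping in a live LLM judge would change only how J is produced, not how the caching policy consumes its verdicts.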
Static tier construction involved selecting the smallest set of equivalence classes accounting for 60% of history requests, with one canonical representative prompt chosen deterministically per class. Baseline error rates were maintained at roughly one to two percent, ensuring a strong comparison point for Krites’ performance. The asynchronous nature of Krites’ verification process preserves critical path latency and error behaviour for requests triggering verification, leaving the on-path decision rule unchanged.
The relentless rise of large language models demands a pragmatic response to escalating computational costs. Simply scaling up infrastructure isn’t sustainable, hence the growing focus on caching techniques to reuse previously generated responses. However, existing systems often force a blunt choice between accuracy and efficiency; a cautious approach misses opportunities, while an aggressive one risks serving up incorrect information.
This work offers a subtle but potentially significant refinement, introducing a system that intelligently expands the scope of cached responses without compromising on serving speed. Krites doesn’t reinvent the wheel but rather adds a layer of asynchronous verification, allowing for a more nuanced assessment of semantic similarity by leveraging another language model to validate potential cache hits that fall just below a pre-defined threshold.
This ‘second opinion’ is crucial, because it allows systems to cautiously extend their static cache, the most efficient tier, with responses that might previously have been discarded. The implications are considerable, particularly for high-volume applications like search and conversational AI where even marginal gains in efficiency translate to substantial savings.
The effectiveness of Krites hinges on the reliability of the judging LLM itself, a point rightly highlighted by ongoing work in benchmark development. Future research will likely explore adaptive strategies, tailoring the verification process to the complexity of the query and the confidence level of the initial similarity assessment.
👉 More information
🗞 Asynchronous Verified Semantic Caching for Tiered LLM Architectures
🧠 ArXiv: https://arxiv.org/abs/2602.13165
