Anthropic Unveils Prompt Caching for Claude AI Models, Boosting Speed and Cutting Costs

Anthropic, a leading artificial intelligence company, has introduced prompt caching on its API, enabling developers to cache frequently used context between API calls. This innovation allows customers to provide Claude, Anthropic’s AI model, with more background knowledge and example outputs while reducing costs by up to 90% and latency by up to 85% for long prompts. Prompt caching is now available in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for Claude 3 Opus coming soon.

This technology has far-reaching implications for various applications, including conversational agents, coding assistants, large document processing, and detailed instruction sets. Early customers have seen substantial speed and cost improvements; Notion, for example, is integrating prompt caching into its AI assistant to optimize internal operations and enhance the user experience. Simon Last, Co-founder at Notion, praised the technology, stating that it will make Notion AI “faster and cheaper” while maintaining state-of-the-art quality.

Prompt Caching: Revolutionizing Conversational AI with Claude

Prompt caching, now in public beta on the Anthropic API, lets developers reuse frequently used context across API calls rather than resending it with every request. This makes it practical to give Claude extensive background knowledge and example outputs up front while significantly reducing cost and latency.
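In practice, the workflow is to mark the reusable portion of the prompt as cacheable and then keep sending an identical prefix. The sketch below shows one way this might look in Python; the cache_control field, the beta header value, the model string, and the style_guide.md file are assumptions based on the public-beta documentation rather than a definitive recipe, so consult Anthropic's API reference for the current parameters.

```python
# Minimal sketch of caching a long, reusable instruction block.
# Assumptions (not confirmed by this article): the `anthropic` Python SDK,
# the prompt-caching beta header value, the model string, and the
# hypothetical style_guide.md file.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_INSTRUCTIONS = open("style_guide.md").read()  # large, reusable context

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=[
            {"type": "text", "text": "You are a careful technical assistant."},
            {
                "type": "text",
                "text": LONG_INSTRUCTIONS,
                # Mark this block as cacheable; later calls that send the same
                # prefix read it from the cache instead of reprocessing it.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
        # Beta opt-in header at the time of writing.
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    )
    return response.content[0].text

# The first call writes the cache (priced slightly above base input tokens);
# repeat calls shortly afterwards read it back at a fraction of that price.
print(ask("Summarize the key formatting rules."))
print(ask("Which rules apply to code samples?"))
```

Because the cache is ephemeral and refreshed on use, the savings apply to requests made in close succession with an identical cached prefix.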

When to Use Prompt Caching

Prompt caching is particularly effective in situations where a large amount of prompt context needs to be sent once and then referred to repeatedly in subsequent requests. Some examples of such scenarios include:

  • Conversational agents: Caching long instructions or uploaded documents reduces cost and latency across extended conversations.
  • Coding assistants: Keeping a summarized version of the codebase in the prompt improves autocomplete and codebase Q&A.
  • Large document processing: Complete long-form material, including images, can be incorporated into the prompt without increasing response latency.
  • Detailed instruction sets: Extensive lists of instructions, procedures, and examples used to fine-tune Claude’s responses can be shared far more efficiently.
  • Agentic search and tool use: Scenarios involving multiple rounds of tool calls and iterative changes, where each step typically requires a new API call, perform better with a cached prefix.
  • Talking to books, papers, documentation, podcast transcripts, and other long-form content: Embedding one or more entire documents in the prompt lets users ask questions of and converse with that knowledge base (see the sketch after this list).
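For that last scenario, a minimal sketch under the same assumptions as the earlier example (the anthropic Python SDK, the prompt-caching beta header, and a hypothetical moby_dick.txt file) might look like this: the document is sent once as a cacheable content block, and each follow-up question reuses the cached prefix instead of reprocessing the full text.

```python
# Sketch of the "talk to a long document" pattern under the same assumptions
# as the earlier example; moby_dick.txt stands in for any long-form source.
import anthropic

client = anthropic.Anthropic()
book_text = open("moby_dick.txt").read()

def ask_about_book(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": book_text,
                        # Cache the full document; each follow-up question below
                        # reuses this prefix rather than resending the whole book.
                        "cache_control": {"type": "ephemeral"},
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    )
    return response.content[0].text

print(ask_about_book("Who narrates the story?"))
print(ask_about_book("Summarize the chapter on whale classification."))
```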

Benefits of Prompt Caching

Early customers have seen substantial speed and cost improvements with prompt caching for a variety of use cases. For instance:

  • Chatting with a book (100,000-token cached prompt): 79% lower latency and 90% lower cost.
  • Many-shot prompting (10,000-token prompt): 31% lower latency and 86% lower cost.
  • Multi-turn conversation (10 turns with a long system prompt): 75% lower latency and 53% lower cost.

Pricing Model for Cached Prompts

Cached prompts are priced based on the number of input tokens cached and how frequently that content is used. Writing to the cache costs 25% more than the base input token price for any given model, while using cached content is significantly cheaper, costing only 10% of the base input token price.
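To see how these multipliers translate into savings, here is a rough back-of-the-envelope calculation in Python. The 1.25× write and 0.10× read factors come from the pricing rule above; the token counts, request count, and the assumption that every repeat request hits a still-warm cache are illustrative only.

```python
# Back-of-the-envelope cost model for the pricing rule above: cache writes cost
# 125% of the base input price and cache reads cost 10% of it. Token counts,
# request count, and the assumption that every repeat hits a warm cache are
# illustrative only.
def prompt_cost_usd(base_per_mtok: float, cached_tokens: int,
                    fresh_tokens: int, num_requests: int) -> float:
    write = 1.25 * base_per_mtok * cached_tokens / 1_000_000         # first request writes the cache
    reads = 0.10 * base_per_mtok * cached_tokens * (num_requests - 1) / 1_000_000
    fresh = base_per_mtok * fresh_tokens * num_requests / 1_000_000  # uncached input sent every time
    return write + reads + fresh

# A 100,000-token cached context plus 500 fresh tokens per request, 50 requests,
# at Claude 3.5 Sonnet's $3/MTok base input price.
with_cache = prompt_cost_usd(3.00, cached_tokens=100_000, fresh_tokens=500, num_requests=50)
without_cache = 3.00 * (100_000 + 500) * 50 / 1_000_000
print(f"cached:   ${with_cache:.2f}")     # about $1.92
print(f"uncached: ${without_cache:.2f}")  # about $15.08
```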

Claude Models and Pricing

Anthropic offers three Claude models: Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku. Each model has a unique set of features and pricing structures:

  • Claude 3.5 Sonnet: The most intelligent model to date, with a 200K context window, priced at $3/MTok for input, $3.75/MTok for cache write, and $0.30/MTok for cache read.
  • Claude 3 Opus: A powerful model for complex tasks, with a 200K context window, priced at $15/MTok for input, with prompt caching coming soon.
  • Claude 3 Haiku: The fastest and most cost-effective model, with a 200K context window, priced at $0.25/MTok for input, $0.30/MTok for cache write, and $0.03/MTok for cache read.

Customer Spotlight: Notion

Notion is integrating prompt caching into Claude-powered features for its AI assistant, Notion AI. The reduced cost and increased speed let Notion optimize internal operations and deliver a more polished, responsive experience for its customers. As Simon Last, Co-founder at Notion, notes, “We’re excited to use prompt caching to make Notion AI faster and cheaper, all while maintaining state-of-the-art quality.”
