Anthropic, a leading artificial intelligence company, has introduced prompt caching on its API, enabling developers to cache frequently used context between API calls. This innovation allows customers to provide Claude, Anthropic’s AI model, with more background knowledge and example outputs while reducing costs by up to 90% and latency by up to 85% for long prompts. Prompt caching is now available in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for Claude 3 Opus coming soon.
This technology has far-reaching implications for applications such as conversational agents, coding assistants, large document processing, and detailed instruction sets. Early customers such as Notion have already seen substantial speed and cost improvements; Notion is integrating prompt caching into its AI assistant to optimize internal operations and enhance the user experience. Simon Last, Co-founder at Notion, praised the technology, stating that it will make Notion AI “faster and cheaper” while maintaining state-of-the-art quality.
Prompt Caching: Revolutionizing Conversational AI with Claude
Prompt caching is a new feature available on the Anthropic API that enables developers to cache frequently used context between API calls. With it, customers can provide Claude with more background knowledge and example outputs while significantly reducing cost and latency for long prompts.
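By way of illustration, marking a reusable prompt prefix for caching with the Python SDK might look roughly like the sketch below. The model ID, beta header, and cache_control field reflect the public beta as documented at launch, and LONG_REFERENCE_DOCUMENT is a placeholder for your own content; check the current API reference before relying on the exact syntax.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder: any long, stable context you want to reuse across requests.
LONG_REFERENCE_DOCUMENT = open("reference.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},  # beta header at launch
    system=[
        {
            "type": "text",
            "text": LONG_REFERENCE_DOCUMENT,
            # Everything up to and including this block becomes the cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key points of this document."}],
)

# The usage object reports how many input tokens were written to or read from the cache.
print(response.usage)
```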
When to Use Prompt Caching
Prompt caching is particularly effective in situations where a large amount of prompt context needs to be sent once and then referred to repeatedly in subsequent requests. Some examples of such scenarios include:
- Conversational agents: Reduce cost and latency for extended conversations by caching long instructions or uploaded documents.
- Coding assistants: Improve autocomplete and codebase Q&A by keeping a summarized version of the codebase in the prompt.
- Large document processing: Incorporate complete long-form material, including images, in the prompt without increasing response latency.
- Detailed instruction sets: Share extensive lists of instructions, procedures, and examples to fine-tune Claude’s responses.
- Agentic search and tool use: Enhance performance for scenarios involving multiple rounds of tool calls and iterative changes, where each step typically requires a new API call.
- Talking to books, papers, documentation, podcast transcripts, and other long-form content: Embed the entire document(s) in the prompt so users can ask questions of the knowledge base (see the sketch after this list).
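As a concrete illustration of the “talk to long-form content” pattern, the sketch below sends the same cached document prefix with every question and inspects the usage fields to confirm cache writes and reads. The file name, model ID, beta header, and usage field names are assumptions based on the beta documentation, not a definitive recipe.

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical long-form source to "talk to"; any transcript, book, or manual
# that fits in the model's context window would work the same way.
document = open("podcast_transcript.txt").read()

def ask(question: str) -> str:
    """Ask a question against the same cached document prefix."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
        system=[
            {
                "type": "text",
                "text": document,
                "cache_control": {"type": "ephemeral"},  # identical prefix on every call
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    # The first call should report cache_creation_input_tokens (a cache write);
    # follow-up calls within the cache lifetime should report cache_read_input_tokens.
    print(response.usage)
    return response.content[0].text

print(ask("What are the three main arguments made in this transcript?"))
print(ask("Which of those arguments are supported by cited data?"))
```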
Benefits of Prompt Caching
Early customers have seen substantial speed and cost improvements with prompt caching for a variety of use cases. For instance:
- Chatting with a book (100,000 token cached prompt): Latency reduction by 79% and cost reduction by 90%.
- Many-shot prompting (10,000 token prompt): Latency reduction by 31% and cost reduction by 86%.
- Multi-turn conversation (10 turns with a long system prompt): Latency reduction by 75% and cost reduction by 53%.
Pricing Model for Cached Prompts
Cached prompts are priced based on the number of input tokens cached and how frequently that content is used. Writing to the cache costs 25% more than the base input token price for any given model, while using cached content is significantly cheaper, costing only 10% of the base input token price.
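As a rough back-of-the-envelope check of this pricing model, the snippet below compares the cost of a 100,000-token prefix reused across ten requests, using the Claude 3.5 Sonnet input prices listed in the next section and ignoring output tokens and any uncached portion of the prompt.

```python
# Back-of-the-envelope cost comparison for a 100,000-token cached prefix
# reused across 10 requests, at Claude 3.5 Sonnet input pricing.
BASE_INPUT = 3.00                 # $ per million input tokens (MTok)
CACHE_WRITE = BASE_INPUT * 1.25   # 25% premium when writing the cache -> $3.75/MTok
CACHE_READ = BASE_INPUT * 0.10    # 10% of base price on cache hits    -> $0.30/MTok

prefix_mtok = 0.1   # 100,000 tokens expressed in millions of tokens
requests = 10

without_cache = requests * prefix_mtok * BASE_INPUT
with_cache = prefix_mtok * CACHE_WRITE + (requests - 1) * prefix_mtok * CACHE_READ

print(f"Without caching: ${without_cache:.2f}")  # ~$3.00 for the prefix alone
print(f"With caching:    ${with_cache:.2f}")     # ~$0.65 (one write + nine reads)
```

Output tokens and any uncached portion of each request are billed as usual, so real-world savings depend on how much of the prompt the cached prefix covers.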
Claude Models and Pricing
Anthropic offers three Claude models: Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku. Each model has its own capabilities and pricing:
- Claude 3.5 Sonnet: The most intelligent model to date, with a 200K context window, priced at $3/MTok for input, $3.75/MTok for cache write, and $0.30/MTok for cache read.
- Claude 3 Opus: A powerful model for complex tasks, with a 200K context window, priced at $15/MTok for input, with prompt caching coming soon.
- Claude 3 Haiku: The fastest and most cost-effective model, with a 200K context window, priced at $0.25/MTok for input, $0.30/MTok for cache write, and $0.03/MTok for cache read.
Customer Spotlight: Notion
Notion is integrating prompt caching into Claude-powered features for its AI assistant, Notion AI. By leveraging reduced costs and increased speed, Notion aims to optimize internal operations and create a more elevated and responsive user experience for their customers. As Simon Last, Co-founder at Notion, notes, “We’re excited to use prompt caching to make Notion AI faster and cheaper, all while maintaining state-of-the-art quality.”
