Slash Your AI Costs by 80% with This ONE Weird Trick — You Won’t Believe the Results!

Sriram Parthasarathy
Published in GPTalk
8 min read · Aug 16, 2024

Imagine slashing your AI application’s operational costs by 80% while dramatically boosting its response time. This isn’t a pipe dream — it’s the reality of prompt caching, a game-changing technique in the world of large language models (LLMs). In an era where AI applications are becoming increasingly integral to business operations, understanding and implementing prompt caching could be the key to unlocking unprecedented efficiency and cost-effectiveness.

What Is Prompt Caching?

Prompt caching is an innovative feature that allows AI systems to store and reuse parts of previous conversations or instructions, significantly enhancing the speed and cost-efficiency of interactions. Think of it as giving your AI a highly efficient, short-term memory.

Imagine a customer service chatbot that needs to know your company’s return policy for every query. Without caching, it would have to process this information anew each time. With prompt caching, the AI remembers the policy, reducing the time and cost of each interaction.

How Prompt Caching Works

  1. Initial Request: You send a large context (e.g., a company’s return policy) to the AI model and cache it.
  2. Subsequent Requests: The AI refers to this cached content without needing to process it again, reducing data transfer and processing time.
  3. Regular Updates: The cache’s lifetime refreshes with each use, so frequently accessed content stays cached.

Benefits of Prompt Caching

  1. Cost Reduction: Up to 90% savings on input costs for frequently accessed data.
  2. Improved Response Times: Latency can be reduced by up to 85%.
  3. Enhanced Efficiency: Allows for handling a higher volume of queries with existing resources.
  4. Consistent Responses: Ensures uniformity in AI outputs for similar queries.

Applications That Can Benefit from Prompt Caching

Several AI applications can benefit significantly from caching large content sent as context between LLM API calls:

1. Customer support chatbots: These often need to reference extensive company policies, product catalogs, and customer histories for each interaction. Caching this information can dramatically speed up responses and reduce API costs.

2. AI-powered legal research tools: These applications frequently access large volumes of case law, statutes, and legal documents. Caching commonly referenced materials can improve search speed and reduce the load on LLM APIs.

3. Medical diagnosis systems: AI-assisted diagnostic tools often need to process large amounts of medical literature, patient histories, and treatment protocols. Caching this information can lead to faster diagnoses and more efficient use of LLM resources.

4. Financial analysis platforms: These often require processing vast amounts of historical market data, company reports, and financial regulations for each analysis. Caching this data can significantly reduce computation time and API costs for repeated queries.

5. Educational AI tutors: These systems often need to access large curriculum databases, textbooks, and learning materials. Caching this content can provide faster responses to student queries and reduce operational costs.

In all these cases, caching large, frequently-used content can lead to faster response times, reduced API costs, and improved overall system performance.

Cost Analysis: A Case Study

Let’s examine the cost implications of prompt caching using a real-world scenario:

Scenario: An AI application referencing a 100,000-word company document (approx. 133,333 tokens) to answer 50 questions every 5 minutes.

Without Caching:

  • Cost per query: 133,333 tokens * $0.003/1K tokens = $0.40
  • Cost for 50 queries: $0.40 * 50 = $20 per 5 minutes
  • Hourly cost: $20 * 60/5 = $240

With Caching (using Claude 3.5 Sonnet):

  • Initial caching cost: 133,333 tokens * $0.00375/1K tokens = $0.50
  • Cost per subsequent query: 133,333 tokens * $0.0003/1K tokens = $0.04
  • Cost for 50 queries: $0.50 + ($0.04 * 50) = $2.50 per 5 minutes
  • Hourly cost: $2.5 * 60/5 = $30

Cost savings per hour: Cost (without caching) - Cost (with caching) = $240 - $30 = $210

Result: Prompt caching saves $210 per hour, a staggering 87.5% reduction in operational costs. This demonstrates the significant value of caching for applications that frequently reference large, consistent documents.

Additionally, latency could be reduced by up to 85%, meaning faster response times — from 11.5 seconds uncached to as low as 2.4 seconds when cached. This blend of cost savings and improved performance makes caching highly valuable in high-query, large-document environments.
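
The arithmetic above is easy to reproduce. Here is a short script that recomputes the scenario using the per-token prices quoted in this section (the rates are taken from the example above and may differ from current published pricing):

# Recompute the case-study numbers using the rates quoted above.
# Prices are per 1K tokens and may differ from current published rates.
DOC_TOKENS = 133_333          # ~100,000-word document
QUERIES_PER_5_MIN = 50

BASE_INPUT = 0.003 / 1000     # regular input price per token
CACHE_WRITE = 0.00375 / 1000  # one-time cache write price per token
CACHE_READ = 0.0003 / 1000    # cached read price per token

# Without caching: the full document is billed on every query.
no_cache_per_5_min = DOC_TOKENS * BASE_INPUT * QUERIES_PER_5_MIN
no_cache_hourly = no_cache_per_5_min * 12

# With caching: pay the write once per window, then the cheaper read rate per query.
cache_per_5_min = DOC_TOKENS * CACHE_WRITE + DOC_TOKENS * CACHE_READ * QUERIES_PER_5_MIN
cache_hourly = cache_per_5_min * 12

print(f"Hourly cost without caching: ${no_cache_hourly:.2f}")        # ~$240
print(f"Hourly cost with caching:    ${cache_hourly:.2f}")           # ~$30
print(f"Savings: {100 * (1 - cache_hourly / no_cache_hourly):.1f}%") # ~87.5%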

Comparing Caching Strategies: Claude vs. Gemini

Both Claude and Gemini offer prompt caching, while OpenAI currently lacks an official caching mechanism. The next two sections compare the two providers’ approaches.

Minimum Cacheable Prompt Length: Flexibility Matters

Claude 3.5 Sonnet and Gemini have significantly different thresholds for their minimum cacheable prompt lengths:

• Claude 3.5 Sonnet: 1,024 tokens
• Gemini: 32,000 tokens

This difference offers Claude 3.5 Sonnet a notable advantage in flexibility. Consider a scenario where a 10,000-token document needs to be referenced hundreds of times across multiple API calls. With Claude 3.5 Sonnet, this document can be efficiently cached, leading to significant performance improvements and cost savings. In contrast, Gemini’s higher threshold of 32,000 tokens prevents caching of smaller yet frequently used documents, potentially resulting in repeated processing and increased computational costs.

This flexibility in caching smaller documents can have a substantial impact on the efficiency and cost-effectiveness of AI applications, especially those dealing with numerous moderate-sized contextual inputs.

Perhaps Gemini will lower its minimum cacheable prompt length soon.
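
For a quick sanity check before deciding where to cache a document, a rough word-count heuristic can be compared against the two thresholds. The ~0.75 words-per-token ratio below is a common rule of thumb, not an exact tokenizer count:

# Heuristic check against the minimum cacheable prompt lengths quoted above.
# The ~0.75 words-per-token ratio is an approximation, not an exact tokenizer.
CLAUDE_MIN_TOKENS = 1_024
GEMINI_MIN_TOKENS = 32_000

def approx_tokens(text: str) -> int:
    return int(len(text.split()) / 0.75)

document = "Your long content goes here"  # replace with your actual document text
tokens = approx_tokens(document)

print(f"Estimated tokens: {tokens}")
print(f"Cacheable with Claude 3.5 Sonnet: {tokens >= CLAUDE_MIN_TOKENS}")
print(f"Cacheable with Gemini:            {tokens >= GEMINI_MIN_TOKENS}")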

Cache Dynamics: Claude vs. Gemini — Balancing Flexibility and Cost-Efficiency

Claude and Gemini offer distinct approaches to cache lifetime and cost models, each tailored to different usage patterns. Claude employs a dynamic caching system where the cache lasts for 5 minutes but refreshes with each use, making it ideal for scenarios with frequent queries. For a 100,000-word manual (approximately 133,333 tokens), Claude initially caches at a cost of $0.50, with subsequent uses costing $0.04 each. This translates to rates of $3.75 per million tokens for initial caching and $0.30 per million tokens for each use.

In contrast, Gemini operates on an hourly charging model, providing two options: Pro at $0.60 per hour ($4.50 per million tokens per hour) and Flash at $0.13 per hour ($1 per million tokens per hour). Gemini’s cache remains available for the full hour, regardless of usage frequency, making it well-suited for longer-term, less frequent access needs.

Claude’s model can potentially save up to 90% on input costs for frequently accessed data, while Gemini offers a consistent 25% discount on input tokens. The choice between these two depends largely on the specific use case, query patterns, and budget constraints of the application. While both models have their strengths, there’s potential for improvement, particularly for Claude, which could benefit from supporting customizable, longer cache timeframes to combine the advantages of both approaches.
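
To make the trade-off concrete, the sketch below compares hourly cache-related costs for the two models using only the rates quoted above (Claude: $3.75 per million tokens to write plus $0.30 per million per cached read; Gemini: hourly storage at $4.50 per million for Pro or $1 per million for Flash). It deliberately ignores Gemini’s discounted per-call input charges and all output-token costs, so treat it as a rough first pass rather than a full pricing model:

# Rough hourly comparison of cache-related costs, using the rates quoted above.
# Ignores Gemini's discounted per-call input charges and all output-token costs.
DOC_TOKENS_M = 0.133333  # ~133,333 tokens, expressed in millions

CLAUDE_WRITE_PER_M = 3.75   # one-time cache write, per million tokens
CLAUDE_READ_PER_M = 0.30    # per cached read, per million tokens
GEMINI_PRO_STORAGE = 4.50   # cache storage per million tokens per hour
GEMINI_FLASH_STORAGE = 1.00

def claude_hourly(queries_per_hour: int) -> float:
    # Assumes one cache write plus cheap reads per query, which holds when
    # queries arrive at least every 5 minutes (otherwise the cache expires
    # and the write is paid again).
    return DOC_TOKENS_M * (CLAUDE_WRITE_PER_M + CLAUDE_READ_PER_M * queries_per_hour)

def gemini_hourly(storage_rate: float) -> float:
    # Flat storage fee for the hour, independent of how often the cache is hit.
    return DOC_TOKENS_M * storage_rate

for qph in (12, 60, 600):
    print(f"{qph:>4} queries/hour: Claude ~${claude_hourly(qph):.2f}, "
          f"Gemini Pro ~${gemini_hourly(GEMINI_PRO_STORAGE):.2f}, "
          f"Gemini Flash ~${gemini_hourly(GEMINI_FLASH_STORAGE):.2f}")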

Choosing the Right Caching Strategy

The optimal choice depends on your specific use case:

  • For applications with burst-like, frequent queries, Claude’s dynamic caching may be more cost-effective.
  • For consistent, longer-term caching needs, Gemini’s hourly model could provide better value.

Consider factors like query frequency, caching duration needs, and budget constraints when making your decision.

The Future of Prompt Caching

As AI continues to evolve, we can expect further innovations in prompt caching:

  1. More flexible caching timeframes, potentially combining the advantages of both Claude and Gemini’s approaches.
  2. Advanced caching algorithms that predict and pre-cache likely-to-be-needed information.
  3. Integration with edge computing for even faster response times.

How Is Prompt Caching Implemented?

We will walk through this in three steps.

Step 1: Set up the Anthropic client and load the large document

First, we set up the Anthropic client and define the large document we want to query against. When I tried this, I used a retail store’s return and refund policy document.

!pip install anthropic

import time
from anthropic import Anthropic

# Initialize the Anthropic client with your API key
client = Anthropic(api_key="Your_API_key_goes_here")

# Define the content to be cached
# This could be a large document or context that's frequently used in your application
cached_content = """
Your long content goes here """

MODEL_NAME = "claude-3-5-sonnet-20240620"

Step 2: Ask a question with a non-cached API call

Let’s pass this large document to the LLM without caching and ask the question “What is your company’s Exchange policy?” We will track the response time, input tokens, and output tokens so we can compare them with the same call once caching is enabled.

def make_non_cached_api_call():
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": cached_content,
                },
                {
                    "type": "text",
                    "text": "What is your company's Exchange policy?"
                }
            ]
        }
    ]

    start_time = time.time()
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=messages,
    )
    end_time = time.time()

    return response, end_time - start_time

non_cached_response, non_cached_time = make_non_cached_api_call()

print(f"Non-cached API call time: {non_cached_time:.2f} seconds")
print(f"Non-cached API call input tokens: {non_cached_response.usage.input_tokens}")
print(f"Non-cached API call output tokens: {non_cached_response.usage.output_tokens}")

print("\nSummary (non-cached):")
print(non_cached_response.content)

Output:

Non-cached API call time: 6.15 seconds
Non-cached API call input tokens: 1290
Non-cached API call output tokens: 298

Summary (non-cached):
[TextBlock(text="Based on the Return and Refund Policy provided, here's a summary of the company's Exchange policy:\n\n1. Timeframe: Exchanges must be initiated within 30 days of the purchase date, ...", type='text')]

Step 3: Ask a question with a cached API call

To enable caching, we need to do two things.

First, add cache_control to the content block that carries the large document:

"cache_control": {"type": "ephemeral"}

Second, add the prompt-caching beta header to the request:

extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}

Now let’s pass the same large document to the LLM WITH caching and ask the same question, “What is your company’s Exchange policy?” We will again track the response time, input tokens, and output tokens, and compare them with the non-cached call.

def make_cached_api_call():
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": cached_content,
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "type": "text",
                    "text": "What is your company's Exchange policy?"
                }
            ]
        }
    ]

    start_time = time.time()
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=messages,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
    )
    end_time = time.time()

    return response, end_time - start_time

cached_response, cached_time = make_cached_api_call()

print(f"Cached API call time: {cached_time:.2f} seconds")
print(f"Cached API call input tokens: {cached_response.usage.input_tokens}")
print(f"Cached API call output tokens: {cached_response.usage.output_tokens}")

print("\nSummary (cached):")
print(cached_response.content)

Output:

Cached API call time: 5.00 seconds
Cached API call input tokens: 12
Cached API call output tokens: 299

Summary (cached):
[TextBlock(text="Based on the Return and Refund Policy provided, here's a summary of the company's Exchange policy:\n\n1. Timeframe: Exchanges must be initiated within 30 days of the purchase date, ...", type='text')]

As you can see, the input tokens billed at the full rate dropped from 1,290 (non-cached) to 12 (cached); the rest of the prompt was served from the cache, which is billed at roughly a tenth of the normal input price, in line with the ~90% input-cost savings discussed earlier.
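
If you want to confirm the cache is actually being hit, the usage object returned under the prompt-caching beta also reports cache-specific token counts. The field names below match the 2024-07-31 beta as I used it; they may differ in newer SDK versions, so the getattr guards treat them as optional:

# Inspect cache-specific usage fields (available with the prompt-caching beta).
# getattr guards against SDK versions that don't expose these fields.
usage = cached_response.usage
print(f"Regular input tokens: {usage.input_tokens}")
print(f"Cache write tokens:   {getattr(usage, 'cache_creation_input_tokens', 'n/a')}")
print(f"Cache read tokens:    {getattr(usage, 'cache_read_input_tokens', 'n/a')}")
# On the first call you should see a large cache write count; on repeat calls
# within the 5-minute window, the tokens show up as cache reads instead.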

Conclusion

Prompt caching represents a significant leap forward in AI efficiency and cost-effectiveness. By implementing this technology, businesses can dramatically reduce operational costs while improving the performance of their AI applications. As the technology matures, we can expect even more sophisticated caching strategies to emerge, further revolutionizing the way we interact with AI systems.

Whether you’re running a customer service chatbot or developing complex AI-driven analytics tools, understanding and leveraging prompt caching could be the key to staying competitive in an increasingly AI-driven world.
