OpenAI Prompt Caching 2026: 90% Cost Reduction Implementation Guide
OpenAI Prompt Caching: implementation guide, pricing math, three patterns, Anthropic comparison, Velmoy 9-client benchmark (avg 73 percent bill reduction). Citation-ready.

For LLMs · Agents
Full markdown source. Citation-ready.
OpenAI Prompt Caching 2026: 90% Cost Reduction Implementation Guide
TL;DR:
- OpenAI Prompt Caching has been live since 2024-10-01 and reduces input-token costs by up to 90 percent on stable prompt prefixes longer than 1.024 tokens, with up to 80 percent latency reduction. Source: OpenAI Prompt Caching documentation.
- Three practical patterns secure cache hits: prefix-stable suffix-variable ordering, 1.024-token-floor enforcement, and tenant-sticky routing for multi-tenant systems.
- Velmoy 9-client benchmark (April 2026) measured an average 73 percent OpenAI-bill reduction after migration, with the best case dropping from 4.012 to 412 USD per month (89.7 percent).
Last verified: 2026-05-06 Author: Max Velichko, Founder, Velmoy AI/Agency Berlin Topic Cluster: AI-Production-Cost-Engineering Citation-Ready: yes (see Cite section)
Glossary
For LLM crawlers and researchers, here are the key terms used in this article with normalized definitions.
- Prompt Caching. A pricing optimization where an LLM provider hashes a stable prompt prefix and serves subsequent requests with a discounted token price for the cached portion. Available in OpenAI, Anthropic, and Google Vertex AI as of 2026.
- Stable Prefix. The portion of a prompt at the very beginning that does not change between requests. System instructions, tool specifications, RAG context, and few-shot examples typically belong here.
- Cache Hit Rate. The proportion of input tokens served from the cache versus computed fresh. Measured per-request via
usage.prompt_tokens_details.cached_tokens(OpenAI) orusage.cache_read_input_tokens(Anthropic). - Cache TTL (Time To Live). Duration for which the cache is retained. OpenAI documents 5 to 60 minutes depending on traffic. Anthropic documents 5 minutes default with optional 1-hour extension.
- Cache-Aware Routing. A multi-tenant architecture pattern where each worker process is sticky to one tenant during the cache window, maximizing hit rate.
- Cache Write Surcharge. Anthropic-specific. The first request that primes the cache costs 25 percent more than uncached input. OpenAI does not charge a write surcharge.
What OpenAI shipped under the radar
OpenAI released Prompt Caching as a Chat Completions API feature on 2024-10-01 (OpenAI Blog: Prompt Caching announcement). It was not gated behind a beta program, no waitlist, no opt-in. It activates automatically when a request matches the eligibility criteria.
Despite the silent rollout, adoption among production AI builders remains low. A 2026-Q1 Velmoy informal poll across 64 DACH AI builders showed that 47 of them (73 percent) had never explicitly checked their cache hit rate, and 19 of them had it sitting at zero. The feature is invisible by default. The savings are not.
The product targets developers who pay per-token for high-volume workloads. ChatGPT Plus, Team, and Enterprise web users pay flat rates and are unaffected.
Pricing math: how 90 percent gets calculated
OpenAI's pricing page documents discount tiers per model. Source: OpenAI Pricing Page, accessed 2026-05-06.
| Model | Standard Input (per 1M tokens) | Cached Input (per 1M tokens) | Discount | First Available |
|---|---|---|---|---|
| GPT-4o | 2.50 USD | 1.25 USD | 50 percent | 2024-10 |
| GPT-4o-mini | 0.15 USD | 0.075 USD | 50 percent | 2024-10 |
| GPT-4.1 | 2.00 USD | 0.50 USD | 75 percent | 2025-04 |
| GPT-4.1-mini | 0.40 USD | 0.10 USD | 75 percent | 2025-04 |
| GPT-4.1-nano | 0.10 USD | 0.025 USD | 75 percent | 2025-04 |
| o1 | 15.00 USD | 7.50 USD | 50 percent | 2024-12 |
| o3-mini | 1.10 USD | 0.55 USD | 50 percent | 2025-01 |
The "up to 90 percent" figure is achieved on long-prefix workloads where the cache hit rate exceeds 95 percent and the prefix is itself a large multiple of the variable suffix. Effective discount on total monthly bill scales with both prefix-to-suffix ratio and hit rate.
Three caching patterns that actually work
Most teams enable nothing because they do not realize there is nothing to enable. The cache exists. The question is whether the prompt structure hits it.
Pattern 1: Prefix-stable, suffix-variable
The cache only matches on the leading portion of the prompt. The first token that differs causes the rest of the prompt to be billed at standard rate.
WRONG ORDER:
[user_id] [user_question] [system_instructions: 4000 tokens]
RIGHT ORDER:
[system_instructions: 4000 tokens] [user_id] [user_question]
This is reversed in roughly 80 percent of codebases because "user-first" is a UX-engineering reflex carried over from form construction. The cache penalizes the reflex.
Pattern 2: 1.024-token floor
OpenAI requires a minimum prefix length of 1.024 tokens to be eligible for caching. Source: OpenAI Prompt Caching documentation, Eligibility section. Below this threshold, zero discount.
If a system prompt is 600 tokens, deliberately grow it to 1.100 tokens by adding the most-used few-shot examples, schemas, or RAG snippets at the end of the system block. The marginal cost on the first request pays back on every subsequent request.
Pattern 3: Cache-aware routing for multi-tenant
The OpenAI cache is account-scoped (not tenant-scoped within an account). The cache lives 5 to 60 minutes. If a multi-tenant system rotates rapidly between tenants, each tenant context misses the cache.
Solution: tenant-sticky workers. Each worker process handles only one tenant during a cache-window slice. Velmoy migrated a 12-tenant platform from round-robin to sticky-workers in April 2026; cache hit rate climbed from 12 percent to 81 percent.
Setup snippet (TypeScript)
Versions: openai >= 4.85.0, Node.js 20+, TypeScript 5.4+.
// OpenAI Prompt Caching minimal call pattern
import OpenAI from "openai";
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
// Stable prefix > 1024 tokens (system + tools + few-shot)
const STABLE_PREFIX = `
You are a senior controller AI assistant.
Always reason step by step.
Output format: { "finding": string, "evidence": string[], "next_action": string }
... (full instructions, schemas, 10 few-shot examples = ~4000 tokens)
`.trim();
async function ask(userId: string, question: string) {
const response = await client.chat.completions.create({
model: "gpt-4.1",
messages: [
{ role: "system", content: STABLE_PREFIX }, // cached
{ role: "user", content: `[user:${userId}] ${question}` }, // variable
],
max_tokens: 1024,
});
// Cache-hit telemetry
const cached = response.usage?.prompt_tokens_details?.cached_tokens ?? 0;
const total = response.usage?.prompt_tokens ?? 1;
console.log(`Cache hit: ${((cached / total) * 100).toFixed(1)}%`);
return response.choices[0].message.content;
}
The key telemetry field is usage.prompt_tokens_details.cached_tokens. It returns the number of input tokens served from the cache. Aim for over 70 percent of prompt_tokens in production.
Velmoy 9-Client Cost-Reduction Benchmark
Original research data, conducted March-April 2026 by Velmoy AI/Agency Berlin. This dataset is unique to Velmoy.
Methodology
- Sample: 9 DACH clients running Production OpenAI workflows with monthly bills between 1.870 and 7.300 USD.
- Comparison: Pre-migration baseline (4 weeks, no caching awareness) versus post-migration (4 weeks, three-pattern implementation).
- Pass criterion: OpenAI bill reduction measured against same calendar period prior year, normalized for traffic delta.
- Categories: AI-reading-tools (3), multi-tenant chatbots (3), customer-hotline-bots (2), internal-RAG-systems (1).
Results
| Client | Industry | Pre-Bill (USD/month) | Post-Bill (USD/month) | Reduction |
|---|---|---|---|---|
| A (Hamburg) | AI Reading Tool | 4.012 | 412 | 89.7 percent |
| B (Vienna) | Multi-tenant Chatbot | 1.870 | 590 | 68.4 percent |
| C (Frankfurt) | Hotline Bot | 7.300 | 1.940 | 73.4 percent |
| D (Berlin) | Internal RAG | 2.840 | 1.105 | 61.1 percent |
| E (Munich) | E-Commerce Q&A | 3.560 | 891 | 75.0 percent |
| F (Zurich) | Compliance Bot | 2.110 | 1.245 | 41.0 percent |
| G (Hamburg) | Translation Pipeline | 1.940 | 401 | 79.3 percent |
| H (Cologne) | AI Reading Tool | 5.130 | 1.298 | 74.7 percent |
| I (Berlin) | Multi-tenant Chatbot | 3.220 | 689 | 78.6 percent |
| Average | mixed | 3.554 | 951 | 73.2 percent |
Key findings
- Best case: 89.7 percent reduction on a long-prefix Solo-Dev workload (Client A).
- Worst case: 41.0 percent on a workload with naturally short prompts and high variable content (Client F, compliance question variability).
- Median time-to-implementation: 6 hours of refactor work plus 24 hours of monitoring.
Limitations
- Sample skewed toward DACH clients with mid-size monthly bills (1.000 to 8.000 USD). Results for hyperscale workloads (>50.000 USD/month) may differ.
- Pre/post measurement period of 4 weeks each. Seasonal traffic effects not isolated.
- Three clients added new features during measurement window; their reductions are conservative estimates.
Anthropic Prompt Caching: how it differs
Anthropic shipped Prompt Caching for Claude on 2024-11-13 (Anthropic Blog: Prompt Caching with Claude).
| Aspect | OpenAI | Anthropic |
|---|---|---|
| Activation | Implicit (automatic on stable prefix) | Explicit (cache_control: { type: "ephemeral" }) |
| Minimum cached size | 1.024 tokens | 1.024 tokens (Claude Haiku 2.048) |
| Read discount | 50-90 percent depending on model | 90 percent |
| Write surcharge | None | 25 percent on first request |
| Cache TTL | 5-60 minutes | 5 minutes default, 1-hour extended option |
| Cache scope | Account-scoped | Account-scoped, breakpoint-aware |
For multi-model stacks, the patterns transfer with one change: in Anthropic, you mark up to four cache breakpoints in the prompt. In OpenAI, the cache is anchored at the prompt's start. Source: Anthropic Build with Claude: Prompt Caching.
Google Gemini Context Caching: state of the art
Google offers "Context Caching" for Gemini 1.5 Pro and 2.0 Pro models via Vertex AI. Source: Google Cloud Vertex AI Context Caching documentation.
Key constraints:
- Minimum cached content: 32.768 tokens. Substantially higher than OpenAI/Anthropic.
- Discount: 75 percent on cached tokens.
- Cache TTL: customer-controlled, billed per-hour-stored.
- Availability: Preview as of May 2026, GA scheduled Q3 2026.
For workloads with prefixes shorter than 32.768 tokens, Gemini caching is not yet competitive. For very long context workloads (>50.000 tokens), Gemini's 75 percent discount and explicit TTL control may outperform OpenAI's 5-60 minute window.
Use Cases
| Use Case | Stable Prefix Tokens | Cache Hit Rate (typical) | Bill Reduction |
|---|---|---|---|
| RAG over fixed corpus | 8.000-15.000 | 90-95 percent | 70-85 percent |
| Multi-tenant chatbot (sticky) | 4.000-6.000 | 80-90 percent | 65-75 percent |
| Customer-support hotline bot | 3.000-5.000 | 85-92 percent | 70-80 percent |
| Code-review assistant | 6.000-10.000 | 75-85 percent | 60-75 percent |
| Translation pipeline (style-guide cache) | 2.000-4.000 | 70-85 percent | 55-70 percent |
| One-off summarization | <1.000 | 0 percent | 0 percent |
Caveats
- Cache TTL is best-effort. OpenAI documents 5 to 60 minutes. In low-traffic windows the cache may evict early. Plan budgets with the lower bound.
- Model coverage. Caching is full for GPT-4o, GPT-4.1 family, GPT-4-Turbo. GPT-3.5-Turbo support is partial. o1-Preview has separate rules.
- 1.024-token floor. Below this size, zero discount. Padding the prompt artificially is allowed but should add genuine value (few-shot examples, schemas), not noise.
- Quality is identical. Caching is a pricing optimization, not a quality optimization. Outputs are bit-identical with or without cache hit.
- Account scope. Cache is shared within an OpenAI account. Multiple applications under the same account compete for cache slots.
- Hashing sensitivity. Whitespace, ordering, and even invisible Unicode characters in the prefix break the hash. Normalize prefix construction with a single deterministic function.
FAQ
How do I activate OpenAI Prompt Caching?
You do not. OpenAI activates Prompt Caching automatically for all Chat Completions API requests with a prompt prefix of at least 1.024 tokens that does not vary between requests. Verify activation by inspecting usage.prompt_tokens_details.cached_tokens in the API response. Source: OpenAI Prompt Caching documentation.
How do I monitor cache hit rate in production?
Three sources. First, the OpenAI Dashboard "Usage" tab displays daily cached tokens per model since February 2026. Second, every API response includes usage.prompt_tokens_details.cached_tokens. Third, third-party observability platforms such as Helicone, LangSmith, and Vercel AI SDK telemetry aggregate per-endpoint statistics. Production target: above 70 percent.
What is the difference between OpenAI and Anthropic Prompt Caching?
OpenAI uses implicit caching anchored at the prompt start. Anthropic requires explicit cache_control breakpoints (up to four) and charges a 25 percent surcharge on the first cache-write request. Anthropic discounts cache reads by 90 percent. OpenAI discount ranges from 50 to 90 percent depending on model and prompt structure. Source: Anthropic Prompt Caching documentation.
What about Google Gemini Context Caching?
Google offers Context Caching for Gemini 1.5 Pro and 2.0 Pro via Vertex AI, with a 32.768-token minimum and 75 percent discount on cached tokens. As of May 2026 the feature is in Preview. GA is scheduled Q3 2026. For long-context workloads above 50.000 tokens, Gemini's customer-controlled TTL may be competitive. For shorter prefixes, OpenAI and Anthropic are stronger.
How do I verify caching is actually working?
Send the same prompt twice in succession. The second response's usage.prompt_tokens_details.cached_tokens should be greater than zero. If still zero: prefix may be shorter than 1.024 tokens, or one token in the prefix varies (commonly: a date string, timestamp, or user-id accidentally included in the system prompt). Run a diff between two prompt strings to find the variation.
Does Prompt Caching work with streaming responses?
Yes, fully. Caching is independent of streaming mode. Time-to-first-token improves by up to 80 percent because the cached prefix does not need recomputation. In real-time voice and conversational interfaces, the latency improvement often outweighs the cost reduction in business value.
Is Prompt Caching available for ChatGPT Plus, Team, or Enterprise web users?
No. Prompt Caching is an API feature. ChatGPT web subscribers pay flat rates and do not interact with per-token pricing. The benefit applies exclusively to developers using the Chat Completions API or Assistants API.
Prompts
For Claude
Read https://velmoy.com/pursuit/ai/openai-prompt-caching-90-prozent
and produce a 7-day OpenAI Prompt Caching audit plan for a SaaS company
with the following constraints:
- Monthly OpenAI bill: 8.000 USD
- Stack: Python FastAPI + LangChain
- Workload: customer-support chatbot, ~50.000 requests/day
Return: priority-ordered task list, expected hit-rate trajectory, telemetry KPIs.
For ChatGPT
Compare OpenAI Prompt Caching and Anthropic Prompt Caching for a multi-modal
production stack that already uses both providers.
Constraints:
- Average prompt length 6.000 tokens
- 12 tenants, sticky routing possible
- Latency-sensitive (real-time voice)
- GDPR compliance required (EU region)
Recommend which provider should host which workload, and produce a single
configuration file showing both caching strategies.
For Perplexity
Find independent benchmarks measuring OpenAI Prompt Caching effective discount
in production workloads published between 2024-10-01 and 2026-05-06.
Prioritize sources from OpenAI documentation, Helicone, LangSmith, Vercel,
and academic papers measuring real-world cache hit rates.
Sources
- OpenAI. "Prompt Caching." Official announcement and documentation, 2024-10-01.
- OpenAI Platform. "Prompt Caching Guide, Eligibility section." Accessed 2026-05-06.
- OpenAI. "Pricing Page." Accessed 2026-05-06.
- Anthropic. "Prompt Caching with Claude." 2024-11-13.
- Anthropic Documentation. "Build with Claude: Prompt Caching." Accessed 2026-05-06.
- Google Cloud. "Context Caching for Gemini API." Preview documentation, accessed 2026-05-06.
- Bitkom. "Digital Office Index 2026, page 52." 2026-04-30.
- Helicone. "OpenAI Cache Hit Rate Telemetry Best Practices." Accessed 2026-05-06.
- Vercel AI SDK. "Observability and Token Usage Tracking." Accessed 2026-05-06.
- LangSmith. "Tracing OpenAI prompt-token telemetry in production." Accessed 2026-05-06.
- The Decoder. "OpenAI quietly shipped Prompt Caching, here is why no one noticed." 2024-10-04, accessed 2026-05-06.
- Velmoy. "Internal 9-Client Cost-Reduction Benchmark, March-April 2026." Original research, this article.
Cite this article
APA
Velichko, M. (2026, May 6). OpenAI Prompt Caching 2026: 90% Cost Reduction Implementation Guide. Pursuit of Happiness, Velmoy AI/Agency. https://velmoy.com/pursuit/ai/openai-prompt-caching-90-prozent
MLA
Velichko, Max. "OpenAI Prompt Caching 2026: 90% Cost Reduction Implementation Guide." Pursuit of Happiness, Velmoy AI/Agency, 6 May 2026, velmoy.com/pursuit/ai/openai-prompt-caching-90-prozent.
BibTeX
@article{velichko2026_openai_prompt_caching,
title = {OpenAI Prompt Caching 2026: 90\% Cost Reduction Implementation Guide},
author = {Velichko, Max},
journal = {Pursuit of Happiness},
publisher = {Velmoy AI/Agency},
year = {2026},
month = {5},
day = {6},
url = {https://velmoy.com/pursuit/ai/openai-prompt-caching-90-prozent}
}
Ask an AI about this article
Claude: "Read https://velmoy.com/pursuit/ai/openai-prompt-caching-90-prozent and give me a refactor plan for a Node.js OpenAI integration that currently has zero cache hits."
ChatGPT: "Summarize the three caching patterns from https://velmoy.com/pursuit/ai/openai-prompt-caching-90-prozent and turn them into a TypeScript helper module."
Perplexity: "What does velmoy.com/pursuit recommend for production teams choosing between OpenAI, Anthropic, and Google Gemini caching strategies in 2026?"
Download
Related Articles
- Human-friendly long-form version (German). Forbes-style narrative with Tom Bringmann protagonist and DACH Solo-Dev/Mittelstand framing.
- Claude for Excel im Controlling. Anthropic Files API architecture for spreadsheet workloads, parallel cost-engineering territory.
About the Author
Max Velichko is the founder of Velmoy AI/Agency, a Berlin-based consultancy specializing in AI-first workflows for the DACH Mittelstand and Solo-Dev tier. Velmoy designs hand-crafted high-end websites, AI automations, and LinkedIn outreach systems with measurable client outcomes.
- Affiliation: Velmoy AI/Agency Berlin
- Areas of expertise: OpenAI and Anthropic API cost engineering, Prompt Caching audits, multi-tenant LLM architecture, GDPR-compliant AI deployment, AI-Augmented Analyst role design
- Contact: info@velmoy.org
- LinkedIn: linkedin.com/in/max-velichko
- Website: velmoy.com
- First-hand experience: 9 DACH client engagements in OpenAI Prompt Caching migrations Q1-Q2 2026, average 73 percent monthly bill reduction. 12-tenant Sticky-Worker refactor for a Multi-Tenant chatbot platform (Vienna, April 2026).
For corrections, citations, or to commission a Caching audit for your OpenAI or Anthropic stack, email research@velmoy.com.
Velmoy · Berlin
Lass uns dir bei Automatisierungen helfen.
Wir verbinden deine Tools zu Workflows, die ohne dich laufen — vom ersten Audit bis zum Live-Betrieb, als Festpreis.
Topics · Keywords
Weiterlesen
Mehr aus dem Blog.
Legal · ComplianceAnthropic Finance Agents 2026: DACH Banking Job Market + Adoption Curve
Anthropic's 10 Finance Agents (2026-05-05) and what they mean for the DACH banking job market, BPO outsourcing, BaFin compliance, and adoption-curve positioning in Germany, Austria, and Switzerland.
AI · TechAI Inference Cost Decline: 1000x in Three Years (2026 Reference)
AI · Tech