Prompt caching

Pay 10–50% of input cost on cached tokens. The art is choosing where the static-vs-dynamic boundary lives.

TL;DR

A typical agent session is dominated by repeating the same 30K-token system prompt and a slowly growing 50K-token history. Marking these as cacheable buys you a ~10× steady-state cost reduction. The interesting decision isn’t “should I cache” — it’s where to draw the boundary between cacheable and per-turn content. Get the boundary right and you spend $0.02 per turn; get it wrong and you spend $0.24.

Modern LLM APIs let you mark prefix segments as cacheable: future calls with the same prefix pay a fraction of the input cost. Anthropic charges ~10% for cached tokens; OpenAI ~50%. On a long agent session this is the difference between a sustainable product and one that bleeds money on every turn.

The interesting bit isn’t whether to cache. It’s where you draw the boundary between static (cacheable) and dynamic (per-turn) content.
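As a concrete sketch: with Anthropic's explicit caching, a breakpoint is just a `cache_control` field on the last block of the static prefix (the block shape follows Anthropic's documented API; the helper name here is ours, not from any SDK):

```python
def build_system_blocks(static_prompt: str, session_prefix: str) -> list[dict]:
    """Assemble system blocks with a cache breakpoint after the
    org-static prompt. Everything up to and including the block
    carrying cache_control is the cacheable prefix; blocks after
    it are re-billed at full price each call."""
    return [
        {
            "type": "text",
            "text": static_prompt,                    # identical on every call
            "cache_control": {"type": "ephemeral"},   # breakpoint 1 lives here
        },
        {"type": "text", "text": session_prefix},     # per-session, after the breakpoint
    ]
```

The returned list is what you would pass as the `system` parameter of a Messages API call; the dynamic suffix simply rides along after the marked block.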

Anatomy of a cached agent prompt

```mermaid
flowchart TB
  subgraph SP[System prompt]
    direction LR
    BG[base instructions]
    SK[skills / tool list]
    BD[bounds / scope]
  end
  subgraph DYN[Per-session prefix]
    UM[user / tenant identity]
    TS[session start time]
  end
  subgraph H[Conversation history]
    H1[turn 1] --> H2[turn 2] --> HN[turn N]
  end
  SP --> CB1[cache breakpoint · org-static]
  CB1 --> DYN
  DYN --> CB2[cache breakpoint · session-static]
  CB2 --> H
  H --> Now[Current message]
  class CB1,CB2 cb
```

Two breakpoints split the prompt into org-static, session-static, and turn-dynamic regions.

Breakpoint 1 makes the system prompt reusable across all sessions — even cold-starts can hit a cached prefix. Breakpoint 2 makes it reusable across turns within one session. After the second turn, steady-state cost is dominated by what comes after breakpoint 2.

The non-obvious move: a literal sentinel string for the boundary

Claude Code’s prompt is built from ~15 contributing modules — base instructions, skills, environment hints, MCP-discovered tools, etc. Different teams own different parts. The prompt-as-a-whole has to be split into a static prefix (cacheable across all users) and a dynamic suffix (session-unique).

The split is implemented as a literal string sentinel: __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__. Anywhere in the prompt-building pipeline, find the sentinel, drop a cache_control breakpoint there, continue.

Why a string and not a typed marker object? Because the prompt flows through JSON serialization, template literals, network round-trips. A string survives every transform. A marker object would require every contributor to use a typed builder.
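A minimal sketch of the split (the sentinel string is from the source; the helper name is hypothetical):

```python
# Literal sentinel: survives JSON serialization, template literals,
# and network round-trips because it is just a substring of the prompt.
DYNAMIC_BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__"

def split_on_sentinel(prompt: str) -> tuple[str, str]:
    """Split the assembled prompt into (static_prefix, dynamic_suffix)
    at the sentinel. Contributing modules never see this function;
    they just emit the sentinel wherever the boundary belongs."""
    before, sep, after = prompt.partition(DYNAMIC_BOUNDARY)
    if not sep:                 # no sentinel: treat the whole prompt as static
        return prompt, ""
    return before, after
```

At request-build time, the static prefix gets the `cache_control` breakpoint and the suffix is appended after it, uncached.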

Latched sticky flags — the cache-vs-UX trade

Some flags live in the cacheable region of the system prompt: a fast-mode hint, an AFK mode, a debug toggle. If a user toggles one mid-session, the prompt prefix changes and the entire 50–70K-token cache is invalidated. The next turn pays full input cost.

Claude Code “latches” these flags: once on, they stay on for the rest of the session even if the user toggles them back off.

The user-visible side: “I turned off X but the prompt still says X is on” — surprising. The cost-visible side: 50K cached tokens stay valid all session — predictable.

The team accepted the surprise to keep the cache hot. Generalizing: when caching is load-bearing, treat user-controllable mutations as cache-design problems, not just UX problems. See latched sticky flags.
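The latch itself is tiny. A sketch (class and method names are ours, illustrating the behavior described above, not Claude Code's actual implementation):

```python
class LatchedFlags:
    """Sticky session flags: turning a flag ON is honored immediately,
    turning it OFF is ignored for the rest of the session, so the
    cacheable prompt prefix never changes back and the cache stays hot."""

    def __init__(self) -> None:
        self._on: set[str] = set()

    def set(self, name: str, value: bool) -> None:
        if value:
            self._on.add(name)   # latch: on stays on
        # an explicit "off" is deliberately dropped — see the UX trade above

    def active(self) -> frozenset[str]:
        """Flags to render into the (cached) system prompt region."""
        return frozenset(self._on)
```

The asymmetry is the whole point: only mutations that would invalidate the cached prefix are suppressed.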

Compaction invalidates the cache

Memory compression rewrites the message history. Any cache breakpoint that covered the rewritten region is now invalid.

You feel this on the call after a compaction event: 80K cached tokens gone, full input price applies. There are two coping strategies:

  • Compact rarely, on predictable boundaries. Turn 50, 100, 150. Accept the spike, plan around it.
  • Move compaction to before breakpoint 2. The new summary becomes part of the session-static region; only the (small) tail of recent turns sits in the dynamic suffix. Re-warming costs only the tail tokens, not everything.
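The second strategy can be sketched in a few lines (function and parameter names are hypothetical; `summarize` stands in for whatever summarizer you actually use):

```python
def compact_before_breakpoint(history: list, keep_tail: int = 6, summarize=None):
    """Fold older turns into a summary destined for the session-static
    region (before breakpoint 2), keeping only a short tail of recent
    turns in the turn-dynamic suffix. Returns (summary, tail); summary
    is None when the history is still short enough to leave alone."""
    if len(history) <= keep_tail:
        return None, history
    old, tail = history[:-keep_tail], history[-keep_tail:]
    summary = summarize(old) if summarize else f"[{len(old)} turns summarized]"
    return summary, tail
```

After a compaction, re-warming the cache costs only the tail tokens plus the (new) summary, rather than the full history.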

Either way, track cache hit rate as a first-class metric. When it dips below threshold, alert. Most teams discover prompt-caching pathologies via a billing surprise; the metric catches it sooner.

What goes where

| Region | Stability | What goes here |
| --- | --- | --- |
| Before BP1 (org-static) | invariant for weeks | core instructions, tool schemas, allowed-skills list, regulatory preamble |
| BP1 to BP2 (session-static) | invariant for the session | user/tenant identity, scope/auth, system clock at start, session config |
| After BP2 (turn-dynamic) | varies per turn | conversation history, last tool result, real-time clock, dynamic memory recall |

Provider-managed (OpenAI-style) caching

OpenAI auto-caches on prefixes ≥1024 tokens with no cache_control markers. You pay ~50% on cached tokens. Less control, less effort.

Anthropic explicit caching, or OpenAI auto-caching?

  • Anthropic-only stack: explicit cache_control. Best ratios, full control.
  • Multi-provider, similar workloads: auto-caching where available; lose ~40% of the savings, gain portability.
  • Provider mixed across operations (main + auxiliary): per-provider strategy.
  • Prototype / cost not yet a concern: skip caching. Ship first.

Recommended default: Anthropic explicit if you can. The breakpoints-plus-sentinel pattern is small to maintain and earns the largest savings.

The math

A typical Claude Code session: system prompt ≈ 30K tokens, history grows to ~50K, total ≈ 80K input/turn.

| Scenario | $ per turn (Sonnet 4.6 input pricing) |
| --- | --- |
| No caching | 80K × $3/M = $0.240 |
| Cached system, cold history | 30K × $0.30/M + 50K × $3/M = $0.159 |
| Cached system + cached history | 80K × $0.30/M = $0.024 |

About a 10× cost reduction at steady state. This is the line item that decides whether your product margin works.
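Checking the table's arithmetic in a few lines (pricing figures as given above; $3/M uncached, $0.30/M cached):

```python
# Per-token rates derived from the stated per-million prices.
UNCACHED = 3.00 / 1_000_000
CACHED = 0.30 / 1_000_000

def turn_cost(cached_tokens: int, fresh_tokens: int) -> float:
    """Dollar cost of one turn's input, split by cache status."""
    return cached_tokens * CACHED + fresh_tokens * UNCACHED

no_cache  = turn_cost(0, 80_000)         # nothing cached
cold_hist = turn_cost(30_000, 50_000)    # system cached, history cold
steady    = turn_cost(80_000, 0)         # everything cached
```

Running these reproduces the $0.240 / $0.159 / $0.024 figures, i.e. the ~10× steady-state reduction.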

Projects that implement this

  • Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
  • OpenHands (v0) — All Hands AI's v0 autonomous software-engineer agent. Event-sourced state, microagents, controller-level guardrails.
  • OpenHands (v1) — OpenHands re-architected: cleaner controller, refined memory condenser, improved tool dispatch. v1 of the All-Hands agent.
  • Hermes Agent — 40+ tool, multi-platform agent. Provider adapters per LLM, trajectory compression preserves first/last turns, side-channel auxiliary client.
  • NanoClaw — Tiny Claude-Code-shaped clone. Excellent for studying the irreducible structure of an agent loop without production overhead.
  • OpenClaw — Open-source Claude-Code-style agent reproduction. Bigger than NanoClaw, reveals which patterns scale and which stay minimal.
  • Claude Financial Services — Reference architecture for finance-vertical Claude integrations. Patterns for compliant LLM use in regulated domains.