A typical agent session is dominated by repeating the same 30K-token system prompt and a slowly growing 50K-token history. Marking these as cacheable buys you a ~10× steady-state cost reduction. The interesting decision isn’t “should I cache” — it’s where to draw the boundary between cacheable and per-turn content. Get the boundary right and you spend $0.02 per turn; get it wrong and you spend $0.24.
Prompt caching
Modern LLM APIs let you mark prefix segments as cacheable: future calls with the same prefix pay a fraction of the input cost. Anthropic charges ~10% for cached tokens; OpenAI ~50%. On a long agent session this is the difference between a sustainable product and one that bleeds money on every turn.
The interesting bit isn’t whether to cache. It’s where you draw the boundary between static (cacheable) and dynamic (per-turn) content.
Anatomy of a cached agent prompt
```mermaid
flowchart TB
  subgraph SP[System prompt]
    direction LR
    BG[base instructions]
    SK[skills / tool list]
    BD[bounds / scope]
  end
  subgraph DYN[Per-session prefix]
    UM[user / tenant identity]
    TS[session start time]
  end
  subgraph H[Conversation history]
    H1[turn 1] --> H2[turn 2] --> HN[turn N]
  end
  SP --> CB1[cache breakpoint · org-static]
  CB1 --> DYN
  DYN --> CB2[cache breakpoint · session-static]
  CB2 --> H
  H --> Now[Current message]
  class CB1,CB2 cb
```
Breakpoint 1 makes the system prompt reusable across all sessions — even cold starts can hit a cached prefix. Breakpoint 2 makes it reusable across turns within one session. After the second turn, steady-state cost is dominated by what comes after breakpoint 2.
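A minimal sketch of where those two breakpoints land in an Anthropic Messages API request; the prompt strings, model name, and message content here are placeholders, not values taken from Claude Code:

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Placeholder prompt content; real values would come from the prompt builder.
const ORG_STATIC_PROMPT = "core instructions, tool schemas, allowed-skills list...";
const SESSION_PREFIX = "user/tenant identity, scope, session start time...";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    // Breakpoint 1: identical for every session, so even cold starts can reuse it.
    { type: "text", text: ORG_STATIC_PROMPT, cache_control: { type: "ephemeral" } },
    // Breakpoint 2: stable for the rest of this session.
    { type: "text", text: SESSION_PREFIX, cache_control: { type: "ephemeral" } },
  ],
  messages: [
    // Everything after breakpoint 2 (history plus the current message) is the
    // only part billed at the full input rate once the cache is warm.
    { role: "user", content: "Current message goes here." },
  ],
});

console.log(response.usage);
```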
The non-obvious move: a literal sentinel string for the boundary
Claude Code’s prompt is built from ~15 contributing modules — base instructions, skills, environment hints, MCP-discovered tools, etc. Different teams own different parts. The prompt as a whole has to be split into a static prefix (cacheable across all users) and a dynamic suffix (session-unique).
The split is implemented as a literal string sentinel: __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__. Anywhere in the prompt-building pipeline, find the sentinel, drop a cache_control breakpoint there, continue.
Why a string and not a typed marker object? Because the prompt flows through JSON serialization, template literals, network round-trips. A string survives every transform. A marker object would require every contributor to use a typed builder.
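A sketch of that pipeline step, assuming Anthropic-style system blocks; the function and types are illustrative, not Claude Code's actual source:

```typescript
const DYNAMIC_BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__";

interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// Split the assembled prompt at the sentinel and attach a cache_control
// breakpoint to the static prefix; the dynamic suffix stays uncached here.
function splitSystemPrompt(assembled: string): SystemBlock[] {
  const idx = assembled.indexOf(DYNAMIC_BOUNDARY);
  if (idx === -1) {
    // No sentinel found: treat the whole prompt as cacheable.
    return [{ type: "text", text: assembled, cache_control: { type: "ephemeral" } }];
  }
  return [
    // Everything before the sentinel is the static, org-wide prefix.
    {
      type: "text",
      text: assembled.slice(0, idx),
      cache_control: { type: "ephemeral" },
    },
    // Everything after it is session-unique.
    { type: "text", text: assembled.slice(idx + DYNAMIC_BOUNDARY.length) },
  ];
}
```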
Latched sticky flags — the cache-vs-UX trade
Some flags live in the cacheable region of the system prompt: a fast-mode hint, an AFK mode, a debug toggle. If a user toggles one mid-session, the prompt prefix changes and the entire 50–70K-token cache is invalidated. The next turn pays full input cost.
Claude Code “latches” these flags: once on, they stay on for the rest of the session even if the user toggles them back off.
The user-visible side: “I turned off X but the prompt still says X is on” — surprising. The cost-visible side: 50K cached tokens stay valid all session — predictable.
The team accepted the surprise to keep the cache hot. Generalizing: when caching is load-bearing, treat user-controllable mutations as cache-design problems, not just UX problems. See latched sticky flags.
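A hypothetical sketch of the latch itself; none of these names come from Claude Code:

```typescript
// Once a cache-sensitive toggle has been seen "on", it stays on for the rest
// of the session so the cached system-prompt prefix never changes underneath.
class LatchedFlag {
  private latched = false;

  set(requested: boolean): boolean {
    if (requested) this.latched = true; // latch on the first enable
    return this.latched;                // later "off" requests are ignored
  }
}

const fastMode = new LatchedFlag();
fastMode.set(true);  // user turns fast mode on  -> true
fastMode.set(false); // user turns it back off   -> still true; cache stays hot
```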
Compression and cache retention are inversely related
Memory compression rewrites the message history. Whatever cache breakpoints touched the rewritten region are now invalid.
You feel this on the call after a compaction event: 80K cached tokens gone, full input price applies. There are two coping strategies:
- Compact rarely, on predictable boundaries. Turn 50, 100, 150. Accept the spike, plan around it.
- Move compaction to before breakpoint 2. The new summary becomes part of the session-static region; only the (small) tail of recent turns sits in the dynamic suffix. Re-warming costs only the tail tokens, not everything (see the sketch after this list).
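A sketch of that second strategy, with assumed names and shapes (not taken from any of the projects below): the compaction summary is folded into the session-static region, and only the recent tail stays in the per-turn suffix.

```typescript
// sessionPrefix is the text cached at breakpoint 2; history is everything after it.
interface SessionState {
  sessionPrefix: string;
  history: string[];
}

function compactIntoPrefix(
  state: SessionState,
  summarize: (turns: string[]) => string,
  keepTail = 4,
): SessionState {
  const cutoff = Math.max(0, state.history.length - keepTail);
  const summary = summarize(state.history.slice(0, cutoff));
  return {
    // The summary joins the session-static region; breakpoint 2 is re-written
    // once at compaction time, then stays stable for the following turns.
    sessionPrefix: `${state.sessionPrefix}\n\n<conversation_summary>\n${summary}\n</conversation_summary>`,
    // Only the small tail remains in the turn-dynamic suffix.
    history: state.history.slice(cutoff),
  };
}
```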
Either way, track cache hit rate as a first-class metric. When it dips below a threshold, alert. Most teams discover prompt-caching pathologies via a billing surprise; the metric catches it sooner.
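A sketch of that metric built from the usage fields Anthropic returns on each call (input_tokens, cache_creation_input_tokens, cache_read_input_tokens); the 0.7 threshold and the console.warn alert are placeholders for a real alerting pipeline:

```typescript
interface CacheUsage {
  input_tokens: number;                 // uncached input tokens this call
  cache_creation_input_tokens: number;  // tokens written into the cache
  cache_read_input_tokens: number;      // tokens served from the cache
}

function cacheHitRate(u: CacheUsage): number {
  const total =
    u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens;
  return total === 0 ? 0 : u.cache_read_input_tokens / total;
}

function checkCacheHealth(u: CacheUsage, threshold = 0.7): void {
  const rate = cacheHitRate(u);
  if (rate < threshold) {
    console.warn(`cache hit rate ${rate.toFixed(2)} below ${threshold}: possible prefix invalidation`);
  }
}
```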
What goes where
| Region | Stability | What goes here |
|---|---|---|
| Before BP1 (org-static) | invariant for weeks | core instructions, tool schemas, allowed-skills list, regulatory preamble |
| BP1 to BP2 (session-static) | invariant for the session | user/tenant identity, scope/auth, system clock at start, session config |
| After BP2 (turn-dynamic) | varies per turn | conversation history, last tool result, real-time clock, dynamic memory recall |
Provider-managed (OpenAI-style) caching
OpenAI auto-caches on prefixes ≥1024 tokens with no cache_control markers. You pay ~50% on cached tokens. Less control, less effort.
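With auto-caching there is nothing to mark; you keep the prefix byte-stable and watch the effect in the usage object. A sketch, with placeholder message content:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    // Keep this prefix byte-identical across calls; auto-caching kicks in
    // once the shared prefix exceeds ~1024 tokens.
    { role: "system", content: "...static system prompt goes here..." },
    { role: "user", content: "Current message goes here." },
  ],
});

// The chat completions usage object reports how much of the prefix was cached.
const cached = completion.usage?.prompt_tokens_details?.cached_tokens ?? 0;
console.log(`auto-cached prefix tokens this call: ${cached}`);
```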
- Anthropic-only stack: explicit cache_control. Best ratios, full control.
- Multi-provider, similar workloads: auto-caching where available; lose ~40% of the savings, gain portability.
- Provider mixed across operations (main + auxiliary): per-provider strategy.
- Prototype / cost not yet a concern: skip caching. Ship first.
Recommended default: Anthropic explicit if you can. The 4 breakpoints + sentinel pattern is small to maintain and earns the largest savings.
The math
A typical Claude Code session: system prompt ≈ 30K tokens, history grows to ~50K, total ≈ 80K input/turn.
| Scenario | Calculation (Sonnet 4.6 input pricing) | $ per turn |
|---|---|---|
| No caching | 80K × $3/M | $0.240 |
| Cached system, cold history | 30K × $0.30/M + 50K × $3/M | $0.159 |
| Cached system + cached history | 80K × $0.30/M | $0.024 |
About a 10× cost reduction at steady state. This is the line item that decides whether your product margin works.
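The same arithmetic as a tiny cost model (rates as the table assumes: $3 per million uncached input tokens, cache reads at 10% of that):

```typescript
const INPUT_PER_MTOK = 3.0;                   // $ per million uncached input tokens
const CACHED_PER_MTOK = INPUT_PER_MTOK * 0.1; // cache reads at ~10% of input price

function costPerTurn(cachedTokens: number, uncachedTokens: number): number {
  return (cachedTokens / 1e6) * CACHED_PER_MTOK + (uncachedTokens / 1e6) * INPUT_PER_MTOK;
}

console.log(costPerTurn(0, 80_000));      // 0.240  no caching
console.log(costPerTurn(30_000, 50_000)); // 0.159  cached system, cold history
console.log(costPerTurn(80_000, 0));      // 0.024  fully cached prefix
```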
Projects that implement this
- Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
- OpenHands (v0) — All Hands AI's autonomous software-engineer agent. Event-sourced state, microagents, controller-level guardrails.
- OpenHands (v1) — OpenHands re-architected: cleaner controller, refined memory condenser, improved tool dispatch. v1 of the All Hands agent.
- Hermes Agent — Multi-platform agent with 40+ tools. Provider adapters per LLM, trajectory compression that preserves first/last turns, side-channel auxiliary client.
- NanoClaw — Tiny Claude-Code-shaped clone. Excellent for studying the irreducible structure of an agent loop without production overhead.
- OpenClaw — Open-source Claude-Code-style agent reproduction. Bigger than NanoClaw, reveals which patterns scale and which stay minimal.
- Claude Financial Services — Reference architecture for finance-vertical Claude integrations. Patterns for compliant LLM use in regulated domains.
Related insights
- Sentinel string boundary: survives every refactor, no marker objects to remember to add. Lets dozens of contributors compose one prompt without breaking cross-org cache reuse.
- Thinking signature preservation: lose this and the model re-thinks every turn (cost spike), or you crash on model switch with an opaque signature error.
- Latched sticky flags: user-toggleable flags that live in the system prompt would bust 50-70K cache tokens on every toggle. Latching trades UX flexibility for a 10× cost cut.