Prompt caching

Pay 10–50% of input cost on cached tokens. The art is choosing where the static-vs-dynamic boundary lives.

TL;DR

A typical agent session is dominated by repeating the same 30K-token system prompt and a slowly growing 50K-token history. Marking these as cacheable buys you a ~10× steady-state cost reduction. The interesting decision isn’t “should I cache” — it’s where to draw the boundary between cacheable and per-turn content. Get the boundary right and you spend $0.02 per turn; get it wrong and you spend $0.24.

Modern LLM APIs let you mark prefix segments as cacheable: future calls with the same prefix pay a fraction of the input cost. Anthropic charges ~10% for cached tokens; OpenAI ~50%. On a long agent session this is the difference between a sustainable product and one that bleeds money on every turn.

The interesting bit isn’t whether to cache. It’s where you draw the boundary between static (cacheable) and dynamic (per-turn) content.
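As a concrete sketch: with Anthropic's explicit caching, a breakpoint is just a `cache_control` field on the last block of the static prefix (the block shape follows Anthropic's documented API; the helper name here is ours, not from any SDK):

```python
def build_system_blocks(static_prompt: str, session_prefix: str) -> list[dict]:
    """Assemble system blocks with a cache breakpoint after the
    org-static prompt. Everything up to and including the block
    carrying cache_control is the cacheable prefix; blocks after
    it are re-billed at full price each call."""
    return [
        {
            "type": "text",
            "text": static_prompt,                    # identical on every call
            "cache_control": {"type": "ephemeral"},   # breakpoint 1 lives here
        },
        {"type": "text", "text": session_prefix},     # per-session, after the breakpoint
    ]
```

The returned list is what you would pass as the `system` parameter of a Messages API call; the dynamic suffix simply rides along after the marked block.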

Anatomy of a cached agent prompt

```mermaid
flowchart TB
  subgraph SP[System prompt]
    direction LR
    BG[base instructions]
    SK[skills / tool list]
    BD[bounds / scope]
  end
  subgraph DYN[Per-session prefix]
    UM[user / tenant identity]
    TS[session start time]
  end
  subgraph H[Conversation history]
    H1[turn 1] --> H2[turn 2] --> HN[turn N]
  end
  SP --> CB1[cache breakpoint · org-static]
  CB1 --> DYN
  DYN --> CB2[cache breakpoint · session-static]
  CB2 --> H
  H --> Now[Current message]
  class CB1,CB2 cb
```

Two breakpoints split the prompt into org-static, session-static, and turn-dynamic regions.

Breakpoint 1 makes the system prompt reusable across all sessions — even cold-starts can hit a cached prefix. Breakpoint 2 makes it reusable across turns within one session. After the second turn, steady-state cost is dominated by what comes after breakpoint 2.

The non-obvious move: a literal sentinel string for the boundary

Claude Code’s prompt is built from ~15 contributing modules — base instructions, skills, environment hints, MCP-discovered tools, etc. Different teams own different parts. The prompt-as-a-whole has to be split into a static prefix (cacheable across all users) and a dynamic suffix (session-unique).

The split is implemented as a literal string sentinel: __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__. Anywhere in the prompt-building pipeline, find the sentinel, drop a cache_control breakpoint there, continue.

Why a string and not a typed marker object? Because the prompt flows through JSON serialization, template literals, network round-trips. A string survives every transform. A marker object would require every contributor to use a typed builder.
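A minimal sketch of the split (the sentinel string is from the source; the helper name is hypothetical):

```python
# Literal sentinel: survives JSON serialization, template literals,
# and network round-trips because it is just a substring of the prompt.
DYNAMIC_BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__"

def split_on_sentinel(prompt: str) -> tuple[str, str]:
    """Split the assembled prompt into (static_prefix, dynamic_suffix)
    at the sentinel. Contributing modules never see this function;
    they just emit the sentinel wherever the boundary belongs."""
    before, sep, after = prompt.partition(DYNAMIC_BOUNDARY)
    if not sep:                 # no sentinel: treat the whole prompt as static
        return prompt, ""
    return before, after
```

At request-build time, the static prefix gets the `cache_control` breakpoint and the suffix is appended after it, uncached.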

Latched sticky flags — the cache-vs-UX trade

Some flags live in the cacheable region of the system prompt: a fast-mode hint, an AFK mode, a debug toggle. If a user toggles one mid-session, the prompt prefix changes and the entire 50–70K-token cache is invalidated. The next turn pays full input cost.

Claude Code “latches” these flags: once on, they stay on for the rest of the session even if the user toggles them back off.

The user-visible side: “I turned off X but the prompt still says X is on” — surprising. The cost-visible side: 50K cached tokens stay valid all session — predictable.

The team accepted the surprise to keep the cache hot. Generalizing: when caching is load-bearing, treat user-controllable mutations as cache-design problems, not just UX problems. See latched sticky flags.
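The latch itself is tiny. A sketch (class and method names are ours, illustrating the behavior described above, not Claude Code's actual implementation):

```python
class LatchedFlags:
    """Sticky session flags: turning a flag ON is honored immediately,
    turning it OFF is ignored for the rest of the session, so the
    cacheable prompt prefix never changes back and the cache stays hot."""

    def __init__(self) -> None:
        self._on: set[str] = set()

    def set(self, name: str, value: bool) -> None:
        if value:
            self._on.add(name)   # latch: on stays on
        # an explicit "off" is deliberately dropped — see the UX trade above

    def active(self) -> frozenset[str]:
        """Flags to render into the (cached) system prompt region."""
        return frozenset(self._on)
```

The asymmetry is the whole point: only mutations that would invalidate the cached prefix are suppressed.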

Compaction invalidates the cache

Memory compression rewrites the message history. Any cache breakpoint that covered the rewritten region is now invalid.

You feel this on the call after a compaction event: 80K cached tokens gone, full input price applies. There are two coping strategies:

  • Compact rarely, on predictable boundaries. Turn 50, 100, 150. Accept the spike, plan around it.
  • Move compaction to before breakpoint 2. The new summary becomes part of the session-static region; only the (small) tail of recent turns sits in the dynamic suffix. Re-warming costs only the tail tokens, not everything.
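The second strategy can be sketched in a few lines (function and parameter names are hypothetical; `summarize` stands in for whatever summarizer you actually use):

```python
def compact_before_breakpoint(history: list, keep_tail: int = 6, summarize=None):
    """Fold older turns into a summary destined for the session-static
    region (before breakpoint 2), keeping only a short tail of recent
    turns in the turn-dynamic suffix. Returns (summary, tail); summary
    is None when the history is still short enough to leave alone."""
    if len(history) <= keep_tail:
        return None, history
    old, tail = history[:-keep_tail], history[-keep_tail:]
    summary = summarize(old) if summarize else f"[{len(old)} turns summarized]"
    return summary, tail
```

After a compaction, re-warming the cache costs only the tail tokens plus the (new) summary, rather than the full history.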

Either way, track cache hit rate as a first-class metric. When it dips below threshold, alert. Most teams discover prompt-caching pathologies via a billing surprise; the metric catches it sooner.

What goes where

| Region | Stability | What goes here |
| --- | --- | --- |
| Before BP1 (org-static) | invariant for weeks | core instructions, tool schemas, allowed-skills list, regulatory preamble |
| BP1 to BP2 (session-static) | invariant for the session | user/tenant identity, scope/auth, system clock at start, session config |
| After BP2 (turn-dynamic) | varies per turn | conversation history, last tool result, real-time clock, dynamic memory recall |

Provider-managed (OpenAI-style) caching

OpenAI auto-caches on prefixes ≥1024 tokens with no cache_control markers. You pay ~50% on cached tokens. Less control, less effort.

Anthropic explicit caching, or OpenAI auto-caching?

  • Anthropic-only stack: explicit cache_control. Best ratios, full control.
  • Multi-provider, similar workloads: auto-caching where available; lose ~40% of the savings, gain portability.
  • Provider mixed across operations (main + auxiliary): per-provider strategy.
  • Prototype / cost not yet a concern: skip caching. Ship first.

Recommended default: Anthropic explicit if you can. The breakpoints-plus-sentinel pattern is small to maintain and earns the largest savings.

The math

A typical Claude Code session: system prompt ≈ 30K tokens, history grows to ~50K, total ≈ 80K input/turn.

| Scenario | $ per turn (Sonnet 4.6 input pricing) |
| --- | --- |
| No caching | 80K × $3/M = $0.240 |
| Cached system, cold history | 30K × $0.30/M + 50K × $3/M = $0.159 |
| Cached system + cached history | 80K × $0.30/M = $0.024 |

About a 10× cost reduction at steady state. This is the line item that decides whether your product margin works.
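Checking the table's arithmetic in a few lines (pricing figures as given above; $3/M uncached, $0.30/M cached):

```python
# Per-token rates derived from the stated per-million prices.
UNCACHED = 3.00 / 1_000_000
CACHED = 0.30 / 1_000_000

def turn_cost(cached_tokens: int, fresh_tokens: int) -> float:
    """Dollar cost of one turn's input, split by cache status."""
    return cached_tokens * CACHED + fresh_tokens * UNCACHED

no_cache  = turn_cost(0, 80_000)         # nothing cached
cold_hist = turn_cost(30_000, 50_000)    # system cached, history cold
steady    = turn_cost(80_000, 0)         # everything cached
```

Running these reproduces the $0.240 / $0.159 / $0.024 figures, i.e. the ~10× steady-state reduction.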

Projects that implement this

  • Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
  • OpenHands (v0) — All Hands AI's v0 autonomous software-engineer agent. Event-sourced state, microagents, controller-level guardrails.
  • OpenHands (v1) — OpenHands re-architected: cleaner controller, refined memory condenser, improved tool dispatch. v1 of the All-Hands agent.
  • Hermes Agent — 40+ tool, multi-platform agent. Provider adapters per LLM, trajectory compression preserves first/last turns, side-channel auxiliary client.
  • NanoClaw — Tiny Claude-Code-shaped clone. Excellent for studying the irreducible structure of an agent loop without production overhead.
  • OpenClaw — Open-source Claude-Code-style agent reproduction. Bigger than NanoClaw, reveals which patterns scale and which stay minimal.
  • Claude Financial Services — Reference architecture for finance-vertical Claude integrations. Patterns for compliant LLM use in regulated domains.