Modern Claude and OpenAI models can think internally before answering. With Anthropic, that thinking arrives as a structured thinking content block alongside the answer, carrying an HMAC signature you must preserve on subsequent turns. Two specific gotchas live here: signatures are model-bound (they fail verification if you switch models mid-session), and thinking blocks count against your input budget and disturb your cache layout on later turns. Get both right or lose continuity, money, and your patience.
Extended thinking
The model thinks, then answers. You pay for the thinking — Anthropic shows it to you partially (with a signature), OpenAI bills it but hides it. Either way, two non-obvious behaviours bite people who don’t read the docs carefully.
Anthropic — thinking blocks have signatures
When Claude is called with thinking enabled, the assistant’s response carries a thinking content block alongside its text and tool blocks:
```json
{
  "role": "assistant",
  "content": [
    { "type": "thinking", "thinking": "...long internal reasoning...", "signature": "abc123..." },
    { "type": "text", "text": "Here's my answer..." }
  ]
}
```
The signature is an HMAC bound to the model. To pass thinking back on the next turn, you must include the signature too — Anthropic verifies it server-side to ensure thinking blocks aren’t forged.
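In practice that means the assistant turn goes back into `messages` verbatim. A minimal sketch with the Anthropic Python SDK (the model name and budget are placeholders, not a recommendation):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

messages = [{"role": "user", "content": "Plan the refactor."}]

resp = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; any thinking-capable model
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 4096},
    messages=messages,
)

# Pass the assistant turn back verbatim: thinking block, signature and all.
# Stripping the signature or editing the thinking text fails server-side
# verification on the next request.
messages.append({"role": "assistant", "content": resp.content})
messages.append({"role": "user", "content": "Now apply step 1."})

resp2 = client.messages.create(
    model="claude-sonnet-4-20250514",  # same model: signatures are model-bound
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 4096},
    messages=messages,
)
```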
Adaptive vs budgeted thinking
Two flavors:
- Budgeted. You cap the thinking at, say, 4096 tokens. Predictable cost, predictable latency.
- Adaptive. The model decides how deep to go. More expensive on hard problems, cheaper on easy. Variance.
Claude Code defaults to adaptive but exposes a budgeted mode. Strix uses three tiers (quick, medium, high) chosen per scan mode; a toy version of that kind of mapping is sketched below.
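A sketch in the spirit of Strix's quick/medium/high tiers. The names and budget numbers here are illustrative, not Strix's actual values:

```python
# Illustrative tier-to-budget mapping -- budgets are made up, not Strix's
# real numbers. Anthropic requires budget_tokens >= 1024.
THINKING_BUDGETS = {
    "quick": 1024,    # shallow tasks: renames, one-line fixes
    "medium": 4096,   # typical multi-file edits
    "high": 16384,    # architecture, debugging, hard math
}

def thinking_config(tier: str) -> dict:
    """Build the `thinking` parameter for a Messages API call."""
    return {"type": "enabled", "budget_tokens": THINKING_BUDGETS[tier]}
```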
Stale thinking
Thinking from an hour ago is suspect — the world has moved on, the agent has moved on, and re-anchoring on outdated reasoning hurts more than it helps. Some implementations clear thinking blocks past a TTL even when the signatures are still valid. Claude Code uses a 1-hour cutoff.
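A sketch of TTL-based stripping. It assumes dict-form content blocks and that we stamped each stored turn with a `ts` epoch timestamp when recording it; that field is ours, not part of the API shape:

```python
import time

THINKING_TTL_S = 3600  # 1-hour cutoff, mirroring Claude Code's behaviour

def strip_stale_thinking(history: list[dict], now: float | None = None) -> list[dict]:
    """Remove thinking blocks from assistant turns older than the TTL."""
    now = time.time() if now is None else now
    cleaned = []
    for msg in history:
        content = msg.get("content")
        is_stale = now - msg.get("ts", now) > THINKING_TTL_S
        if msg.get("role") == "assistant" and is_stale and isinstance(content, list):
            content = [b for b in content if b.get("type") != "thinking"]
            cleaned.append({**msg, "content": content})
        else:
            cleaned.append(msg)
    return cleaned
```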
Caching interaction
Thinking blocks count against your input budget on subsequent turns (you're feeding them back in). They also affect cache hit rate: a long reasoning block emitted in turn 5 becomes part of turn 6's prefix, so stripping or reshuffling it later invalidates everything cached after that point.
Two strategies, both sketched after the list:
- Don’t cache thinking. Place cache breakpoints before the thinking blocks of historical turns, so old thinking sits in the dynamic suffix and doesn’t anchor your cacheable region.
- Drop thinking from history aggressively. Keep it only for the most recent N turns; everything older becomes prose-only.
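Minimal sketches of both, assuming dict-form content blocks throughout. The first uses Anthropic's `cache_control` marker, which caches the prefix up to and including the marked block; the helper names are ours:

```python
def add_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Strategy 1: set the breakpoint on the last block *before* the first
    historical thinking block, so thinking sits in the uncached suffix."""
    last_stable = None
    for msg in messages:
        for block in msg["content"]:
            if block.get("type") == "thinking":
                if last_stable is not None:
                    last_stable["cache_control"] = {"type": "ephemeral"}
                return messages
            last_stable = block
    if last_stable is not None:  # no thinking anywhere: cache the whole prefix
        last_stable["cache_control"] = {"type": "ephemeral"}
    return messages

def drop_old_thinking(messages: list[dict], keep_last_n: int = 2) -> list[dict]:
    """Strategy 2: keep thinking only on the most recent N assistant turns;
    everything older becomes prose-only."""
    assistant_turns = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    recent = set(assistant_turns[-keep_last_n:])
    out = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant" and i not in recent:
            content = [b for b in msg["content"] if b.get("type") != "thinking"]
            out.append({**msg, "content": content})
        else:
            out.append(msg)
    return out
```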
OpenAI o-series reasoning
OpenAI’s reasoning models (o1, o3) handle reasoning server-side. You pay for reasoning_tokens (visible in usage) but never see the content. There’s no signature, no cross-turn submission, no stripping logic.
Less to manage, less to mess up, less to leverage. You can’t re-anchor reasoning across turns because there’s nothing to re-submit.
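The only artifact exposed is a count. A minimal check with the OpenAI Python SDK (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="o3-mini",  # placeholder reasoning model
    messages=[{"role": "user", "content": "Prove sqrt(2) is irrational."}],
)

# Reasoning is billed but never returned; only its token count is visible.
print(resp.usage.completion_tokens_details.reasoning_tokens)
```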
Should you turn thinking on?
- Real engineering (coding, math, multi-step strategy): always-on thinking. Worth the tokens.
- Mixed simple/complex workload where you want adaptive cost: on-demand or adaptive thinking per task.
- Latency-sensitive, simple bots: off. Thinking adds wall-clock you don't want.
- Audit / compliance flow (e.g. the EU AI Act): on, but persist thinking to your audit store, not just the live transcript.
Recommended default: on for any agent that does real engineering. Turn it off only when you have a specific cost or latency reason.
Common pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Strip thinking when re-sending history | model rethinks every turn, cost spikes | preserve thinking + signature for own turns |
| Send thinking after a model switch | API: “signature invalid” | drop thinking on model swap |
| Thinking blocks invalidating cache | hit rate drops to ~0 | place cache_control before historical thinking |
| Show thinking to user inline | confusing UX | "thinking…" spinner; reveal on demand |
Projects that implement this
- Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
- Hermes Agent — 40+ tools, multi-platform agent. Provider adapters per LLM, trajectory compression preserves first/last turns, side-channel auxiliary client.