Modern Claude and OpenAI models can think internally before answering. With Anthropic, that thinking arrives as a structured thinking content block alongside the answer, carrying an HMAC signature you must preserve on subsequent turns. Two specific gotchas live here: signatures are model-bound (they fail verification if you switch models mid-session), and thinking blocks count against your input budget and disturb your cache layout on later turns. Get both right or lose continuity, money, and your patience.
Extended thinking
The model thinks, then answers. You pay for the thinking — Anthropic shows it to you partially (with a signature), OpenAI bills it but hides it. Either way, two non-obvious behaviours bite people who don’t read the docs carefully.
Anthropic — thinking blocks have signatures
When Claude is called with thinking enabled, the assistant’s response carries a thinking content block alongside its text and tool blocks:
```json
{
  "role": "assistant",
  "content": [
    { "type": "thinking", "thinking": "...long internal reasoning...", "signature": "abc123..." },
    { "type": "text", "text": "Here's my answer..." }
  ]
}
```
The signature is an HMAC bound to the model. To pass thinking back on the next turn, you must include the signature too — Anthropic verifies it server-side to ensure thinking blocks aren’t forged.
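In practice that means the assistant turn goes back into `messages` verbatim. A minimal sketch with the Anthropic Python SDK (the model name and budget are placeholders, not a recommendation):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

messages = [{"role": "user", "content": "Plan the refactor."}]

resp = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; any thinking-capable model
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 4096},
    messages=messages,
)

# Pass the assistant turn back verbatim: thinking block, signature and all.
# Stripping the signature or editing the thinking text fails server-side
# verification on the next request.
messages.append({"role": "assistant", "content": resp.content})
messages.append({"role": "user", "content": "Now apply step 1."})

resp2 = client.messages.create(
    model="claude-sonnet-4-20250514",  # same model: signatures are model-bound
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 4096},
    messages=messages,
)
```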
Adaptive vs budgeted thinking
Two flavors:
- Budgeted. You cap the thinking at, say, 4096 tokens. Predictable cost, predictable latency.
- Adaptive. The model decides how deep to go. More expensive on hard problems, cheaper on easy. Variance.
Claude Code defaults to adaptive but exposes a budgeted mode. Strix uses three tiers (quick, medium, high) chosen per scan mode; a toy version of that kind of mapping is sketched below.
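A sketch in the spirit of Strix's quick/medium/high tiers. The names and budget numbers here are illustrative, not Strix's actual values:

```python
# Illustrative tier-to-budget mapping -- budgets are made up, not Strix's
# real numbers. Anthropic requires budget_tokens >= 1024.
THINKING_BUDGETS = {
    "quick": 1024,    # shallow tasks: renames, one-line fixes
    "medium": 4096,   # typical multi-file edits
    "high": 16384,    # architecture, debugging, hard math
}

def thinking_config(tier: str) -> dict:
    """Build the `thinking` parameter for a Messages API call."""
    return {"type": "enabled", "budget_tokens": THINKING_BUDGETS[tier]}
```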
Stale thinking
Thinking from an hour ago is suspect — the world has moved on, the agent has moved on, and re-anchoring on outdated reasoning hurts more than it helps. Some implementations clear thinking blocks past a TTL even when the signatures are still valid. Claude Code uses a 1-hour cutoff.
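A sketch of TTL-based stripping. It assumes dict-form content blocks and that we stamped each stored turn with a `ts` epoch timestamp when recording it; that field is ours, not part of the API shape:

```python
import time

THINKING_TTL_S = 3600  # 1-hour cutoff, mirroring Claude Code's behaviour

def strip_stale_thinking(history: list[dict], now: float | None = None) -> list[dict]:
    """Remove thinking blocks from assistant turns older than the TTL."""
    now = time.time() if now is None else now
    cleaned = []
    for msg in history:
        content = msg.get("content")
        is_stale = now - msg.get("ts", now) > THINKING_TTL_S
        if msg.get("role") == "assistant" and is_stale and isinstance(content, list):
            content = [b for b in content if b.get("type") != "thinking"]
            cleaned.append({**msg, "content": content})
        else:
            cleaned.append(msg)
    return cleaned
```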
Caching interaction
Thinking blocks count against your input budget on subsequent turns (you're feeding them back in). They also affect cache hit rate: a long reasoning block emitted in turn 5 becomes part of turn 6's prefix, so stripping or reshuffling it later invalidates everything cached after that point.
Two strategies, both sketched after the list:
- Don’t cache thinking. Place cache breakpoints before the thinking blocks of historical turns, so old thinking sits in the dynamic suffix and doesn’t anchor your cacheable region.
- Drop thinking from history aggressively. Keep it only for the most recent N turns; everything older becomes prose-only.
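Minimal sketches of both, assuming dict-form content blocks throughout. The first uses Anthropic's `cache_control` marker, which caches the prefix up to and including the marked block; the helper names are ours:

```python
def add_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Strategy 1: set the breakpoint on the last block *before* the first
    historical thinking block, so thinking sits in the uncached suffix."""
    last_stable = None
    for msg in messages:
        for block in msg["content"]:
            if block.get("type") == "thinking":
                if last_stable is not None:
                    last_stable["cache_control"] = {"type": "ephemeral"}
                return messages
            last_stable = block
    if last_stable is not None:  # no thinking anywhere: cache the whole prefix
        last_stable["cache_control"] = {"type": "ephemeral"}
    return messages

def drop_old_thinking(messages: list[dict], keep_last_n: int = 2) -> list[dict]:
    """Strategy 2: keep thinking only on the most recent N assistant turns;
    everything older becomes prose-only."""
    assistant_turns = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    recent = set(assistant_turns[-keep_last_n:])
    out = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant" and i not in recent:
            content = [b for b in msg["content"] if b.get("type") != "thinking"]
            out.append({**msg, "content": content})
        else:
            out.append(msg)
    return out
```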
OpenAI o-series reasoning
OpenAI’s reasoning models (o1, o3) handle reasoning server-side. You pay for reasoning_tokens (visible in usage) but never see the content. There’s no signature, no cross-turn submission, no stripping logic.
Less to manage, less to mess up, less to leverage. You can’t re-anchor reasoning across turns because there’s nothing to re-submit.
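The only artifact exposed is a count. A minimal check with the OpenAI Python SDK (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="o3-mini",  # placeholder reasoning model
    messages=[{"role": "user", "content": "Prove sqrt(2) is irrational."}],
)

# Reasoning is billed but never returned; only its token count is visible.
print(resp.usage.completion_tokens_details.reasoning_tokens)
```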
Should you turn thinking on?
- Real engineering (coding, math, multi-step strategy): always-on thinking. Worth the tokens.
- Mixed simple/complex workload where you want adaptive cost: on-demand or adaptive thinking per task.
- Latency-sensitive, simple bots: off. Thinking adds wall-clock you don't want.
- Audit / compliance flow (e.g. the EU AI Act): on, but persist thinking to your audit store, not just the live transcript.
Recommended default: on for any agent that does real engineering. Turn it off only when you have a specific cost or latency reason.
Common pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Strip thinking when re-sending history | model rethinks every turn, cost spikes | preserve thinking + signature for own turns |
| Send thinking after a model switch | API: “signature invalid” | drop thinking on model swap |
| Thinking blocks invalidating cache | hit rate drops to ~0 | place cache_control before historical thinking |
| Show thinking to user inline | confusing UX | "thinking…" spinner; reveal on demand |
Projects that implement this
- Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
- Hermes Agent — 40+ tools, multi-platform agent. Provider adapters per LLM, trajectory compression preserves first/last turns, side-channel auxiliary client.