4. The Agent Loop and LLM Usage
This is the most interesting doc if you want to learn from this codebase. The agent loop at query.ts:219 and the API layer at services/api/claude.ts together contain the bulk of the system's "intelligence about intelligence" — how it drives the model safely, cheaply, and reliably.
4.1 Shape of the loop
query() at query.ts:219 is a thin wrapper around the actual generator queryLoop(). It is an async generator that yields SDKMessage events and maintains all state via reassignment in a while (true) block. The generator never calls itself — recursion is modeled by reassigning a State object and looping.
Key fields on State (introduced at query.ts:203):
- messages: Message[] — the running conversation.
- toolUseContext — per-turn ambient: model, tools, mcpClients, agent defs, readFileState, abort, AppState setters, etc.
- autoCompactTracking — { compacted, turnId, turnCounter, consecutiveFailures } (circuit-breaks compaction after 3 failures, services/compact/autoCompact.ts:70).
- maxOutputTokensRecoveryCount, hasAttemptedReactiveCompact, maxOutputTokensOverride — recovery-branch guards.
- pendingToolUseSummary — a promise that generates a cheap haiku summary during the model stream (overlaps the 1s haiku call with the 5-30s main model call).
- turnCount — incremented at query.ts:1679 before re-entry.
- transition — a { reason: string } object explaining why the last iteration continued. Tests assert on this without inspecting message content. Enumerated values from query.ts:1110, 1162, 1217, 1246, 1302, 1338, 1725:
  - 'next_turn' — normal continuation after tool results.
  - 'collapse_drain_retry' — context-collapse recovery.
  - 'reactive_compact_retry' — reactive compaction after a 413 PTL.
  - 'max_output_tokens_escalate' — raised 8k → 64k and retried.
  - 'max_output_tokens_recovery' — multi-turn recovery loop.
  - 'stop_hook_blocking' — hook error requires retry.
  - 'token_budget_continuation' — auto-continue on budget headroom.
Every loop iteration is one of: continue (reassign state), yield (emit a message), or return (terminal reason).
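The three outcomes can be sketched as a toy generator. All names and the "tools ran" condition below are invented for illustration; only the yield/continue/return structure mirrors the description above.

```typescript
// Toy version of the loop's control shape. Field names and the follow-up
// condition are stand-ins, not the real implementation.
type Transition = { reason: string };

interface LoopState {
  messages: string[];
  turnCount: number;
  transition: Transition | null;
}

function* queryLoopSketch(maxTurns: number): Generator<string, string, void> {
  let state: LoopState = { messages: [], turnCount: 0, transition: null };
  while (true) {
    if (state.turnCount >= maxTurns) {
      return "max_turns_exceeded"; // terminal reason
    }
    const assistantText = `assistant turn ${state.turnCount}`;
    yield assistantText; // emit a message
    const needsFollowUp = state.turnCount < 2; // stand-in for "tool_use emitted"
    if (needsFollowUp) {
      // Recursion is modeled by reassigning state and looping, never by
      // the generator calling itself.
      state = {
        messages: [...state.messages, assistantText],
        turnCount: state.turnCount + 1,
        transition: { reason: "next_turn" },
      };
      continue;
    }
    return "done"; // terminal reason
  }
}
```

Consumers drain the generator with `for...of` (or `for await` in the real async case) and read the terminal reason from the final `done` result.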
4.2 Streaming and tool dispatch
Streaming
The model is invoked via deps.callModel() (query.ts:659) which wraps queryModel() in claude.ts:1017. It is an async generator; the loop consumes it with a for await. As blocks arrive:
- Text blocks go straight to the UI (yield the assistant message).
- Thinking blocks are yielded alongside.
- A tool_use block is pushed into toolUseBlocks (query.ts:829-834) and sets needsFollowUp = true.
The loop does not rely on stop_reason to decide whether to continue (see the comment at query.ts:554). Instead it checks needsFollowUp: if the assistant emitted any tool_use, we need to feed results back.
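A minimal sketch of that rule, with block shapes simplified stand-ins for the real SDK types:

```typescript
// Continuation is decided by whether any tool_use block arrived in the
// turn, not by the stream's stop_reason. Block shapes are simplified.
type Block =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

function needsFollowUp(blocks: Block[]): boolean {
  return blocks.some((b) => b.type === "tool_use");
}
```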
Parallel tool execution
If streamingToolExecution is enabled (query.ts:561), a StreamingToolExecutor (query.ts:563-568) starts tools while the model is still streaming. Completed tool results are yielded immediately (query.ts:851-862) and accumulate in toolResults. This lets independent calls run in parallel — a substantial latency win when the model emits multiple tool_use blocks in one turn.
Tools declare whether they can run in parallel with themselves via isConcurrencySafe(input) (Tool.ts:402). Tools declare reaction to interrupt via interruptBehavior() — 'cancel' to abort on a new user message, 'block' to finish first (Tool.ts:411-416).
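A hypothetical shape for those two hooks; the tools and policies below are illustrative, not the real definitions:

```typescript
// Assumed interface shape for the two capability hooks named above.
interface ToolSketch {
  name: string;
  isConcurrencySafe(input: unknown): boolean;
  interruptBehavior(): "cancel" | "block";
}

// A read-only tool is typically safe to run in parallel with itself; a
// mutating tool is not (assumed policy for illustration).
const readTool: ToolSketch = {
  name: "Read",
  isConcurrencySafe: () => true,
  interruptBehavior: () => "cancel",
};

const writeTool: ToolSketch = {
  name: "Write",
  isConcurrencySafe: () => false,
  interruptBehavior: () => "block",
};

// Scheduler rule: only dispatch a batch concurrently when every call in
// it reports itself concurrency-safe for its input.
function canRunBatchInParallel(
  calls: { tool: ToolSketch; input: unknown }[],
): boolean {
  return calls.every((c) => c.tool.isConcurrencySafe(c.input));
}
```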
Fallback and orphan tombstoning
If streaming throws FallbackTriggeredError (query.ts:894), the loop:
- Tombstones orphan assistant messages — any in-progress tool_use gets a synthetic tool_result block reading "Model fallback triggered" (query.ts:900-903). This is critical: unpaired tool_use blocks would make subsequent API calls fail with a 400 ("tool_use without tool_result").
- Clears the streaming executor and recreates it (query.ts:734-739) — otherwise a stale executor could emit a tool_result for a tool_use that was never logged.
- Switches the current model (query.ts:896) and retries with attemptWithFallback=true (query.ts:897).
- Strips thinking signatures for non-matching fallback models (query.ts:928) — a signed thinking block from model A will 400 on model B ("thinking blocks cannot be modified").
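The tombstoning step can be sketched as follows. Block shapes and the pairing logic are simplified assumptions; only the synthetic result text is from the doc.

```typescript
// For every tool_use in the aborted assistant message that has no matching
// tool_result, emit a synthetic result so the next API call doesn't 400.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; content: string };

function tombstoneOrphans(
  assistantBlocks: ContentBlock[],
  existingResults: ContentBlock[],
): { type: "tool_result"; tool_use_id: string; content: string }[] {
  // Collect the tool_use ids that already have results.
  const answered = new Set<string>();
  for (const b of existingResults) {
    if (b.type === "tool_result") answered.add(b.tool_use_id);
  }
  // Every unanswered tool_use gets a tombstone.
  const tombstones: { type: "tool_result"; tool_use_id: string; content: string }[] = [];
  for (const b of assistantBlocks) {
    if (b.type === "tool_use" && !answered.has(b.id)) {
      tombstones.push({
        type: "tool_result",
        tool_use_id: b.id,
        content: "Model fallback triggered",
      });
    }
  }
  return tombstones;
}
```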
4.3 Thinking (extended thinking)
Extended thinking has strict API invariants that the loop honors (query.ts:151-163):
- A thinking block must be part of a query with max_thinking_length > 0.
- A thinking block may not be the last block in a content array.
- Thinking blocks must be preserved for the entire assistant trajectory (the turn, or if tool_use blocks exist, also the tool_results and the following assistant message).
Adaptive vs budgeted
claude.ts:1596-1629:
- If thinking is globally disabled, skip.
- If modelSupportsAdaptiveThinking(model) and not explicitly disabled (CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING), send { type: 'adaptive' } — the model picks how much to think.
- Otherwise budgeted: { type: 'enabled', budget_tokens: … } where the budget defaults to getMaxThinkingTokensForModel(model) and is clamped to maxOutputTokens - 1.
Adaptive is the default on Opus/Sonnet 4.6+ (utils/thinking.ts:120-129 allowlists; 113-144 describes the decision tree).
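A sketch of that decision tree, with simplified boolean inputs standing in for the real model allowlist and env checks:

```typescript
// Assumed simplification of the claude.ts:1596-1629 decision described
// above: disabled -> null, adaptive when supported, else clamped budget.
type ThinkingConfig =
  | { type: "adaptive" }
  | { type: "enabled"; budget_tokens: number }
  | null;

function pickThinkingConfig(opts: {
  thinkingDisabled: boolean;      // global kill switch
  adaptiveSupported: boolean;     // modelSupportsAdaptiveThinking(model)
  adaptiveDisabledByEnv: boolean; // CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING
  defaultBudget: number;          // getMaxThinkingTokensForModel(model)
  maxOutputTokens: number;
}): ThinkingConfig {
  if (opts.thinkingDisabled) return null;
  if (opts.adaptiveSupported && !opts.adaptiveDisabledByEnv) {
    return { type: "adaptive" };
  }
  // The budget must stay strictly below maxOutputTokens.
  return {
    type: "enabled",
    budget_tokens: Math.min(opts.defaultBudget, opts.maxOutputTokens - 1),
  };
}
```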
Signature handling across turns
Thinking signatures are preserved intact across turns in the conversation (so the API validates them on re-submission). The only time they're stripped is on a mid-run model switch (query.ts:928), because signatures are model-bound. services/api/claude.ts:1632-1637 also describes a sticky "clear all thinking" latch (thinkingClearLatched) that trips if 1+ hour has passed since the last API call — the thinking chain is too stale to be productive.
4.4 API client and retry (services/api/claude.ts, services/api/withRetry.ts)
Streaming setup
queryModel() at claude.ts:1017 calls withRetry() which creates an SDK stream via anthropic.beta.messages.create({..., stream: true}). The stream is consumed event-by-event. paramsFromContext() (claude.ts:1538-1729) computes beta headers, thinking config, temperature, effort, prompt-cache setup per call.
Retry classification (services/api/withRetry.ts:170-440)
The classification is carefully domain-specific:
| Symptom | Response |
|---|---|
| 401 / 403 revoked | Force client refresh (withRetry.ts:240-250). |
| ECONNRESET / EPIPE | Disable keep-alive and reconnect (218-229). |
| 429 / 529 in fast mode, short retry-after (<1s) | Retry in fast mode to preserve cache. Long retry-after → enter cooldown, switch to standard speed (267-305). |
| Fast mode not enabled by org | Disable fast mode and retry (310-314). |
| 529 with fallback model | Count consecutive 529s; after 3 throw FallbackTriggeredError for the caller to model-switch (331-350). |
| 429 for non-foreground source | Bail without retry (318-324). |
| Other retryable | Backoff up to 10 retries (179). |
CLAUDE_CODE_UNATTENDED_RETRY (ant-only): 429/529 retry indefinitely with up to 5-min backoff, 30s keep-alive heartbeats so idle-kills don't land (withRetry.ts:96-104).
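A toy classifier covering a few of the rows above; the error shape and action names are simplified assumptions, and the real withRetry.ts logic is far more detailed.

```typescript
// Assumed simplification: map an error to one of a few coarse actions.
type RetryAction =
  | "refresh_auth"            // 401/403: force client refresh
  | "reconnect_no_keepalive"  // socket-level failures
  | "bail"                    // non-foreground 429: don't retry
  | "backoff";                // everything else retryable

function classifyRetry(err: {
  status?: number;
  code?: string;
  foreground?: boolean;
}): RetryAction {
  if (err.status === 401 || err.status === 403) return "refresh_auth";
  if (err.code === "ECONNRESET" || err.code === "EPIPE") {
    return "reconnect_no_keepalive";
  }
  if (err.status === 429 && err.foreground === false) return "bail";
  return "backoff";
}
```

Note the ordering matters: auth and socket failures are handled before the generic 429 rule, mirroring how the table puts the most specific symptoms first.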
Prompt caching
Toggled per-model via getPromptCachingEnabled() (claude.ts:333-356). Scope is 'ephemeral' by default; upgraded to ttl: '1h' for eligible users (358-434), and scope: 'global' when no MCP tools require a dynamic boundary — global scope means the cache is shared across organizations for truly static content.
The static/dynamic boundary is a literal sentinel: SYSTEM_PROMPT_DYNAMIC_BOUNDARY = '__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__' (constants/prompts.ts:114-115). Everything BEFORE this string in the system prompt can use scope: 'global'. Everything AFTER contains user/session-specific content. The splitter lives in utils/api.ts:splitSysPromptPrefix and services/api/claude.ts:buildSystemPromptBlocks.
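A minimal sketch of splitting at the sentinel. The sentinel string is quoted from the doc; the function name and the no-boundary fallback are assumptions.

```typescript
// Split a composed system prompt into the globally-cacheable static prefix
// and the per-session dynamic suffix, at the literal boundary sentinel.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__";

function splitAtBoundary(prompt: string): {
  staticPrefix: string;
  dynamicSuffix: string;
} {
  const idx = prompt.indexOf(SYSTEM_PROMPT_DYNAMIC_BOUNDARY);
  if (idx === -1) {
    // No boundary: conservatively treat everything as dynamic (assumed
    // fallback, not confirmed by the doc).
    return { staticPrefix: "", dynamicSuffix: prompt };
  }
  return {
    staticPrefix: prompt.slice(0, idx),
    dynamicSuffix: prompt.slice(idx + SYSTEM_PROMPT_DYNAMIC_BOUNDARY.length),
  };
}
```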
Latched flags to preserve cache
claude.ts:1405-1442 introduces sticky-on latches for flags whose flip would bust ~50-70K cache tokens: fast mode, AFK mode, cache editing, thinking-clear. Once on, they stay on for the session so mid-session toggles don't invalidate the prefix. Per-call dynamic flags (isAgenticQuery, querySource) are isolated past the boundary.
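A sticky-on latch is just a one-way boolean. A minimal sketch, with names invented for illustration:

```typescript
// Once the observed flag flips on, the latch stays on for the rest of the
// session, so a mid-session toggle can't invalidate the cached prefix.
class StickyLatch {
  private latched = false;

  // Returns the effective value the prompt builder should use.
  observe(flagOn: boolean): boolean {
    if (flagOn) this.latched = true;
    return this.latched;
  }
}
```

The cost asymmetry justifies the design: keeping a flag on for a few extra turns is cheaper than re-warming a ~50-70K-token cache prefix.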
Beta header assembly
Headers change with feature mode. getMergedBetas() pulls them together: context-1m-2025-08-07, fast-mode, AFK, cache-editing, tool-search, 1M context, effort, structured outputs, task budgets. Every flag that might appear in a cached prefix is deliberately evaluated before the boundary marker.
4.5 System-prompt composition
constants/prompts.ts:444 exposes getSystemPrompt(tools, model, additionalWorkingDirectories?, mcpClients?) which returns string[] (joined with \n by the API layer). When CLAUDE_CODE_SIMPLE=1 it returns just:
You are Claude Code, Anthropic's official CLI for Claude.
CWD: <cwd>
Date: <date>
Otherwise it assembles two halves:
[STATIC — cacheable with scope:'global']
getSimpleIntroSection(outputStyleConfig)
getSimpleSystemSection()
[getSimpleDoingTasksSection()] // unless output style suppresses it
getActionsSection()
getUsingYourToolsSection(enabledTools)
getSimpleToneAndStyleSection()
getOutputEfficiencySection()
SYSTEM_PROMPT_DYNAMIC_BOUNDARY // literal sentinel string
[DYNAMIC — per-session, resolved via registry]
session_guidance
memory (loadMemoryPrompt)
ant_model_override
env_info_simple // cwd, git state, platform, model, knowledge cutoff
language // if user set a language pref
output_style
mcp_instructions
scratchpad
frc (function result clearing)
summarize_tool_results
[numeric_length_anchors] // ant-only; 1.2% output-token reduction per eval
[token_budget] // if feature('TOKEN_BUDGET')
[brief] // if KAIROS feature
The sections are implemented as systemPromptSection('name', () => promise-of-text) memoized until /clear or /compact. A small number of sections that genuinely need per-turn recomputation use DANGEROUS_uncachedSystemPromptSection (constants/prompts.ts:513).
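The memoization pattern can be sketched like this, synchronous for brevity where the real sections return promises; names are illustrative:

```typescript
// Compute a section's text once and reuse it until an explicit reset
// (the doc says the cache clears on /clear or /compact).
function systemPromptSectionSketch(
  name: string,
  compute: () => string,
): { get: () => string; reset: () => void } {
  let cached: string | null = null;
  return {
    get: () => {
      if (cached === null) {
        cached = compute(); // runs at most once per reset cycle
      }
      return cached;
    },
    reset: () => {
      cached = null;
    },
  };
}
```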
Tool descriptions
Tool schemas are built from enabled tools (claude.ts:1235-1246) via toolToAPISchema() with a defer_loading: true flag for deferred tools. When deferred-tool delta attachment is off, the list of deferred tools is injected as a synthetic user message with an XML-tagged block (claude.ts:1330-1345).
User and system context
utils/queryContext.ts:fetchSystemPromptParts() returns { defaultSystemPrompt, userContext, systemContext }:
- userContext — cwd, git branch, git status, shell environment, recent files. Prepended as a user message via prependUserContext() (utils/api.ts).
- systemContext — OS type + release, appended after the system prompt.
A getCoordinatorUserContext bundle is added when feature('COORDINATOR_MODE') is on (QueryEngine.ts:111-118,302-308).
4.6 Compaction
services/compact/autoCompact.ts:shouldAutoCompact() decides each turn:
- Skip if the current query source is itself compaction-related (session_memory, compact, marble_origami) — recursion guard.
- Skip if DISABLE_COMPACT / DISABLE_AUTO_COMPACT env or user setting autoCompactEnabled: false.
- In reactive-only mode (tengu_cobalt_raccoon), defer to 413 handling.
- In context-collapse mode, collapse owns the budget (90% commit / 95% blocking).
- Otherwise compact when tokenCountWithEstimation(messages) - snipTokensFreed ≥ getAutoCompactThreshold(model).
getAutoCompactThreshold() = effectiveContextWindow - AUTOCOMPACT_BUFFER_TOKENS, where effectiveContextWindow = contextWindow - MAX_OUTPUT_TOKENS_FOR_SUMMARY (20K). Buffer = 13K by default, overridable.
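Worked arithmetic for that formula, using the constants stated above; the function name is illustrative.

```typescript
// Constants from the doc: 20K reserved for the compaction summary output,
// 13K default safety buffer.
const MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20_000;
const AUTOCOMPACT_BUFFER_TOKENS = 13_000;

function getAutoCompactThresholdSketch(contextWindow: number): number {
  const effectiveContextWindow = contextWindow - MAX_OUTPUT_TOKENS_FOR_SUMMARY;
  return effectiveContextWindow - AUTOCOMPACT_BUFFER_TOKENS;
}
```

For a 200K context window this yields a threshold of 167K tokens of conversation before auto-compaction kicks in.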
The compaction process (services/compact/compact.ts:387)
- Capture pre-compact state: token count, app state, permission context.
- Run pre-compact hooks (user can inject custom instructions).
- Strip images and reinjected attachments from the messages to summarize.
- Call streamCompactSummary() with the compaction prompt (services/compact/prompt.ts). The compaction prompt is worth quoting — it is reproduced in doc 07.
- On PROMPT_TOO_LONG_ERROR_MESSAGE, truncate the oldest API-round groups and retry up to 3 times.
- Clear the read-file cache; clear nested memory paths.
- Generate post-compact attachments:
- Up to 5 most-recently-used files under 5K tokens each / 50K total.
- Deferred tools delta (re-announce tools discovered via ToolSearch).
- Agent listing delta (re-announce available agents).
- MCP instructions delta (re-announce connected servers).
- Plan attachment and plan-mode instructions if in plan mode.
- Skill attachment if skills have been invoked.
- Run post-compact session-start hooks.
- Emit a createCompactBoundaryMessage with metadata including preservedSegment.tailUuid (so a mid-compact crash still has a recoverable boundary).
- Emit a user-visible summary message with isCompactSummary: true, isVisibleInTranscriptOnly: true.
- Log a tengu_compact event with the true post-compact context size.
The RecompactionInfo type (compact.ts:317-323) carries diagnostic fields (isRecompactionInChain, turnsSincePreviousCompact, previousCompactTurnId, autoCompactThreshold, querySource) to distinguish same-chain recompaction from cross-agent or manual triggers.
4.7 Recovery branches you should know about
These are the transition.reason values, all implemented as explicit branches inside query.ts:
| Reason | Trigger | Handling |
|---|---|---|
| next_turn | Tool results available | Merge messages + assistant responses + tool results; re-enter loop. |
| collapse_drain_retry | Context collapse drained | Re-enter with narrower context. |
| reactive_compact_retry | Got 413 PTL despite threshold check | Compact and retry the same request. |
| max_output_tokens_escalate | Hit max_output_tokens on first try | Raise 8k → 64k, retry. |
| max_output_tokens_recovery | Still truncating at 64k | Enter a multi-turn "continue" loop so the model completes its thought. |
| stop_hook_blocking | A user Stop hook blocked the final message | Surface the block to the model, retry. |
| token_budget_continuation | User-specified token budget not yet met | Auto-continue without asking. |
Each reassigns State and loops; no exceptions escape the loop unless maxTurns is exceeded (query.ts:1705).
4.8 Stop hooks
query/stopHooks.ts runs when the model stops without a tool_use. Hooks can inject user-prompt-submit messages, veto the stop (triggering stop_hook_blocking), or run memory extraction. The auto-memory extraction path (services/extractMemories/) is a stop hook that runs a fork of the conversation to write memories to disk when a turn completes successfully.
4.9 Cost and usage tracking
cost-tracker.ts accumulates per-model cost, token counts, and API duration. accumulateUsage/updateUsage from services/api/claude.ts track cache hit/miss, cache-creation tokens, web-search requests. Final result SDK messages include total_cost_usd, usage, modelUsage, permission_denials, fast_mode_state, duration_ms, duration_api_ms, num_turns (QueryEngine.ts:618-637).