4. The Agent Loop and LLM Usage
This is the most interesting doc if you want to learn from this codebase. The agent loop at query.ts:219 and the API layer at services/api/claude.ts together contain the bulk of the system's "intelligence about intelligence" — how it drives the model safely, cheaply, and reliably.
4.1 Shape of the loop
query() at query.ts:219 is a thin wrapper around the actual generator queryLoop(). It is an async generator that yields SDKMessage events and maintains all state via reassignment in a while (true) block. The generator never calls itself — recursion is modeled by reassigning a State object and looping.
Key fields on State (introduced at query.ts:203):
- messages: Message[] — the running conversation.
- toolUseContext — per-turn ambient: model, tools, mcpClients, agent defs, readFileState, abort, AppState setters, etc.
- autoCompactTracking — { compacted, turnId, turnCounter, consecutiveFailures } (circuit-breaks compaction after 3 failures, services/compact/autoCompact.ts:70).
- maxOutputTokensRecoveryCount, hasAttemptedReactiveCompact, maxOutputTokensOverride — recovery-branch guards.
- pendingToolUseSummary — a promise that generates a cheap haiku summary during the model stream (overlaps the 1s haiku call with the 5-30s main model call).
- turnCount — incremented at query.ts:1679 before re-entry.
- transition — a { reason: string } object explaining why the last iteration continued. Tests assert on this without inspecting message content. Enumerated values from query.ts:1110, 1162, 1217, 1246, 1302, 1338, 1725:
  - 'next_turn' — normal continuation after tool results.
  - 'collapse_drain_retry' — context-collapse recovery.
  - 'reactive_compact_retry' — reactive compaction after a 413 PTL.
  - 'max_output_tokens_escalate' — raised 8k → 64k and retried.
  - 'max_output_tokens_recovery' — multi-turn recovery loop.
  - 'stop_hook_blocking' — hook error requires retry.
  - 'token_budget_continuation' — auto-continue on budget headroom.
Every loop iteration is one of: continue (reassign state), yield (emit a message), or return (terminal reason).
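The three outcomes can be sketched as a toy generator. All names and the "tools ran" condition below are invented for illustration; only the yield/continue/return structure mirrors the description above.

```typescript
// Toy version of the loop's control shape. Field names and the follow-up
// condition are stand-ins, not the real implementation.
type Transition = { reason: string };

interface LoopState {
  messages: string[];
  turnCount: number;
  transition: Transition | null;
}

function* queryLoopSketch(maxTurns: number): Generator<string, string, void> {
  let state: LoopState = { messages: [], turnCount: 0, transition: null };
  while (true) {
    if (state.turnCount >= maxTurns) {
      return "max_turns_exceeded"; // terminal reason
    }
    const assistantText = `assistant turn ${state.turnCount}`;
    yield assistantText; // emit a message
    const needsFollowUp = state.turnCount < 2; // stand-in for "tool_use emitted"
    if (needsFollowUp) {
      // Recursion is modeled by reassigning state and looping, never by
      // the generator calling itself.
      state = {
        messages: [...state.messages, assistantText],
        turnCount: state.turnCount + 1,
        transition: { reason: "next_turn" },
      };
      continue;
    }
    return "done"; // terminal reason
  }
}
```

Consumers drain the generator with `for...of` (or `for await` in the real async case) and read the terminal reason from the final `done` result.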
4.2 Streaming and tool dispatch
Streaming
The model is invoked via deps.callModel() (query.ts:659) which wraps queryModel() in claude.ts:1017. It is an async generator; the loop consumes it with a for await. As blocks arrive:
- Text blocks go straight to the UI (yield the assistant message).
- Thinking blocks are yielded alongside.
- A tool_use block is pushed into toolUseBlocks (query.ts:829-834) and sets needsFollowUp = true.
The loop does not rely on stop_reason to decide whether to continue (see the comment at query.ts:554). Instead it checks needsFollowUp: if the assistant emitted any tool_use, we need to feed results back.
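A minimal sketch of that rule, with block shapes simplified stand-ins for the real SDK types:

```typescript
// Continuation is decided by whether any tool_use block arrived in the
// turn, not by the stream's stop_reason. Block shapes are simplified.
type Block =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

function needsFollowUp(blocks: Block[]): boolean {
  return blocks.some((b) => b.type === "tool_use");
}
```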
Parallel tool execution
If streamingToolExecution is enabled (query.ts:561), a StreamingToolExecutor (query.ts:563-568) starts tools while the model is still streaming. Completed tool results are yielded immediately (query.ts:851-862) and accumulate in toolResults. This lets independent calls run in parallel — a substantial latency win when the model emits multiple tool_use blocks in one turn.
Tools declare whether they can run in parallel with themselves via isConcurrencySafe(input) (Tool.ts:402). Tools declare reaction to interrupt via interruptBehavior() — 'cancel' to abort on a new user message, 'block' to finish first (Tool.ts:411-416).
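A hypothetical shape for those two hooks; the tools and policies below are illustrative, not the real definitions:

```typescript
// Assumed interface shape for the two capability hooks named above.
interface ToolSketch {
  name: string;
  isConcurrencySafe(input: unknown): boolean;
  interruptBehavior(): "cancel" | "block";
}

// A read-only tool is typically safe to run in parallel with itself; a
// mutating tool is not (assumed policy for illustration).
const readTool: ToolSketch = {
  name: "Read",
  isConcurrencySafe: () => true,
  interruptBehavior: () => "cancel",
};

const writeTool: ToolSketch = {
  name: "Write",
  isConcurrencySafe: () => false,
  interruptBehavior: () => "block",
};

// Scheduler rule: only dispatch a batch concurrently when every call in
// it reports itself concurrency-safe for its input.
function canRunBatchInParallel(
  calls: { tool: ToolSketch; input: unknown }[],
): boolean {
  return calls.every((c) => c.tool.isConcurrencySafe(c.input));
}
```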
Fallback and orphan tombstoning
If streaming throws FallbackTriggeredError (query.ts:894), the loop:
- Tombstones orphan assistant messages — any in-progress tool_use gets a synthetic tool_result block reading "Model fallback triggered" (query.ts:900-903). This is critical: unpaired tool_use blocks would make subsequent API calls fail with a 400 ("tool_use without tool_result").
- Clears the streaming executor and recreates it (query.ts:734-739) — otherwise a stale executor could emit a tool_result for a tool_use that was never logged.
- Switches the current model (query.ts:896) and retries with attemptWithFallback=true (query.ts:897).
- Strips thinking signatures for non-matching fallback models (query.ts:928) — a signed thinking block from model A will 400 on model B ("thinking blocks cannot be modified").
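The tombstoning step can be sketched as follows. Block shapes and the pairing logic are simplified assumptions; only the synthetic result text is from the doc.

```typescript
// For every tool_use in the aborted assistant message that has no matching
// tool_result, emit a synthetic result so the next API call doesn't 400.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; content: string };

function tombstoneOrphans(
  assistantBlocks: ContentBlock[],
  existingResults: ContentBlock[],
): { type: "tool_result"; tool_use_id: string; content: string }[] {
  // Collect the tool_use ids that already have results.
  const answered = new Set<string>();
  for (const b of existingResults) {
    if (b.type === "tool_result") answered.add(b.tool_use_id);
  }
  // Every unanswered tool_use gets a tombstone.
  const tombstones: { type: "tool_result"; tool_use_id: string; content: string }[] = [];
  for (const b of assistantBlocks) {
    if (b.type === "tool_use" && !answered.has(b.id)) {
      tombstones.push({
        type: "tool_result",
        tool_use_id: b.id,
        content: "Model fallback triggered",
      });
    }
  }
  return tombstones;
}
```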
4.3 Thinking (extended thinking)
Extended thinking has strict API invariants that the loop honors (query.ts:151-163):
- A thinking block must be part of a query with max_thinking_length > 0.
- A thinking block may not be the last block in a content array.
- Thinking blocks must be preserved for the entire assistant trajectory (the turn, or if tool_use blocks exist, also the tool_results and the following assistant message).
Adaptive vs budgeted
claude.ts:1596-1629:
- If thinking is globally disabled, skip.
- If modelSupportsAdaptiveThinking(model) and not explicitly disabled (CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING), send { type: 'adaptive' } — the model picks how much to think.
- Otherwise budgeted: { type: 'enabled', budget_tokens: … } where the budget defaults to getMaxThinkingTokensForModel(model) and is clamped to maxOutputTokens - 1.
Adaptive is the default on Opus/Sonnet 4.6+ (utils/thinking.ts:120-129 allowlists; 113-144 describes the decision tree).
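A sketch of that decision tree, with simplified boolean inputs standing in for the real model allowlist and env checks:

```typescript
// Assumed simplification of the claude.ts:1596-1629 decision described
// above: disabled -> null, adaptive when supported, else clamped budget.
type ThinkingConfig =
  | { type: "adaptive" }
  | { type: "enabled"; budget_tokens: number }
  | null;

function pickThinkingConfig(opts: {
  thinkingDisabled: boolean;      // global kill switch
  adaptiveSupported: boolean;     // modelSupportsAdaptiveThinking(model)
  adaptiveDisabledByEnv: boolean; // CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING
  defaultBudget: number;          // getMaxThinkingTokensForModel(model)
  maxOutputTokens: number;
}): ThinkingConfig {
  if (opts.thinkingDisabled) return null;
  if (opts.adaptiveSupported && !opts.adaptiveDisabledByEnv) {
    return { type: "adaptive" };
  }
  // The budget must stay strictly below maxOutputTokens.
  return {
    type: "enabled",
    budget_tokens: Math.min(opts.defaultBudget, opts.maxOutputTokens - 1),
  };
}
```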
Signature handling across turns
Thinking signatures are preserved intact across turns in the conversation (so the API validates them on re-submission). The only time they're stripped is on a mid-run model switch (query.ts:928), because signatures are model-bound. services/api/claude.ts:1632-1637 also describes a sticky "clear all thinking" latch (thinkingClearLatched) that trips if 1+ hour has passed since the last API call — the thinking chain is too stale to be productive.
4.4 API client and retry (services/api/claude.ts, services/api/withRetry.ts)
Streaming setup
queryModel() at claude.ts:1017 calls withRetry() which creates an SDK stream via anthropic.beta.messages.create({..., stream: true}). The stream is consumed event-by-event. paramsFromContext() (claude.ts:1538-1729) computes beta headers, thinking config, temperature, effort, prompt-cache setup per call.
Retry classification (services/api/withRetry.ts:170-440)
The classification is carefully domain-specific:
| Symptom | Response |
|---|---|
| 401 / 403 revoked | Force client refresh (withRetry.ts:240-250). |
| ECONNRESET / EPIPE | Disable keep-alive and reconnect (218-229). |
| 429 / 529 in fast mode, short retry-after (<1s) | Retry in fast mode to preserve cache. Long retry-after → enter cooldown, switch to standard speed (267-305). |
| Fast mode not enabled by org | Disable fast mode and retry (310-314). |
| 529 with fallback model | Count consecutive 529s; after 3 throw FallbackTriggeredError for the caller to model-switch (331-350). |
| 429 for non-foreground source | Bail without retry (318-324). |
| Other retryable | Backoff up to 10 retries (179). |
CLAUDE_CODE_UNATTENDED_RETRY (ant-only): 429/529 retry indefinitely with up to 5-min backoff, 30s keep-alive heartbeats so idle-kills don't land (withRetry.ts:96-104).
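A toy classifier covering a few of the rows above; the error shape and action names are simplified assumptions, and the real withRetry.ts logic is far more detailed.

```typescript
// Assumed simplification: map an error to one of a few coarse actions.
type RetryAction =
  | "refresh_auth"            // 401/403: force client refresh
  | "reconnect_no_keepalive"  // socket-level failures
  | "bail"                    // non-foreground 429: don't retry
  | "backoff";                // everything else retryable

function classifyRetry(err: {
  status?: number;
  code?: string;
  foreground?: boolean;
}): RetryAction {
  if (err.status === 401 || err.status === 403) return "refresh_auth";
  if (err.code === "ECONNRESET" || err.code === "EPIPE") {
    return "reconnect_no_keepalive";
  }
  if (err.status === 429 && err.foreground === false) return "bail";
  return "backoff";
}
```

Note the ordering matters: auth and socket failures are handled before the generic 429 rule, mirroring how the table puts the most specific symptoms first.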
Prompt caching
Toggled per-model via getPromptCachingEnabled() (claude.ts:333-356). Scope is 'ephemeral' by default; upgraded to ttl: '1h' for eligible users (358-434), and scope: 'global' when no MCP tools require a dynamic boundary — global scope means the cache is shared across organizations for truly static content.
The static/dynamic boundary is a literal sentinel: SYSTEM_PROMPT_DYNAMIC_BOUNDARY = '__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__' (constants/prompts.ts:114-115). Everything BEFORE this string in the system prompt can use scope: 'global'. Everything AFTER contains user/session-specific content. The splitter lives in utils/api.ts:splitSysPromptPrefix and services/api/claude.ts:buildSystemPromptBlocks.
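A minimal sketch of splitting at the sentinel. The sentinel string is quoted from the doc; the function name and the no-boundary fallback are assumptions.

```typescript
// Split a composed system prompt into the globally-cacheable static prefix
// and the per-session dynamic suffix, at the literal boundary sentinel.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__";

function splitAtBoundary(prompt: string): {
  staticPrefix: string;
  dynamicSuffix: string;
} {
  const idx = prompt.indexOf(SYSTEM_PROMPT_DYNAMIC_BOUNDARY);
  if (idx === -1) {
    // No boundary: conservatively treat everything as dynamic (assumed
    // fallback, not confirmed by the doc).
    return { staticPrefix: "", dynamicSuffix: prompt };
  }
  return {
    staticPrefix: prompt.slice(0, idx),
    dynamicSuffix: prompt.slice(idx + SYSTEM_PROMPT_DYNAMIC_BOUNDARY.length),
  };
}
```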
Latched flags to preserve cache
claude.ts:1405-1442 introduces sticky-on latches for flags whose flip would bust ~50-70K cache tokens: fast mode, AFK mode, cache editing, thinking-clear. Once on, they stay on for the session so mid-session toggles don't invalidate the prefix. Per-call dynamic flags (isAgenticQuery, querySource) are isolated past the boundary.
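A sticky-on latch is just a one-way boolean. A minimal sketch, with names invented for illustration:

```typescript
// Once the observed flag flips on, the latch stays on for the rest of the
// session, so a mid-session toggle can't invalidate the cached prefix.
class StickyLatch {
  private latched = false;

  // Returns the effective value the prompt builder should use.
  observe(flagOn: boolean): boolean {
    if (flagOn) this.latched = true;
    return this.latched;
  }
}
```

The cost asymmetry justifies the design: keeping a flag on for a few extra turns is cheaper than re-warming a ~50-70K-token cache prefix.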
Beta header assembly
Headers change with feature mode. getMergedBetas() pulls them together: context-1m-2025-08-07, fast-mode, AFK, cache-editing, tool-search, 1M context, effort, structured outputs, task budgets. Every flag that might appear in a cached prefix is deliberately evaluated before the boundary marker.
4.5 System-prompt composition
constants/prompts.ts:444 exposes getSystemPrompt(tools, model, additionalWorkingDirectories?, mcpClients?) which returns string[] (joined with \n by the API layer). When CLAUDE_CODE_SIMPLE=1 it returns just:
You are Claude Code, Anthropic's official CLI for Claude.
CWD: <cwd>
Date: <date>
Otherwise it assembles two halves:
[STATIC — cacheable with scope:'global']
getSimpleIntroSection(outputStyleConfig)
getSimpleSystemSection()
[getSimpleDoingTasksSection()] // unless output style suppresses it
getActionsSection()
getUsingYourToolsSection(enabledTools)
getSimpleToneAndStyleSection()
getOutputEfficiencySection()
SYSTEM_PROMPT_DYNAMIC_BOUNDARY // literal sentinel string
[DYNAMIC — per-session, resolved via registry]
session_guidance
memory (loadMemoryPrompt)
ant_model_override
env_info_simple // cwd, git state, platform, model, knowledge cutoff
language // if user set a language pref
output_style
mcp_instructions
scratchpad
frc (function result clearing)
summarize_tool_results
[numeric_length_anchors] // ant-only; 1.2% output-token reduction per eval
[token_budget] // if feature('TOKEN_BUDGET')
[brief] // if KAIROS feature
The sections are implemented as systemPromptSection('name', () => promise-of-text) memoized until /clear or /compact. A small number of sections that genuinely need per-turn recomputation use DANGEROUS_uncachedSystemPromptSection (constants/prompts.ts:513).
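The memoization pattern can be sketched like this, synchronous for brevity where the real sections return promises; names are illustrative:

```typescript
// Compute a section's text once and reuse it until an explicit reset
// (the doc says the cache clears on /clear or /compact).
function systemPromptSectionSketch(
  name: string,
  compute: () => string,
): { get: () => string; reset: () => void } {
  let cached: string | null = null;
  return {
    get: () => {
      if (cached === null) {
        cached = compute(); // runs at most once per reset cycle
      }
      return cached;
    },
    reset: () => {
      cached = null;
    },
  };
}
```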
Tool descriptions
Tool schemas are built from enabled tools (claude.ts:1235-1246) via toolToAPISchema() with a defer_loading: true flag for deferred tools. When deferred-tool delta attachment is off, the list of deferred tools is injected as a synthetic user message with an XML-tagged block (claude.ts:1330-1345).
User and system context
utils/queryContext.ts:fetchSystemPromptParts() returns { defaultSystemPrompt, userContext, systemContext }:
- userContext — cwd, git branch, git status, shell environment, recent files. Prepended as a user message via prependUserContext() (utils/api.ts).
- systemContext — OS type + release, appended after the system prompt.
A getCoordinatorUserContext bundle is added when feature('COORDINATOR_MODE') is on (QueryEngine.ts:111-118,302-308).
4.6 Compaction
services/compact/autoCompact.ts:shouldAutoCompact() decides each turn:
- Skip if the current query source is itself compaction-related (session_memory, compact, marble_origami) — recursion guard.
- Skip if DISABLE_COMPACT / DISABLE_AUTO_COMPACT env or user setting autoCompactEnabled: false.
- In reactive-only mode (tengu_cobalt_raccoon), defer to 413 handling.
- In context-collapse mode, collapse owns the budget (90% commit / 95% blocking).
- Otherwise compact when tokenCountWithEstimation(messages) - snipTokensFreed ≥ getAutoCompactThreshold(model).
getAutoCompactThreshold() = effectiveContextWindow - AUTOCOMPACT_BUFFER_TOKENS, where effectiveContextWindow = contextWindow - MAX_OUTPUT_TOKENS_FOR_SUMMARY (20K). Buffer = 13K by default, overridable.
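Worked arithmetic for that formula, using the constants stated above; the function name is illustrative.

```typescript
// Constants from the doc: 20K reserved for the compaction summary output,
// 13K default safety buffer.
const MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20_000;
const AUTOCOMPACT_BUFFER_TOKENS = 13_000;

function getAutoCompactThresholdSketch(contextWindow: number): number {
  const effectiveContextWindow = contextWindow - MAX_OUTPUT_TOKENS_FOR_SUMMARY;
  return effectiveContextWindow - AUTOCOMPACT_BUFFER_TOKENS;
}
```

For a 200K context window this yields a threshold of 167K tokens of conversation before auto-compaction kicks in.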
The compaction process (services/compact/compact.ts:387)
- Capture pre-compact state: token count, app state, permission context.
- Run pre-compact hooks (user can inject custom instructions).
- Strip images and reinjected attachments from the messages to summarize.
- Call streamCompactSummary() with the compaction prompt (services/compact/prompt.ts). The compaction prompt is worth quoting — it is reproduced in doc 07.
- On PROMPT_TOO_LONG_ERROR_MESSAGE, truncate the oldest API-round groups and retry up to 3 times.
- Clear the read-file cache; clear nested memory paths.
- Generate post-compact attachments:
- Up to 5 most-recently-used files under 5K tokens each / 50K total.
- Deferred tools delta (re-announce tools discovered via ToolSearch).
- Agent listing delta (re-announce available agents).
- MCP instructions delta (re-announce connected servers).
- Plan attachment and plan-mode instructions if in plan mode.
- Skill attachment if skills have been invoked.
- Run post-compact session-start hooks.
- Emit a createCompactBoundaryMessage with metadata including preservedSegment.tailUuid (so a mid-compact crash still has a recoverable boundary).
- Emit a user-visible summary message with isCompactSummary: true, isVisibleInTranscriptOnly: true.
- Log a tengu_compact event with the true post-compact context size.
The RecompactionInfo type (compact.ts:317-323) carries diagnostic fields (isRecompactionInChain, turnsSincePreviousCompact, previousCompactTurnId, autoCompactThreshold, querySource) to distinguish same-chain recompaction from cross-agent or manual triggers.
4.7 Recovery branches you should know about
These are the transition.reason values, all implemented as explicit branches inside query.ts:
| Reason | Trigger | Handling |
|---|---|---|
| next_turn | Tool results available | Merge messages + assistant responses + tool results; re-enter loop. |
| collapse_drain_retry | Context collapse drained | Re-enter with narrower context. |
| reactive_compact_retry | Got 413 PTL despite threshold check | Compact and retry the same request. |
| max_output_tokens_escalate | Hit max_output_tokens on first try | Raise 8k → 64k, retry. |
| max_output_tokens_recovery | Still truncating at 64k | Enter a multi-turn "continue" loop so the model completes its thought. |
| stop_hook_blocking | A user Stop hook blocked the final message | Surface the block to the model, retry. |
| token_budget_continuation | User-specified token budget not yet met | Auto-continue without asking. |
Each reassigns State and loops; no exceptions escape the loop unless maxTurns is exceeded (query.ts:1705).
4.8 Stop hooks
query/stopHooks.ts runs when the model stops without a tool_use. Hooks can inject user-prompt-submit messages, veto the stop (triggering stop_hook_blocking), or run memory extraction. The auto-memory extraction path (services/extractMemories/) is a stop hook that runs a fork of the conversation to write memories to disk when a turn completes successfully.
4.9 Cost and usage tracking
cost-tracker.ts accumulates per-model cost, token counts, and API duration. accumulateUsage/updateUsage from services/api/claude.ts track cache hit/miss, cache-creation tokens, web-search requests. Final result SDK messages include total_cost_usd, usage, modelUsage, permission_denials, fast_mode_state, duration_ms, duration_api_ms, num_turns (QueryEngine.ts:618-637).