CodeDocs Vault

4. The Agent Loop and LLM Usage

This is the most interesting doc if you want to learn from this codebase. The agent loop at query.ts:219 and the API layer at services/api/claude.ts together contain the bulk of the system's "intelligence about intelligence" — how it drives the model safely, cheaply, and reliably.

4.1 Shape of the loop

query() at query.ts:219 is a thin wrapper around the actual generator, queryLoop(), an async generator that yields SDKMessage events and maintains all state via reassignment inside a while (true) block. The generator never calls itself; recursion is modeled by reassigning a State object and looping.

Key fields on State (introduced at query.ts:203):

Every loop iteration is one of: continue (reassign state), yield (emit a message), or return (terminal reason).
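The loop shape above can be sketched as follows. State, Transition, and demoStep here are simplified stand-ins for illustration, not the real types in query.ts:

```typescript
// Simplified sketch: one while (true) whose body either reassigns state
// (continue), emits a message (yield), or ends with a terminal reason (return).
interface State { turn: number }

type Transition =
  | { kind: "continue"; nextState: State }
  | { kind: "yield"; message: string; nextState: State }
  | { kind: "return"; reason: string };

async function* queryLoop(
  initial: State,
  step: (s: State) => Transition
): AsyncGenerator<string, string> {
  let state = initial;                       // all state lives in one reassignable object
  while (true) {
    const t = step(state);
    if (t.kind === "continue") { state = t.nextState; continue; }
    if (t.kind === "yield") { state = t.nextState; yield t.message; continue; }
    return t.reason;                         // terminal reason ends the generator
  }
}

// A hypothetical step function: yield once, spin once, then stop.
const demoStep = (s: State): Transition =>
  s.turn === 0 ? { kind: "yield", message: "assistant text", nextState: { turn: 1 } } :
  s.turn === 1 ? { kind: "continue", nextState: { turn: 2 } } :
  { kind: "return", reason: "done" };
```

Recursion-as-reassignment keeps the stack flat no matter how many turns the conversation runs.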

4.2 Streaming and tool dispatch

Streaming

The model is invoked via deps.callModel() (query.ts:659), which wraps queryModel() in claude.ts:1017. It is an async generator; the loop consumes it with a for await, processing content blocks incrementally as they arrive.

The loop does not rely on stop_reason to decide whether to continue (see the comment at query.ts:554). Instead it checks needsFollowUp: if the assistant emitted any tool_use, we need to feed results back.
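The needsFollowUp check reduces to a presence test over content blocks; the types here are simplified stand-ins:

```typescript
// Continuation is driven by whether the assistant asked for tools,
// not by the API's stop_reason.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string };

function needsFollowUp(blocks: ContentBlock[]): boolean {
  // Any tool_use means results must be fed back in a follow-up turn.
  return blocks.some((b) => b.type === "tool_use");
}
```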

Parallel tool execution

If streamingToolExecution is enabled (query.ts:561), a StreamingToolExecutor (query.ts:563-568) starts tools while the model is still streaming. Completed tool results are yielded immediately (query.ts:851-862) and accumulate in toolResults. This lets independent calls run in parallel — a substantial latency win when the model emits multiple tool_use blocks in one turn.

Tools declare whether they can run in parallel with themselves via isConcurrencySafe(input) (Tool.ts:402), and declare their reaction to interrupts via interruptBehavior(): 'cancel' to abort on a new user message, 'block' to finish first (Tool.ts:411-416).
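The tool-side contract can be sketched like this (simplified shapes; ToolLike and readFileTool are hypothetical illustrations, not the real Tool.ts interface):

```typescript
// Sketch of the concurrency/interrupt contract described above.
type InterruptBehavior = "cancel" | "block";

interface ToolLike<In> {
  name: string;
  isConcurrencySafe(input: In): boolean;   // may this call run alongside another call to the same tool?
  interruptBehavior(): InterruptBehavior;  // 'cancel': abort on new user message; 'block': finish first
}

// Example: a read-only tool is safe to parallelize and cheap to cancel.
const readFileTool: ToolLike<{ path: string }> = {
  name: "ReadFile",
  isConcurrencySafe: () => true,
  interruptBehavior: () => "cancel",
};
```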

Fallback and orphan tombstoning

If streaming throws FallbackTriggeredError (query.ts:894), the loop:

  1. Tombstones orphan assistant messages — any in-progress tool_use gets a synthetic tool_result block reading "Model fallback triggered" (query.ts:900-903). This is critical: unpaired tool_use blocks would make subsequent API calls fail with a 400 ("tool_use without tool_result").
  2. Clears the streaming executor and recreates it (query.ts:734-739) — otherwise a stale executor could emit a tool_result for a tool_use that was never logged.
  3. Switches current model (query.ts:896), retries with attemptWithFallback=true (query.ts:897).
  4. Strips thinking signatures for non-matching fallback models (query.ts:928) — a signed thinking block from model A will 400 on model B ("thinking blocks cannot be modified").
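Step 1 can be sketched as follows; Block is a simplified stand-in, and the tombstone text is quoted from above:

```typescript
// Pair every orphan tool_use with a synthetic tool_result so the next API
// call doesn't 400 with "tool_use without tool_result".
type Block =
  | { type: "tool_use"; id: string }
  | { type: "tool_result"; tool_use_id: string; content: string };

function tombstoneOrphans(blocks: Block[]): Block[] {
  const answered = new Set(
    blocks.flatMap((b) => (b.type === "tool_result" ? [b.tool_use_id] : []))
  );
  const tombstones: Block[] = blocks.flatMap((b) =>
    b.type === "tool_use" && !answered.has(b.id)
      ? [{ type: "tool_result" as const, tool_use_id: b.id, content: "Model fallback triggered" }]
      : []
  );
  return [...blocks, ...tombstones];
}
```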

4.3 Thinking (extended thinking)

Extended thinking has strict API invariants that the loop honors (query.ts:151-163):

  1. A thinking block must be part of a query with max_thinking_length > 0.
  2. A thinking block may not be the last block in a content array.
  3. Thinking blocks must be preserved for the entire assistant trajectory (the turn, or if tool_use blocks exist, also the tool_results and the following assistant message).
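Invariant 2 is mechanical enough to sketch as a checker (simplified block type; hypothetical helper):

```typescript
// A thinking block may not be the last block in a content array.
type ThinkingCheckBlock = { type: "thinking" | "text" | "tool_use" };

function satisfiesThinkingNotLast(content: ThinkingCheckBlock[]): boolean {
  return content.length === 0 || content[content.length - 1].type !== "thinking";
}
```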

Adaptive vs budgeted

The adaptive-vs-budgeted choice is made in claude.ts:1596-1629. Adaptive is the default on Opus/Sonnet 4.6+ (utils/thinking.ts:120-129 holds the allowlist; 113-144 describes the decision tree).

Signature handling across turns

Thinking signatures are preserved intact across turns in the conversation (so the API validates them on re-submission). The only time they're stripped is on a mid-run model switch (query.ts:928), because signatures are model-bound. services/api/claude.ts:1632-1637 also describes a sticky "clear all thinking" latch (thinkingClearLatched) that trips if an hour or more has passed since the last API call, on the grounds that the thinking chain is too stale to be productive.
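One way to model the mid-run strip, with simplified types; the signedBy field is a hypothetical bookkeeping device for this sketch, not a real API field:

```typescript
// On a mid-run model switch, signed thinking from the previous model must not
// be re-submitted as-is: signatures are model-bound.
type ThinkBlock = { type: "thinking"; thinking: string; signature?: string; signedBy?: string };
type OtherBlock = { type: "text"; text: string };
type MsgBlock = ThinkBlock | OtherBlock;

function stripForeignSignatures(blocks: MsgBlock[], currentModel: string): MsgBlock[] {
  return blocks.map((b) =>
    b.type === "thinking" && b.signedBy !== currentModel
      ? { type: "thinking" as const, thinking: b.thinking }   // drop the stale signature
      : b
  );
}
```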

4.4 API client and retry (services/api/claude.ts, services/api/withRetry.ts)

Streaming setup

queryModel() at claude.ts:1017 calls withRetry() which creates an SDK stream via anthropic.beta.messages.create({..., stream: true}). The stream is consumed event-by-event. paramsFromContext() (claude.ts:1538-1729) computes beta headers, thinking config, temperature, effort, prompt-cache setup per call.

Retry classification (services/api/withRetry.ts:170-440)

Carefully domain-specific:

Symptom → Response

  401 / 403 revoked → force a client refresh (withRetry.ts:240-250).
  ECONNRESET / EPIPE → disable keep-alive and reconnect (218-229).
  429 / 529 in fast mode with a short retry-after (<1s) → retry in fast mode to preserve cache; a long retry-after enters cooldown and switches to standard speed (267-305).
  Fast mode not enabled by org → disable fast mode and retry (310-314).
  529 with a fallback model available → count consecutive 529s; after 3, throw FallbackTriggeredError for the caller to model-switch (331-350).
  429 for a non-foreground source → bail without retry (318-324).
  Other retryable errors → back off, up to 10 retries (179).
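The classification can be sketched as one decision function. Names, branch ordering, and thresholds here paraphrase the table and are not the real withRetry.ts code; the org-gating and unattended branches are omitted:

```typescript
// Map an error plus request context to one retry action.
type RetryAction =
  | { kind: "refresh-client" }
  | { kind: "reconnect-no-keepalive" }
  | { kind: "retry-fast" }
  | { kind: "cooldown-standard" }
  | { kind: "fallback-model" }
  | { kind: "bail" }
  | { kind: "backoff" };

function classify(e: {
  status?: number; code?: string; fastMode: boolean;
  retryAfterMs?: number; consecutive529s: number; foreground: boolean;
}): RetryAction {
  if (e.status === 401 || e.status === 403) return { kind: "refresh-client" };
  if (e.code === "ECONNRESET" || e.code === "EPIPE") return { kind: "reconnect-no-keepalive" };
  if (e.status === 529 && e.consecutive529s >= 3) return { kind: "fallback-model" };
  if ((e.status === 429 || e.status === 529) && e.fastMode)
    return (e.retryAfterMs ?? Infinity) < 1000
      ? { kind: "retry-fast" }          // short wait: retry in fast mode, preserve cache
      : { kind: "cooldown-standard" };  // long wait: cooldown, switch to standard speed
  if (e.status === 429 && !e.foreground) return { kind: "bail" };
  return { kind: "backoff" };
}
```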

CLAUDE_CODE_UNATTENDED_RETRY (ant-only): 429/529 retry indefinitely with up to 5-min backoff, 30s keep-alive heartbeats so idle-kills don't land (withRetry.ts:96-104).

Prompt caching

Toggled per-model via getPromptCachingEnabled() (claude.ts:333-356). Scope is 'ephemeral' by default; upgraded to ttl: '1h' for eligible users (358-434), and scope: 'global' when no MCP tools require a dynamic boundary — global scope means the cache is shared across organizations for truly static content.

The static/dynamic boundary is a literal sentinel: SYSTEM_PROMPT_DYNAMIC_BOUNDARY = '__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__' (constants/prompts.ts:114-115). Everything BEFORE this string in the system prompt can use scope: 'global'. Everything AFTER contains user/session-specific content. The splitter lives in utils/api.ts:splitSysPromptPrefix and services/api/claude.ts:buildSystemPromptBlocks.
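The split can be sketched directly; the sentinel string is quoted from above, while splitSystemPrompt is a simplified stand-in for the real splitter:

```typescript
// Everything before the sentinel is cacheable with scope:'global';
// everything after is per-session.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__";

function splitSystemPrompt(prompt: string): { staticPrefix: string; dynamicSuffix: string } {
  const i = prompt.indexOf(SYSTEM_PROMPT_DYNAMIC_BOUNDARY);
  if (i === -1) return { staticPrefix: prompt, dynamicSuffix: "" };
  return {
    staticPrefix: prompt.slice(0, i),
    dynamicSuffix: prompt.slice(i + SYSTEM_PROMPT_DYNAMIC_BOUNDARY.length),
  };
}
```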

Latched flags to preserve cache

claude.ts:1405-1442 introduces sticky-on latches for flags whose flip would bust ~50-70K cache tokens: fast mode, AFK mode, cache editing, thinking-clear. Once on, they stay on for the session so mid-session toggles don't invalidate the prefix. Per-call dynamic flags (isAgenticQuery, querySource) are isolated past the boundary.
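The sticky latch is a one-way boolean; a minimal sketch with a hypothetical helper name:

```typescript
// Once a cache-affecting flag has ever been observed on, keep reporting it on
// for the rest of the session so the cached prefix stays byte-stable.
function makeStickyLatch(): (current: boolean) => boolean {
  let latched = false;
  return (current) => {
    if (current) latched = true;
    return latched;
  };
}
```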

Beta header assembly

Headers change with feature mode. getMergedBetas() pulls them together: context-1m-2025-08-07, fast-mode, AFK, cache-editing, tool-search, 1M context, effort, structured outputs, task budgets. Every flag that might appear in a cached prefix is deliberately evaluated before the boundary marker.

4.5 System-prompt composition

constants/prompts.ts:444 exposes getSystemPrompt(tools, model, additionalWorkingDirectories?, mcpClients?) which returns string[] (joined with \n by the API layer). When CLAUDE_CODE_SIMPLE=1 it returns just:

You are Claude Code, Anthropic's official CLI for Claude.

CWD: <cwd>
Date: <date>

Otherwise it assembles two halves:

[STATIC — cacheable with scope:'global']
  getSimpleIntroSection(outputStyleConfig)
  getSimpleSystemSection()
  [getSimpleDoingTasksSection()]                  // unless output style suppresses it
  getActionsSection()
  getUsingYourToolsSection(enabledTools)
  getSimpleToneAndStyleSection()
  getOutputEfficiencySection()

  SYSTEM_PROMPT_DYNAMIC_BOUNDARY                  // literal sentinel string

[DYNAMIC — per-session, resolved via registry]
  session_guidance
  memory (loadMemoryPrompt)
  ant_model_override
  env_info_simple                                 // cwd, git state, platform, model, knowledge cutoff
  language                                        // if user set a language pref
  output_style
  mcp_instructions
  scratchpad
  frc (function result clearing)
  summarize_tool_results
  [numeric_length_anchors]                        // ant-only; 1.2% output-token reduction per eval
  [token_budget]                                  // if feature('TOKEN_BUDGET')
  [brief]                                         // if KAIROS feature

The sections are implemented as systemPromptSection('name', () => promise-of-text) memoized until /clear or /compact. A small number of sections that genuinely need per-turn recomputation use DANGEROUS_uncachedSystemPromptSection (constants/prompts.ts:513).
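The memoization pattern can be sketched like this; memoizedSection is a hypothetical stand-in for systemPromptSection:

```typescript
// Compute a section's text once and reuse it until an explicit reset
// (the real reset points are /clear and /compact).
function memoizedSection(name: string, compute: () => Promise<string>) {
  let cached: Promise<string> | undefined;
  return {
    name,
    get: (): Promise<string> => (cached ??= compute()),
    reset: (): void => { cached = undefined; },
  };
}
```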

Tool descriptions

Tool schemas are built from enabled tools (claude.ts:1235-1246) via toolToAPISchema() with a defer_loading: true flag for deferred tools. When deferred-tool delta attachment is off, the list of deferred tools is injected as a synthetic user message with an XML-tagged block (claude.ts:1330-1345).

User and system context

utils/queryContext.ts:fetchSystemPromptParts() returns { defaultSystemPrompt, userContext, systemContext }. A getCoordinatorUserContext bundle is added when feature('COORDINATOR_MODE') is on (QueryEngine.ts:111-118, 302-308).

4.6 Compaction

services/compact/autoCompact.ts:shouldAutoCompact() decides each turn:

  1. Skip if the current query source is itself compaction-related (session_memory, compact, marble_origami) — recursion guard.
  2. Skip if DISABLE_COMPACT / DISABLE_AUTO_COMPACT env or user setting autoCompactEnabled: false.
  3. In reactive-only mode (tengu_cobalt_raccoon), defer to 413 handling.
  4. In context-collapse mode, collapse owns the budget (90% commit / 95% blocking).
  5. Otherwise compact when tokenCountWithEstimation(messages) - snipTokensFreed ≥ getAutoCompactThreshold(model).

getAutoCompactThreshold() = effectiveContextWindow - AUTOCOMPACT_BUFFER_TOKENS, where effectiveContextWindow = contextWindow - MAX_OUTPUT_TOKENS_FOR_SUMMARY (20K). Buffer = 13K by default, overridable.
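The arithmetic above, worked through with a 200K context window as an assumed illustrative figure:

```typescript
// Default buffer sizes from the doc; the context window is the only input.
const MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20_000;
const AUTOCOMPACT_BUFFER_TOKENS = 13_000;

function getAutoCompactThreshold(contextWindow: number): number {
  const effectiveContextWindow = contextWindow - MAX_OUTPUT_TOKENS_FOR_SUMMARY;
  return effectiveContextWindow - AUTOCOMPACT_BUFFER_TOKENS;
}

// e.g. a 200K window compacts once usage crosses 167K tokens
```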

The compaction process (services/compact/compact.ts:387)

  1. Capture pre-compact state: token count, app state, permission context.
  2. Run pre-compact hooks (user can inject custom instructions).
  3. Strip images and reinjected attachments from the messages to summarize.
  4. Call streamCompactSummary() with the compaction prompt (services/compact/prompt.ts). The compaction prompt is worth quoting — it is reproduced in doc 07.
  5. On PROMPT_TOO_LONG_ERROR_MESSAGE, truncate the oldest API-round groups and retry up to 3 times.
  6. Clear the read-file cache; clear nested memory paths.
  7. Generate post-compact attachments:
    • Up to 5 most-recently-used files under 5K tokens each / 50K total.
    • Deferred tools delta (re-announce tools discovered via ToolSearch).
    • Agent listing delta (re-announce available agents).
    • MCP instructions delta (re-announce connected servers).
    • Plan attachment and plan-mode instructions if in plan mode.
    • Skill attachment if skills have been invoked.
  8. Run post-compact session-start hooks.
  9. Emit a createCompactBoundaryMessage with metadata including preservedSegment.tailUuid (so a mid-compact crash still has a recoverable boundary).
  10. Emit a user-visible summary message with isCompactSummary: true, isVisibleInTranscriptOnly: true.
  11. Log a tengu_compact event with true post-compact context size.
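Step 5's truncate-and-retry can be sketched as follows. "PROMPT_TOO_LONG" here is a simplified stand-in for the real error-message constant, and a "group" is one API request/response round:

```typescript
// On a prompt-too-long error, drop the oldest API-round group and retry,
// up to 3 times.
async function summarizeWithTruncation(
  groups: string[][],
  summarize: (gs: string[][]) => Promise<string>
): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await summarize(groups);
    } catch (e) {
      const isPTL = e instanceof Error && e.message === "PROMPT_TOO_LONG";
      if (!isPTL || attempt >= 3 || groups.length <= 1) throw e;
      groups = groups.slice(1);   // drop the oldest API-round group and retry
    }
  }
}
```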

The RecompactionInfo type (compact.ts:317-323) carries diagnostic fields (isRecompactionInChain, turnsSincePreviousCompact, previousCompactTurnId, autoCompactThreshold, querySource) to distinguish same-chain recompaction from cross-agent or manual triggers.

4.7 Recovery branches you should know about

These are the transition.reason values, all implemented as explicit branches inside query.ts:

Reason → Trigger → Handling

  next_turn → tool results available → merge messages + assistant responses + tool results; re-enter the loop.
  collapse_drain_retry → context collapse drained → re-enter with a narrower context.
  reactive_compact_retry → got a 413 PTL despite the threshold check → compact and retry the same request.
  max_output_tokens_escalate → hit max_output_tokens on the first try → raise 8K to 64K and retry.
  max_output_tokens_recovery → still truncating at 64K → enter a multi-turn "continue" loop so the model completes its thought.
  stop_hook_blocking → a user Stop hook blocked the final message → surface the block to the model and retry.
  token_budget_continuation → user-specified token budget not yet met → auto-continue without asking.
Each reassigns State and loops; no exceptions escape the loop unless maxTurns is exceeded (query.ts:1705).
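One concrete branch from the table, sketched with the reason names quoted above; the helper is a simplified illustration, and only the escalation reason changes the output cap here:

```typescript
// The recovery reasons, plus the output-token escalation from the
// max_output_tokens_escalate branch: 8K on the first try, 64K on retry.
type RecoveryReason =
  | "next_turn" | "collapse_drain_retry" | "reactive_compact_retry"
  | "max_output_tokens_escalate" | "max_output_tokens_recovery"
  | "stop_hook_blocking" | "token_budget_continuation";

function maxOutputTokensFor(reason: RecoveryReason, current: number): number {
  return reason === "max_output_tokens_escalate" ? 64_000 : current;
}
```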

4.8 Stop hooks

query/stopHooks.ts runs when the model stops without a tool_use. Hooks can inject user-prompt-submit messages, veto the stop (triggering stop_hook_blocking), or run memory extraction. The auto-memory extraction path (services/extractMemories/) is a stop hook that runs a fork of the conversation to write memories to disk when a turn completes successfully.

4.9 Cost and usage tracking

cost-tracker.ts accumulates per-model cost, token counts, and API duration. accumulateUsage/updateUsage from services/api/claude.ts track cache hit/miss, cache-creation tokens, web-search requests. Final result SDK messages include total_cost_usd, usage, modelUsage, permission_denials, fast_mode_state, duration_ms, duration_api_ms, num_turns (QueryEngine.ts:618-637).