Hermes Agent - Agent Core & LLM Integration

The Heart: run_agent.py

At 11,500 lines and 585 KB, run_agent.py is the largest file in the codebase. It contains the AIAgent class, which orchestrates the entire conversation lifecycle. This document traces the complete execution flow.

AIAgent Construction (run_agent.py:559-657)

class AIAgent:
    def __init__(self, model, api_key, base_url, api_mode, ...):

Key initialization:

Parameter               Purpose
model                   Model identifier (e.g., anthropic/claude-opus-4.6)
api_key                 Provider API key
base_url                API endpoint URL
api_mode                Protocol: chat_completions, anthropic_messages, codex_responses, bedrock_converse
max_iterations          Cap on tool-calling loops (default: 90)
iteration_budget        Shared IterationBudget across parent + subagents
stream_delta_callback   Real-time text streaming callback
tool_progress_callback  Tool execution progress updates

API Mode Auto-Detection

The constructor detects the correct API mode from provider/URL patterns:

# Simplified logic from run_agent.py:559-657
if provider == "anthropic" or base_url.endswith("/anthropic"):
    api_mode = "anthropic_messages"
elif provider == "bedrock":
    api_mode = "bedrock_converse"
elif provider == "openai-codex":
    api_mode = "codex_responses"
else:
    api_mode = "chat_completions"  # OpenAI-compatible (default)

Main Conversation Cycle: run_conversation() (~line 8103)

This is the most critical method. Here's the complete flow:

Phase 1: Pre-Loop Setup (lines 8103-8465)

1. Install safe stdio (broken pipe guard)
2. Set session context for logging
3. Restore primary runtime if fallback was active
4. Sanitize user input:
   - Strip surrogate characters (invalid UTF-8 from clipboard)
   - Remove leaked <memory-context> blocks
5. Build or load cached system prompt:
   - First turn: _build_system_prompt() via prompt_builder.py
   - Continuation: Load from SessionDB (preserves Anthropic prefix cache)
   - Store snapshot in SQLite for session recovery
6. Preflight context compression:
   - Check if loaded history exceeds model's context threshold
   - Compress BEFORE entering main loop to prevent 4xx errors
7. Plugin hooks:
   - Invoke pre_llm_call for context injection from external providers
   - Inject memory/plugin context into user message (NOT system prompt)

Key insight: Memory and plugin context are injected into the user message, not the system prompt. This preserves the system prompt as a stable prefix for Anthropic's prompt caching. The system prompt is built once per session and reused for every API call.
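
A minimal sketch of that injection pattern (the helper name is hypothetical, not the actual run_agent.py code):

def inject_ephemeral(api_messages: list[dict], extra_context: str) -> None:
    """Append per-call context to the latest user message of the API copy,
    leaving the cached system prompt and the persisted history untouched."""
    if not extra_context:
        return
    for msg in reversed(api_messages):
        if msg["role"] == "user":
            msg["content"] = f"{msg['content']}\n\n{extra_context}"
            return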

Phase 2: Main Tool-Calling Loop (lines 8465-10900)

while (api_call_count < self.max_iterations and 
       self.iteration_budget.remaining > 0) or self._budget_grace_call:

Each iteration:

Step 1: Interrupt & Budget Check (lines 8469-8490)

if self._interrupt_requested:  # User sent new message mid-run
    interrupted = True
    break
api_call_count += 1
if not self.iteration_budget.consume():
    break  # Budget exhausted

The IterationBudget (lines 170-255) is shared across the parent agent and all subagents spawned via delegate_task. This prevents runaway delegation chains.

Step 2: Message Preparation (lines 8525-8641)

  1. Copy conversation history (avoid mutation)
  2. Inject ephemeral context into user message:
    • Memory manager prefetch results (cached once per turn)
    • Plugin hook context from pre_llm_call
    • These are API-call-time ONLY, never persisted to session DB
  3. Handle extended reasoning:
    • Extract reasoning_content from prior assistant messages
    • Pass reasoning_details for multi-turn thinking continuity
  4. Build effective system prompt (cached + ephemeral)
  5. Apply prompt caching breakpoints (Anthropic via OpenRouter; sketched below)
  6. Sanitize: strip orphaned tool results, normalize whitespace

Step 3: API Call with Retry Loop (lines 8675-10135)

while retry_count < max_retries:  # default: 3

A. Build Provider-Specific Request (line 8742):

api_kwargs = self._build_api_kwargs(api_messages)

This constructs payloads specific to each api_mode (see Provider Adapters below).

B. Make Streaming API Call (lines 8806-8810):

if _use_streaming:
    response = self._interruptible_streaming_api_call(api_kwargs, on_first_delta=_stop_spinner)
else:
    response = self._interruptible_api_call(api_kwargs)

Streaming is preferred even when no display consumer is attached because it keeps the call interruptible, stops the spinner on the first delta, and enables partial-stream recovery: text already delivered can be salvaged if the connection drops (see Step 8).

C. Validate Response (lines 8830-8991):

D. Extract Finish Reason (lines 9023-9040):

# Provider-specific normalization:
if api_mode == "anthropic_messages":
    stop_reason_map = {"end_turn": "stop", "tool_use": "tool_calls", ...}
elif api_mode == "codex_responses":
    status = getattr(response, "status", None)
else:  # chat_completions
    finish_reason = response.choices[0].finish_reason

E. Handle Truncation (finish_reason='length') (lines 9042-9216):

F. Track Token Usage (lines 9219-9325):

usage = normalize_usage(response, api_mode)
# Updates: context compressor, session DB, cost estimator
# Tracks: cache hit percentages (Anthropic/OpenRouter prompt caching)
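
A sketch of the normalization idea, based on each provider's documented usage fields (the real normalize_usage also extracts cache-read/write counts):

def normalize_usage_sketch(response, api_mode: str) -> dict:
    """Reduce provider-specific usage objects to one common shape."""
    u = response.usage
    if api_mode in ("anthropic_messages", "codex_responses"):
        return {"prompt": u.input_tokens, "completion": u.output_tokens}
    return {"prompt": u.prompt_tokens, "completion": u.completion_tokens}  # chat_completions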

Step 4: Parse Response (lines 10213-10269)

if api_mode == "codex_responses":
    assistant_message, finish_reason = self._normalize_codex_response(response)
elif api_mode == "anthropic_messages":
    assistant_message, finish_reason = normalize_anthropic_response(response, ...)
else:
    assistant_message = response.choices[0].message

Then _build_assistant_message() (line 6723) normalizes the result into a common message format.

Step 5: Validate Tool Calls (lines 10380-10530)

A. Repair Misspelled Tool Names (lines 10389-10436):

repaired_name = self._repair_tool_call(wrong_name)
# Uses fuzzy matching against registered tools
# Returns error to model for self-correction (max 3 retries)
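
A minimal sketch of the fuzzy-matching idea using difflib (illustrative; the real _repair_tool_call also drives the retry loop):

import difflib

def repair_tool_name(wrong_name: str, registered_tools: list[str]) -> str | None:
    """Return the closest registered tool name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(wrong_name, registered_tools, n=1, cutoff=0.6)
    return matches[0] if matches else None

repair_tool_name("read_fiel", ["read_file", "write_file", "delegate_task"])  # -> "read_file"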

B. Validate JSON Arguments (lines 10440-10527):

C. Guardrails (lines 10532-10538):

Step 6: Execute Tool Calls (lines 10595-10610)

self._execute_tool_calls(assistant_message, messages, effective_task_id, api_call_count)

Dispatches to either concurrent execution (_execute_tool_calls_concurrent, line 7294) or sequential execution (_execute_tool_calls_sequential, line 7532); a sketch of the concurrent path follows the invocation-flow list below.

Tool invocation flow (_invoke_tool, line 7182):

1. Plugin pre-tool-call block check
2. Built-in tools: todo, session_search, memory, clarify, delegate_task
3. External memory provider tools
4. Registry-dispatched tools via handle_function_call()
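
As promised above, a sketch of the concurrent path (hypothetical signature; the real dispatcher also threads through progress callbacks and error wrapping):

from concurrent.futures import ThreadPoolExecutor

def execute_tool_calls_concurrent(tool_calls, invoke_tool):
    """Run independent tool calls in parallel, returning results in the original
    order so each tool message still pairs with its tool_call_id."""
    with ThreadPoolExecutor(max_workers=max(1, min(4, len(tool_calls)))) as pool:
        futures = [pool.submit(invoke_tool, tc) for tc in tool_calls]
        return [f.result() for f in futures]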

Step 7: Post-Tool Processing (lines 10612-10704)

Step 8: Final Response Delivery (lines 10706-10810)

When no tool calls remain:

  1. Validate that the response has actual content (not just <think> blocks)
  2. Partial stream recovery: use already-delivered streamed text if the connection dropped
  3. Prior-turn content fallback: reuse content from a turn that ran only housekeeping tools
  4. Post-tool empty-response nudge: retry with a hint if the model goes silent after tools

Error Recovery Hierarchy

The agent implements a 10-level error recovery strategy, tried in order:

Level 1: Surrogate/encoding errors
         → Sanitize invalid UTF-8, retry (max 2 passes)

Level 2: Authentication (401)
         → Refresh Codex/Nous/Anthropic credentials, retry

Level 3: Thinking signature invalid (400 + reasoning_details)
         → Strip reasoning_details, retry

Level 4: Credential pool rotation (429, 403)
         → Rotate API keys from credential pool

Level 5: Context compression (413, context overflow)
         → Compress middle turns via auxiliary LLM, retry

Level 6: Output cap adjustment (context overflow with available_out)
         → Parse available output tokens from error, reduce max_tokens

Level 7: Primary transport recovery (transient errors)
         → Rebuild HTTP client once, retry

Level 8: Fallback provider activation
         → Switch to configured fallback model/provider

Level 9: Retry with backoff
         → Jittered exponential (base 5s, cap 120s); sketched after this list

Level 10: Abort with guidance
          → Provide actionable error message to user
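
Level 9's delays can be reproduced with a standard full-jitter scheme (the exact formula in run_agent.py may differ):

import random

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    """Jittered exponential backoff: 5s base doubling per attempt, capped at 120s."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))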

Provider Adapters

Anthropic Native (agent/anthropic_adapter.py)

Chat Completions (Internal)

Codex Responses (run_agent.py:4504-4941)

AWS Bedrock (agent/bedrock_adapter.py)

Auxiliary Client (agent/auxiliary_client.py)

Resolves side-task providers (compression, search, vision) via auto-detection:

Text:   OpenRouter → Nous → Custom endpoint → Codex → Anthropic → API-key providers
Vision: Main (if capable) → OpenRouter → Nous → Codex → Anthropic → Custom

Payment fallback: auto-retries with next provider on HTTP 402 (insufficient credits).
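
A sketch of the 402 fallback loop using requests (URLs and payload shape are placeholders, not the auxiliary client's actual API):

import requests

def call_with_payment_fallback(provider_urls: list[str], payload: dict) -> dict:
    """Try providers in priority order, advancing to the next on HTTP 402."""
    for url in provider_urls:
        resp = requests.post(url, json=payload, timeout=60)
        if resp.status_code == 402:   # insufficient credits: rotate to next provider
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("all auxiliary providers returned 402")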

Context Compression (trajectory_compressor.py, agent/context_compressor.py)

Strategy

Messages:  [sys] [user1] [asst1] [tool1] ... [toolN] [userM] [asstM]
            ├─ PROTECTED ─┤  ├── COMPRESSED ──┤  ├─ PROTECTED ──┤
            first system,     middle turns        last N turns
            first human,      (summarized by      (recent context)
            first assistant   auxiliary LLM)
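
A sketch of the partitioning shown in the diagram (keep_last is an assumed knob; the real compressors make finer-grained choices):

def split_for_compression(messages: list[dict], keep_last: int = 6):
    """Partition history per the diagram above (sketch only)."""
    if len(messages) <= 3 + keep_last:
        return messages, [], []          # too short: nothing to compress
    head = messages[:3]                  # first system, first human, first assistant
    middle = messages[3:-keep_last]      # summarized by the auxiliary LLM
    tail = messages[-keep_last:]         # recent turns kept verbatim
    return head, middle, tail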

Configuration

Context Probing

When context-overflow errors occur:

  1. Step down through probe tiers (sketched below)
  2. Cache discovered limits to disk (only values confirmed by provider error messages)
  3. Fall back to guessed tiers (in-memory only, not persisted)
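
A sketch of the tier stepping (the tier values here are assumptions for illustration, not the actual probe table):

PROBE_TIERS = [200_000, 128_000, 64_000, 32_000, 16_000]  # assumed example tiers

def next_probe_limit(current_limit: int) -> int | None:
    """After a context-overflow error, step down to the next smaller tier."""
    for tier in PROBE_TIERS:
        if tier < current_limit:
            return tier
    return None   # no smaller tier left; give up probing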

Streaming & Callbacks

Callback                    Purpose
stream_delta_callback       Text chunks during streaming (feeds TTS pipeline)
reasoning_callback          Structured reasoning blocks
thinking_callback           Spinner animation status
tool_progress_callback      Tool execution progress
interim_assistant_callback  Intermediate messages before tools
step_callback               Per-iteration with prior tool results (gateway hooks)
status_callback             Status updates (rate limits, compression, etc.)

Iteration Budget (run_agent.py:170-255)

import threading

class IterationBudget:
    def __init__(self, total: int = 90):
        self.total = total
        self.remaining = total
        self._lock = threading.Lock()
    
    def consume(self) -> bool:
        with self._lock:
            if self.remaining <= 0:
                return False
            self.remaining -= 1
            return True
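
A standalone demonstration of the shared pool: the same instance is handed to every subagent, so each consume() draws from one counter:

budget = IterationBudget(total=3)   # shared by parent and all subagents
assert budget.consume()             # parent, iteration 1
assert budget.consume()             # subagent spawned via delegate_task
assert budget.consume()             # parent, iteration 2
assert not budget.consume()         # pool exhausted for the whole tree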

Session Persistence (hermes_state.py)

SQLite Schema

CREATE TABLE sessions (
    session_id TEXT PRIMARY KEY,
    parent_session_id TEXT,       -- compression chains
    title TEXT,
    source TEXT,                  -- 'cli', 'telegram', 'discord', etc.
    started_at TIMESTAMP,
    last_active TIMESTAMP,
    message_count INTEGER,
    prompt_tokens INTEGER,
    completion_tokens INTEGER,
    total_tokens INTEGER,
    estimated_cost REAL,
    actual_cost REAL
);
 
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    session_id TEXT,
    role TEXT,                    -- 'system', 'user', 'assistant', 'tool'
    content TEXT,
    tool_calls TEXT,             -- JSON
    tool_call_id TEXT,
    finish_reason TEXT,
    reasoning TEXT,
    created_at TIMESTAMP
);
 
CREATE VIRTUAL TABLE messages_fts USING fts5(content, content=messages);
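
A hypothetical helper showing how the FTS index can be queried (snippet() and rank are standard FTS5 features; the helper itself is not part of hermes_state.py):

import sqlite3

def search_messages(db_path: str, query: str, limit: int = 20):
    """Full-text search over messages_fts, joined back to the messages rows."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            """SELECT m.session_id, m.role,
                      snippet(messages_fts, 0, '[', ']', '...', 12)
               FROM messages_fts
               JOIN messages AS m ON m.id = messages_fts.rowid
               WHERE messages_fts MATCH ?
               ORDER BY rank
               LIMIT ?""",
            (query, limit),
        ).fetchall()
    finally:
        con.close()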

Key Implementation Details