Hermes Agent - Agent Core & LLM Integration
The Heart: run_agent.py
At 11,500 lines and 585 KB, run_agent.py is the largest file in the codebase. It contains the AIAgent class which orchestrates the entire conversation lifecycle. This document traces the complete execution flow.
AIAgent Construction (run_agent.py:559-657)
```python
class AIAgent:
    def __init__(self, model, api_key, base_url, api_mode, ...):
```

Key initialization:
| Parameter | Purpose |
|---|---|
| `model` | Model identifier (e.g., `anthropic/claude-opus-4.6`) |
| `api_key` | Provider API key |
| `base_url` | API endpoint URL |
| `api_mode` | Protocol: `chat_completions`, `anthropic_messages`, `codex_responses`, `bedrock_converse` |
| `max_iterations` | Cap on tool-calling loops (default: 90) |
| `iteration_budget` | Shared `IterationBudget` across parent + subagents |
| `stream_delta_callback` | Real-time text streaming callback |
| `tool_progress_callback` | Tool execution progress updates |
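
A minimal instantiation sketch. Parameter names follow the table above; the concrete values, the callback body, and the import path are illustrative assumptions, not actual project code:

```python
# Hypothetical wiring of an AIAgent instance (values are placeholders).
from run_agent import AIAgent, IterationBudget  # assumed import path

def print_delta(chunk: str) -> None:
    # Real-time streaming consumer (terminal, TTS pipeline, etc.)
    print(chunk, end="", flush=True)

agent = AIAgent(
    model="anthropic/claude-opus-4.6",
    api_key="sk-...",                       # provider API key
    base_url="https://openrouter.ai/api/v1",
    api_mode="chat_completions",            # or leave it to auto-detection
    max_iterations=90,                      # cap on tool-calling loops
    iteration_budget=IterationBudget(90),   # shared with subagents
    stream_delta_callback=print_delta,
)
```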
API Mode Auto-Detection
The constructor detects the correct API mode from provider/URL patterns:
```python
# Simplified logic from run_agent.py:559-657
if provider == "anthropic" or base_url.endswith("/anthropic"):
    api_mode = "anthropic_messages"
elif provider == "bedrock":
    api_mode = "bedrock_converse"
elif provider == "openai-codex":
    api_mode = "codex_responses"
else:
    api_mode = "chat_completions"  # OpenAI-compatible (default)
```

Main Conversation Cycle: run_conversation() (~line 8103)
This is the most critical method. Here's the complete flow:
Phase 1: Pre-Loop Setup (lines 8103-8465)
1. Install safe stdio (broken pipe guard)
2. Set session context for logging
3. Restore primary runtime if fallback was active
4. Sanitize user input:
- Strip surrogate characters (invalid UTF-8 from clipboard)
- Remove leaked <memory-context> blocks
5. Build or load cached system prompt:
- First turn: _build_system_prompt() via prompt_builder.py
- Continuation: Load from SessionDB (preserves Anthropic prefix cache)
- Store snapshot in SQLite for session recovery
6. Preflight context compression:
- Check if loaded history exceeds model's context threshold
- Compress BEFORE entering main loop to prevent 4xx errors
7. Plugin hooks:
- Invoke pre_llm_call for context injection from external providers
- Inject memory/plugin context into user message (NOT system prompt)
Key insight: Memory and plugin context are injected into the user message, not the system prompt. This preserves the system prompt as a stable prefix for Anthropic's prompt caching. The system prompt is built once per session and reused for every API call.
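A minimal sketch of this injection pattern, assuming plain OpenAI-style message dicts; the helper name, tag format, and structure are illustrative rather than the actual run_agent.py code:

```python
def inject_ephemeral_context(api_messages: list[dict], memory_block: str,
                             plugin_block: str) -> list[dict]:
    """Append ephemeral context to the latest user message only.

    The system prompt (index 0) is never touched, so it remains a stable
    prefix that Anthropic/OpenRouter prompt caching can reuse.
    """
    messages = [dict(m) for m in api_messages]  # copy; never mutate history
    extra = "\n\n".join(b for b in (memory_block, plugin_block) if b)
    if not extra:
        return messages
    for msg in reversed(messages):
        if msg["role"] == "user":
            # Injected text lives only in this API payload; the session DB
            # keeps the original user message untouched.
            msg["content"] = (f"{msg['content']}\n\n<ephemeral-context>\n"
                              f"{extra}\n</ephemeral-context>")
            break
    return messages
```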
Phase 2: Main Tool-Calling Loop (lines 8465-10900)
```python
while (api_call_count < self.max_iterations and
       self.iteration_budget.remaining > 0) or self._budget_grace_call:
```

Each iteration:
Step 1: Interrupt & Budget Check (lines 8469-8490)
```python
if self._interrupt_requested:  # User sent new message mid-run
    interrupted = True
    break
api_call_count += 1
if not self.iteration_budget.consume():
    break  # Budget exhausted
```

The IterationBudget (lines 170-255) is shared across the parent agent and all subagents spawned via delegate_task. This prevents runaway delegation chains.
Step 2: Message Preparation (lines 8525-8641)
- Copy conversation history (avoid mutation)
- Inject ephemeral context into user message:
  - Memory manager prefetch results (cached once per turn)
  - Plugin hook context from `pre_llm_call`
  - These are API-call-time ONLY, never persisted to session DB
- Handle extended reasoning:
  - Extract `reasoning_content` from prior assistant messages
  - Pass `reasoning_details` for multi-turn thinking continuity
- Build effective system prompt (cached + ephemeral)
- Apply prompt caching breakpoints (Anthropic via OpenRouter)
- Sanitize: strip orphaned tool results, normalize whitespace
Step 3: API Call with Retry Loop (lines 8675-10135)
```python
while retry_count < max_retries:  # default: 3
```

A. Build Provider-Specific Request (line 8742):

```python
api_kwargs = self._build_api_kwargs(api_messages)
```

This constructs payloads specific to each api_mode:
- `chat_completions`: Standard OpenAI format
- `anthropic_messages`: Anthropic's native format with thinking blocks
- `codex_responses`: ChatGPT Codex OAuth format
- `bedrock_converse`: AWS Bedrock Converse format
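
A simplified dispatch sketch of this branching; the exact field names each adapter produces are an assumption about the respective wire formats, not the real _build_api_kwargs implementation:

```python
def build_api_kwargs(api_mode: str, model: str, messages: list[dict],
                     max_tokens: int) -> dict:
    """Reshape one conversation into the payload each protocol expects (illustrative)."""
    if api_mode == "anthropic_messages":
        # Anthropic keeps the system prompt outside the messages array.
        system, rest = messages[0]["content"], messages[1:]
        return {"model": model, "system": system, "messages": rest,
                "max_tokens": max_tokens}
    if api_mode == "bedrock_converse":
        return {"modelId": model,
                "messages": [{"role": m["role"],
                              "content": [{"text": m["content"]}]}
                             for m in messages if m["role"] != "system"]}
    if api_mode == "codex_responses":
        return {"model": model, "input": messages}
    # chat_completions: OpenAI-compatible default
    return {"model": model, "messages": messages, "max_tokens": max_tokens}
```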
B. Make Streaming API Call (lines 8806-8810):
```python
if _use_streaming:
    response = self._interruptible_streaming_api_call(api_kwargs, on_first_delta=_stop_spinner)
else:
    response = self._interruptible_api_call(api_kwargs)
```

Streaming is preferred even without display consumers because it enables:
- Health checking: 90-second stale-stream detection
- Read timeout: 60-second per-chunk timeout
- Interrupt handling: User can cancel mid-stream
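
A sketch of why streaming enables these checks: each received chunk gives the loop a chance to enforce a staleness deadline and honor cancellation. The chunk attribute, helper name, and event wiring are assumptions; the per-chunk read timeout is assumed to be configured on the HTTP client itself:

```python
import threading
import time

READ_TIMEOUT_S = 60   # per-chunk timeout (enforced by the HTTP client, assumed)
STALE_STREAM_S = 90   # abort if no text delta has arrived for this long

def drain_stream(stream, interrupted: threading.Event, on_delta) -> str:
    """Consume a streaming response defensively (illustrative).

    `stream` yields chunks with a `.delta` text attribute (assumption);
    keepalive/empty chunks still drive the loop so the checks below fire.
    """
    last_delta_at = time.monotonic()
    parts = []
    for chunk in stream:
        if interrupted.is_set():
            raise InterruptedError("user cancelled mid-stream")
        now = time.monotonic()
        if now - last_delta_at > STALE_STREAM_S:
            raise TimeoutError("stale stream: no delta for 90s")
        text = getattr(chunk, "delta", "") or ""
        if text:
            last_delta_at = now
            parts.append(text)
            on_delta(text)          # feeds stream_delta_callback
    return "".join(parts)
```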
C. Validate Response (lines 8830-8991):
- Check for None, empty output, missing fields
- Extract error metadata and codes (429, 504, 524)
- Attempt eager fallback for rate-limit/empty responses
- Jittered exponential backoff (base 5s, cap 120s)
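
The backoff mentioned above (base 5s, cap 120s) can be sketched as follows; the exact jitter scheme used in run_agent.py may differ:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    """Jittered exponential backoff: base * 2**attempt, capped at `cap`."""
    delay = min(cap, base * (2 ** attempt))
    # Jitter between 50% and 100% of the computed delay (assumed scheme).
    return random.uniform(delay * 0.5, delay)

def call_with_retry(make_call, max_retries: int = 3):
    """Retry a callable with jittered backoff between attempts."""
    for attempt in range(max_retries):
        try:
            return make_call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff_delay(attempt))
```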
D. Extract Finish Reason (lines 9023-9040):
```python
# Provider-specific normalization:
if api_mode == "anthropic_messages":
    stop_reason_map = {"end_turn": "stop", "tool_use": "tool_calls", ...}
elif api_mode == "codex_responses":
    status = getattr(response, "status", None)
else:  # chat_completions
    finish_reason = response.choices[0].finish_reason
```

E. Handle Truncation (finish_reason='length') (lines 9042-9216):
- Detect thinking-budget exhaustion (thinking blocks with no visible text)
- Request continuation for text-only truncation (up to 3 attempts)
- Roll back to last complete turn if continuation fails
F. Track Token Usage (lines 9219-9325):
```python
usage = normalize_usage(response, api_mode)
# Updates: context compressor, session DB, cost estimator
# Tracks: cache hit percentages (Anthropic/OpenRouter prompt caching)
```

Step 4: Parse Response (lines 10213-10269)
```python
if api_mode == "codex_responses":
    assistant_message, finish_reason = self._normalize_codex_response(response)
elif api_mode == "anthropic_messages":
    assistant_message, finish_reason = normalize_anthropic_response(response, ...)
else:
    assistant_message = response.choices[0].message
```

Then _build_assistant_message() (line 6723) normalizes into a common format:
- Extracts reasoning from structured fields or `<think>` blocks
- Preserves `reasoning_details` for multi-turn continuity
- Normalizes `tool_calls` with deterministic IDs
- Preserves `extra_content` (Gemini `thought_signature`)
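
A reduced sketch of this normalization, assuming OpenAI-SDK-style message objects (`content`, `reasoning_content`, `tool_calls` with `.function` attributes); the real _build_assistant_message() handles many more fields:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def build_assistant_message(raw) -> dict:
    """Normalize a provider message into a common dict (simplified)."""
    content = getattr(raw, "content", "") or ""
    reasoning = getattr(raw, "reasoning_content", None)
    if reasoning is None:
        # Fall back to inline <think> blocks emitted by some models.
        blocks = THINK_RE.findall(content)
        reasoning = "\n".join(blocks) if blocks else None
        content = THINK_RE.sub("", content).strip()
    tool_calls = [
        {"id": tc.id or f"call_{i}",          # deterministic fallback IDs
         "name": tc.function.name,
         "arguments": tc.function.arguments}
        for i, tc in enumerate(getattr(raw, "tool_calls", None) or [])
    ]
    return {"role": "assistant", "content": content,
            "reasoning": reasoning, "tool_calls": tool_calls}
```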
Step 5: Validate Tool Calls (lines 10380-10530)
A. Repair Misspelled Tool Names (lines 10389-10436):
```python
repaired_name = self._repair_tool_call(wrong_name)
# Uses fuzzy matching against registered tools
# Returns error to model for self-correction (max 3 retries)
```

B. Validate JSON Arguments (lines 10440-10527):
- Treats empty strings as `{}`
- Detects truncation vs formatting errors
- Injects recovery tool results for invalid JSON (max 3 retries)
C. Guardrails (lines 10532-10538):
- Caps `delegate_task` calls (one per turn)
- Deduplicates identical tool calls in same turn
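
A condensed sketch of the repair and validation helpers described in steps A and B above. The difflib-based fuzzy matching and the return shape are assumptions; the actual _repair_tool_call logic may differ:

```python
import difflib
import json

def repair_tool_name(name: str, registered: list[str]) -> str | None:
    """Fuzzy-match a misspelled tool name against the registry."""
    matches = difflib.get_close_matches(name, registered, n=1, cutoff=0.6)
    return matches[0] if matches else None

def parse_tool_arguments(raw: str) -> tuple[dict | None, str | None]:
    """Return (args, error); empty strings are treated as {}."""
    if not raw or not raw.strip():
        return {}, None
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # The caller injects this message as a recovery tool result so the
        # model can correct itself (up to 3 retries).
        return None, f"Invalid JSON arguments: {exc}"
```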
Step 6: Execute Tool Calls (lines 10595-10610)
```python
self._execute_tool_calls(assistant_message, messages, effective_task_id, api_call_count)
```

Dispatches to:
Concurrent execution (_execute_tool_calls_concurrent, line 7294):
- Thread pool for independent calls
- Read-only tools always parallel
- File I/O only parallel when target paths don't overlap
- Results collected in original order
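
A sketch of the concurrent path, assuming the calls passed in have already been classified as parallel-safe (read-only, or file I/O with non-overlapping target paths); `run_one` is a stand-in for the single-call executor:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_tool_calls_concurrent(tool_calls, run_one):
    """Run parallel-safe tool calls in a thread pool (illustrative).

    Results are returned in the original submission order regardless of
    which call finishes first.
    """
    if not tool_calls:
        return []
    with ThreadPoolExecutor(max_workers=min(8, len(tool_calls))) as pool:
        futures = [pool.submit(run_one, call) for call in tool_calls]
        # Iterating futures in submission order preserves the original order.
        return [f.result() for f in futures]
```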
Sequential execution (_execute_tool_calls_sequential, line 7532):
- Inline invocation with per-tool display handling
Tool invocation flow (_invoke_tool, line 7182):
1. Plugin pre-tool-call block check
2. Built-in tools: todo, session_search, memory, clarify, delegate_task
3. External memory provider tools
4. Registry-dispatched tools via handle_function_call()
Step 7: Post-Tool Processing (lines 10612-10704)
- Refund iteration budget if only `execute_code` was called (cheap RPC-style turn)
- Trigger context compression if approaching threshold
- Save session log incrementally for interruption recovery
Step 8: Final Response Delivery (lines 10706-10810)
When no tool calls remain:
- Validate response has actual content (not just `<think>` blocks)
- Partial stream recovery: use already-delivered streamed text if the connection dropped
- Prior-turn content fallback: use a housekeeping-tool-only turn's content
- Post-tool empty-response nudge: retry with a hint if the model goes silent after tools
Error Recovery Hierarchy
The agent implements a 10-level error recovery strategy, tried in order:
Level 1: Surrogate/encoding errors
→ Sanitize invalid UTF-8, retry (max 2 passes)
Level 2: Authentication (401)
→ Refresh Codex/Nous/Anthropic credentials, retry
Level 3: Thinking signature invalid (400 + reasoning_details)
→ Strip reasoning_details, retry
Level 4: Credential pool rotation (429, 403)
→ Rotate API keys from credential pool
Level 5: Context compression (413, context overflow)
→ Compress middle turns via auxiliary LLM, retry
Level 6: Output cap adjustment (context overflow with available_out)
→ Parse available output tokens from error, reduce max_tokens
Level 7: Primary transport recovery (transient errors)
→ Rebuild HTTP client once, retry
Level 8: Fallback provider activation
→ Switch to configured fallback model/provider
Level 9: Retry with backoff
→ Jittered exponential (base 5s, cap 120s)
Level 10: Abort with guidance
→ Provide actionable error message to user
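
A compressed sketch of how the first few levels might be selected from an error. The status codes follow the list above; the classification function and string heuristics are illustrative only:

```python
def pick_recovery(status: int | None, message: str) -> str:
    """Map an API error to a recovery strategy (simplified ordering)."""
    text = message.lower()
    if "surrogate" in text or "codec" in text:
        return "sanitize_and_retry"          # Level 1
    if status == 401:
        return "refresh_credentials"         # Level 2
    if status == 400 and "reasoning_details" in text:
        return "strip_reasoning_details"     # Level 3
    if status in (429, 403):
        return "rotate_credential_pool"      # Level 4
    if status == 413 or ("context" in text and "length" in text):
        return "compress_context"            # Level 5
    return "retry_with_backoff"              # Levels 7-9 fall through here
```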
Provider Adapters
Anthropic Native (agent/anthropic_adapter.py)
- OAuth tokens (`sk-ant-oat*`), API keys (`sk-ant-api*`), Claude Code credentials
- Adaptive thinking budget: `xhigh` → `max`, `high` → `high`, etc.
- Beta headers: `interleaved-thinking-2025-05-14`
- Output token limits per model: Claude 4.6 = 128K, Haiku 4.5 = 64K
- Auto-inject `cache_control` breakpoints for prompt caching
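
A minimal sketch of what a cache breakpoint looks like on the Anthropic Messages API; where the real adapter places breakpoints (system prompt, history boundaries) is more involved than shown here:

```python
def cacheable_system_prompt(system_prompt: str) -> list[dict]:
    """Mark the system prompt as a cacheable prefix (Anthropic Messages API)."""
    return [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},  # reusable prefix cache entry
    }]
```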
Chat Completions (Internal)
- Standard OpenAI-compatible API
- Works with: OpenRouter, Nous Portal, Gemini, Kimi, MiniMax, custom endpoints
- Tool calls normalized between OpenAI format and internal representation
Codex Responses (run_agent.py:4504-4941)
- ChatGPT Codex OAuth backend via `https://chatgpt.com/backend-api/codex`
- Wraps Responses API to look like chat.completions
- Preserves `call_id` and `response_item_id` for multi-turn continuity
AWS Bedrock (agent/bedrock_adapter.py)
- Bedrock Converse API for Claude models on AWS
- Auto-detected from `provider="bedrock"` or a URL containing `bedrock-runtime`
Auxiliary Client (agent/auxiliary_client.py)
Resolves side-task providers (compression, search, vision) via auto-detection:
Text: OpenRouter → Nous → Custom endpoint → Codex → Anthropic → API-key providers
Vision: Main (if capable) → OpenRouter → Nous → Codex → Anthropic → Custom
Payment fallback: auto-retries with next provider on HTTP 402 (insufficient credits).
Context Compression (trajectory_compressor.py, agent/context_compressor.py)
Strategy
```
Messages: [sys] [user1] [asst1] [tool1] ... [toolN] [userM] [asstM]
          ├── PROTECTED ───┤ ├── COMPRESSED ───┤ ├── PROTECTED ──┤
          first system,        middle turns        last N turns
          first human,         (summarized by      (recent context)
          first assistant      auxiliary LLM)
```
Configuration
- Trigger: When tokens >= 50% of model's context window (`threshold: 0.50`)
- Target: Keep 20% of threshold as recent context (`target_ratio: 0.20`)
- Protected: First system + first human + first assistant + last N turns
- Method: Summarize middle region via auxiliary LLM, replace with single human message
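
A sketch of the region split and trigger check using the ratios above. The `summarize` callable stands in for the auxiliary LLM, and the fixed `keep_last` tail is a simplification of the real sizing logic (which derives the tail from `target_ratio`):

```python
def maybe_compress(messages, used_tokens, context_window, summarize,
                   threshold=0.50, keep_last=6):
    """Compress middle turns once usage crosses `threshold` of the window."""
    if used_tokens < context_window * threshold:
        return messages                          # under the trigger: no-op
    protected_head = messages[:3]                # first system/human/assistant
    protected_tail = messages[-keep_last:]       # last N turns, kept verbatim
    middle = messages[3:-keep_last]
    if not middle:
        return messages
    summary = summarize(middle)                  # auxiliary LLM call
    return protected_head + [
        {"role": "user",
         "content": f"[Summary of earlier conversation]\n{summary}"}
    ] + protected_tail
```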
Context Probing
When context-overflow errors occur:
- Step down through probe tiers
- Cache discovered limits to disk (only confirmed values from error messages)
- Fall back to guessed tiers (in-memory only, not persisted)
Streaming & Callbacks
| Callback | Purpose |
|---|---|
| `stream_delta_callback` | Text chunks during streaming (feeds TTS pipeline) |
| `reasoning_callback` | Structured reasoning blocks |
| `thinking_callback` | Spinner animation status |
| `tool_progress_callback` | Tool execution progress |
| `interim_assistant_callback` | Intermediate messages before tools |
| `step_callback` | Per-iteration with prior tool results (gateway hooks) |
| `status_callback` | Status updates (rate limits, compression, etc.) |
Iteration Budget (run_agent.py:170-255)
```python
import threading

class IterationBudget:
    def __init__(self, total: int = 90):
        self.total = total
        self.remaining = total
        self._lock = threading.Lock()

    def consume(self) -> bool:
        with self._lock:
            if self.remaining <= 0:
                return False
            self.remaining -= 1
            return True
```

- Shared across parent agent and all subagents (via `delegate_task`)
- Each LLM turn consumes one unit
- Grace call: one extra attempt when budget hits zero (lets the model finish)
- Prevents runaway delegation chains from burning through API credits
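
Usage sketch: the parent agent and a subagent draw from the same object, so the pool is shared (the delegate wiring itself is not shown):

```python
budget = IterationBudget(total=90)

# The parent agent and every subagent spawned via delegate_task hold a
# reference to this SAME object, so all LLM turns draw from one pool.
if budget.consume():            # a parent turn
    print(budget.remaining)     # 89
if budget.consume():            # a subagent turn, same shared pool
    print(budget.remaining)     # 88
```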
Session Persistence (hermes_state.py)
SQLite Schema
```sql
CREATE TABLE sessions (
    session_id TEXT PRIMARY KEY,
    parent_session_id TEXT,   -- compression chains
    title TEXT,
    source TEXT,              -- 'cli', 'telegram', 'discord', etc.
    started_at TIMESTAMP,
    last_active TIMESTAMP,
    message_count INTEGER,
    prompt_tokens INTEGER,
    completion_tokens INTEGER,
    total_tokens INTEGER,
    estimated_cost REAL,
    actual_cost REAL
);

CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    session_id TEXT,
    role TEXT,                -- 'system', 'user', 'assistant', 'tool'
    content TEXT,
    tool_calls TEXT,          -- JSON
    tool_call_id TEXT,
    finish_reason TEXT,
    reasoning TEXT,
    created_at TIMESTAMP
);

CREATE VIRTUAL TABLE messages_fts USING fts5(content, content=messages);
```

Key Implementation Details
- WAL mode for concurrent readers + single writer (gateway multi-platform)
- `BEGIN IMMEDIATE` with jittered retry (20-150ms) to avoid SQLite convoy
- Checkpoint every 50 writes to manage WAL file growth
- FTS5 for full-text session search across all historical conversations
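
A sketch of this write path using the standard-library sqlite3 module. The retry bounds follow the 20-150ms figure above; the helper names and attempt count are illustrative:

```python
import random
import sqlite3
import time

def open_db(path: str) -> sqlite3.Connection:
    # isolation_level=None (autocommit) so explicit BEGIN IMMEDIATE works.
    conn = sqlite3.connect(path, timeout=5.0, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")   # concurrent readers, one writer
    return conn

def write_with_retry(conn: sqlite3.Connection, sql: str, params: tuple,
                     attempts: int = 5) -> None:
    """BEGIN IMMEDIATE with jittered retry to sidestep the SQLite write convoy."""
    for i in range(attempts):
        try:
            conn.execute("BEGIN IMMEDIATE")    # take the write lock up front
            conn.execute(sql, params)
            conn.commit()
            return
        except sqlite3.OperationalError:       # typically "database is locked"
            conn.rollback()
            if i == attempts - 1:
                raise
            time.sleep(random.uniform(0.020, 0.150))   # 20-150ms jitter
```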