Hermes Agent - Agent Core & LLM Integration
The Heart: run_agent.py
At 11,500 lines and 585 KB, run_agent.py is the largest file in the codebase. It contains the AIAgent class which orchestrates the entire conversation lifecycle. This document traces the complete execution flow.
AIAgent Construction (run_agent.py:559-657)
```python
class AIAgent:
    def __init__(self, model, api_key, base_url, api_mode, ...):
```

Key initialization:
| Parameter | Purpose |
|---|---|
| `model` | Model identifier (e.g., `anthropic/claude-opus-4.6`) |
| `api_key` | Provider API key |
| `base_url` | API endpoint URL |
| `api_mode` | Protocol: `chat_completions`, `anthropic_messages`, `codex_responses`, `bedrock_converse` |
| `max_iterations` | Cap on tool-calling loops (default: 90) |
| `iteration_budget` | Shared `IterationBudget` across parent + subagents |
| `stream_delta_callback` | Real-time text streaming callback |
| `tool_progress_callback` | Tool execution progress updates |
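
A minimal instantiation sketch. Parameter names follow the table above; the concrete values, the callback body, and the import path are illustrative assumptions, not actual project code:

```python
# Hypothetical wiring of an AIAgent instance (values are placeholders).
from run_agent import AIAgent, IterationBudget  # assumed import path

def print_delta(chunk: str) -> None:
    # Real-time streaming consumer (terminal, TTS pipeline, etc.)
    print(chunk, end="", flush=True)

agent = AIAgent(
    model="anthropic/claude-opus-4.6",
    api_key="sk-...",                       # provider API key
    base_url="https://openrouter.ai/api/v1",
    api_mode="chat_completions",            # or leave it to auto-detection
    max_iterations=90,                      # cap on tool-calling loops
    iteration_budget=IterationBudget(90),   # shared with subagents
    stream_delta_callback=print_delta,
)
```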
API Mode Auto-Detection
The constructor detects the correct API mode from provider/URL patterns:
```python
# Simplified logic from run_agent.py:559-657
if provider == "anthropic" or base_url.endswith("/anthropic"):
    api_mode = "anthropic_messages"
elif provider == "bedrock":
    api_mode = "bedrock_converse"
elif provider == "openai-codex":
    api_mode = "codex_responses"
else:
    api_mode = "chat_completions"  # OpenAI-compatible (default)
```

Main Conversation Cycle: run_conversation() (~line 8103)
This is the most critical method. Here's the complete flow:
Phase 1: Pre-Loop Setup (lines 8103-8465)
1. Install safe stdio (broken pipe guard)
2. Set session context for logging
3. Restore primary runtime if fallback was active
4. Sanitize user input:
- Strip surrogate characters (invalid UTF-8 from clipboard)
- Remove leaked <memory-context> blocks
5. Build or load cached system prompt:
- First turn: _build_system_prompt() via prompt_builder.py
- Continuation: Load from SessionDB (preserves Anthropic prefix cache)
- Store snapshot in SQLite for session recovery
6. Preflight context compression:
- Check if loaded history exceeds model's context threshold
- Compress BEFORE entering main loop to prevent 4xx errors
7. Plugin hooks:
- Invoke pre_llm_call for context injection from external providers
- Inject memory/plugin context into user message (NOT system prompt)
Key insight: Memory and plugin context are injected into the user message, not the system prompt. This preserves the system prompt as a stable prefix for Anthropic's prompt caching. The system prompt is built once per session and reused for every API call.
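A minimal sketch of this injection pattern, assuming plain OpenAI-style message dicts; the helper name, tag format, and structure are illustrative rather than the actual run_agent.py code:

```python
def inject_ephemeral_context(api_messages: list[dict], memory_block: str,
                             plugin_block: str) -> list[dict]:
    """Append ephemeral context to the latest user message only.

    The system prompt (index 0) is never touched, so it remains a stable
    prefix that Anthropic/OpenRouter prompt caching can reuse.
    """
    messages = [dict(m) for m in api_messages]  # copy; never mutate history
    extra = "\n\n".join(b for b in (memory_block, plugin_block) if b)
    if not extra:
        return messages
    for msg in reversed(messages):
        if msg["role"] == "user":
            # Injected text lives only in this API payload; the session DB
            # keeps the original user message untouched.
            msg["content"] = (f"{msg['content']}\n\n<ephemeral-context>\n"
                              f"{extra}\n</ephemeral-context>")
            break
    return messages
```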
Phase 2: Main Tool-Calling Loop (lines 8465-10900)
```python
while (api_call_count < self.max_iterations and
       self.iteration_budget.remaining > 0) or self._budget_grace_call:
```

Each iteration:
Step 1: Interrupt & Budget Check (lines 8469-8490)
```python
if self._interrupt_requested:  # User sent new message mid-run
    interrupted = True
    break
api_call_count += 1
if not self.iteration_budget.consume():
    break  # Budget exhausted
```

The IterationBudget (lines 170-255) is shared across the parent agent and all subagents spawned via delegate_task. This prevents runaway delegation chains.
Step 2: Message Preparation (lines 8525-8641)
- Copy conversation history (avoid mutation)
- Inject ephemeral context into user message:
  - Memory manager prefetch results (cached once per turn)
  - Plugin hook context from `pre_llm_call`
  - These are API-call-time ONLY, never persisted to session DB
- Handle extended reasoning:
  - Extract `reasoning_content` from prior assistant messages
  - Pass `reasoning_details` for multi-turn thinking continuity
- Build effective system prompt (cached + ephemeral)
- Apply prompt caching breakpoints (Anthropic via OpenRouter)
- Sanitize: strip orphaned tool results, normalize whitespace
Step 3: API Call with Retry Loop (lines 8675-10135)
```python
while retry_count < max_retries:  # default: 3
```

A. Build Provider-Specific Request (line 8742):

```python
api_kwargs = self._build_api_kwargs(api_messages)
```

This constructs payloads specific to each api_mode:
- `chat_completions`: Standard OpenAI format
- `anthropic_messages`: Anthropic's native format with thinking blocks
- `codex_responses`: ChatGPT Codex OAuth format
- `bedrock_converse`: AWS Bedrock Converse format
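
A simplified dispatch sketch of this branching; the exact field names each adapter produces are an assumption about the respective wire formats, not the real _build_api_kwargs implementation:

```python
def build_api_kwargs(api_mode: str, model: str, messages: list[dict],
                     max_tokens: int) -> dict:
    """Reshape one conversation into the payload each protocol expects (illustrative)."""
    if api_mode == "anthropic_messages":
        # Anthropic keeps the system prompt outside the messages array.
        system, rest = messages[0]["content"], messages[1:]
        return {"model": model, "system": system, "messages": rest,
                "max_tokens": max_tokens}
    if api_mode == "bedrock_converse":
        return {"modelId": model,
                "messages": [{"role": m["role"],
                              "content": [{"text": m["content"]}]}
                             for m in messages if m["role"] != "system"]}
    if api_mode == "codex_responses":
        return {"model": model, "input": messages}
    # chat_completions: OpenAI-compatible default
    return {"model": model, "messages": messages, "max_tokens": max_tokens}
```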
B. Make Streaming API Call (lines 8806-8810):
```python
if _use_streaming:
    response = self._interruptible_streaming_api_call(api_kwargs, on_first_delta=_stop_spinner)
else:
    response = self._interruptible_api_call(api_kwargs)
```

Streaming is preferred even without display consumers because it enables:
- Health checking: 90-second stale-stream detection
- Read timeout: 60-second per-chunk timeout
- Interrupt handling: User can cancel mid-stream
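
A sketch of why streaming enables these checks: each received chunk gives the loop a chance to enforce a staleness deadline and honor cancellation. The chunk attribute, helper name, and event wiring are assumptions; the per-chunk read timeout is assumed to be configured on the HTTP client itself:

```python
import threading
import time

READ_TIMEOUT_S = 60   # per-chunk timeout (enforced by the HTTP client, assumed)
STALE_STREAM_S = 90   # abort if no text delta has arrived for this long

def drain_stream(stream, interrupted: threading.Event, on_delta) -> str:
    """Consume a streaming response defensively (illustrative).

    `stream` yields chunks with a `.delta` text attribute (assumption);
    keepalive/empty chunks still drive the loop so the checks below fire.
    """
    last_delta_at = time.monotonic()
    parts = []
    for chunk in stream:
        if interrupted.is_set():
            raise InterruptedError("user cancelled mid-stream")
        now = time.monotonic()
        if now - last_delta_at > STALE_STREAM_S:
            raise TimeoutError("stale stream: no delta for 90s")
        text = getattr(chunk, "delta", "") or ""
        if text:
            last_delta_at = now
            parts.append(text)
            on_delta(text)          # feeds stream_delta_callback
    return "".join(parts)
```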
C. Validate Response (lines 8830-8991):
- Check for None, empty output, missing fields
- Extract error metadata and codes (429, 504, 524)
- Attempt eager fallback for rate-limit/empty responses
- Jittered exponential backoff (base 5s, cap 120s)
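
The backoff mentioned above (base 5s, cap 120s) can be sketched as follows; the exact jitter scheme used in run_agent.py may differ:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    """Jittered exponential backoff: base * 2**attempt, capped at `cap`."""
    delay = min(cap, base * (2 ** attempt))
    # Jitter between 50% and 100% of the computed delay (assumed scheme).
    return random.uniform(delay * 0.5, delay)

def call_with_retry(make_call, max_retries: int = 3):
    """Retry a callable with jittered backoff between attempts."""
    for attempt in range(max_retries):
        try:
            return make_call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff_delay(attempt))
```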
D. Extract Finish Reason (lines 9023-9040):
```python
# Provider-specific normalization:
if api_mode == "anthropic_messages":
    stop_reason_map = {"end_turn": "stop", "tool_use": "tool_calls", ...}
elif api_mode == "codex_responses":
    status = getattr(response, "status", None)
else:  # chat_completions
    finish_reason = response.choices[0].finish_reason
```

E. Handle Truncation (finish_reason='length') (lines 9042-9216):
- Detect thinking-budget exhaustion (thinking blocks with no visible text)
- Request continuation for text-only truncation (up to 3 attempts)
- Roll back to last complete turn if continuation fails
F. Track Token Usage (lines 9219-9325):
```python
usage = normalize_usage(response, api_mode)
# Updates: context compressor, session DB, cost estimator
# Tracks: cache hit percentages (Anthropic/OpenRouter prompt caching)
```

Step 4: Parse Response (lines 10213-10269)
```python
if api_mode == "codex_responses":
    assistant_message, finish_reason = self._normalize_codex_response(response)
elif api_mode == "anthropic_messages":
    assistant_message, finish_reason = normalize_anthropic_response(response, ...)
else:
    assistant_message = response.choices[0].message
```

Then _build_assistant_message() (line 6723) normalizes into a common format:
- Extracts reasoning from structured fields or `<think>` blocks
- Preserves `reasoning_details` for multi-turn continuity
- Normalizes `tool_calls` with deterministic IDs
- Preserves `extra_content` (Gemini `thought_signature`)
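
A reduced sketch of this normalization, assuming OpenAI-SDK-style message objects (`content`, `reasoning_content`, `tool_calls` with `.function` attributes); the real _build_assistant_message() handles many more fields:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def build_assistant_message(raw) -> dict:
    """Normalize a provider message into a common dict (simplified)."""
    content = getattr(raw, "content", "") or ""
    reasoning = getattr(raw, "reasoning_content", None)
    if reasoning is None:
        # Fall back to inline <think> blocks emitted by some models.
        blocks = THINK_RE.findall(content)
        reasoning = "\n".join(blocks) if blocks else None
        content = THINK_RE.sub("", content).strip()
    tool_calls = [
        {"id": tc.id or f"call_{i}",          # deterministic fallback IDs
         "name": tc.function.name,
         "arguments": tc.function.arguments}
        for i, tc in enumerate(getattr(raw, "tool_calls", None) or [])
    ]
    return {"role": "assistant", "content": content,
            "reasoning": reasoning, "tool_calls": tool_calls}
```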
Step 5: Validate Tool Calls (lines 10380-10530)
A. Repair Misspelled Tool Names (lines 10389-10436):
```python
repaired_name = self._repair_tool_call(wrong_name)
# Uses fuzzy matching against registered tools
# Returns error to model for self-correction (max 3 retries)
```

B. Validate JSON Arguments (lines 10440-10527):
- Treats empty strings as `{}`
- Detects truncation vs formatting errors
- Injects recovery tool results for invalid JSON (max 3 retries)
C. Guardrails (lines 10532-10538):
- Caps `delegate_task` calls (one per turn)
- Deduplicates identical tool calls in same turn
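
A condensed sketch of the repair and validation helpers described in steps A and B above. The difflib-based fuzzy matching and the return shape are assumptions; the actual _repair_tool_call logic may differ:

```python
import difflib
import json

def repair_tool_name(name: str, registered: list[str]) -> str | None:
    """Fuzzy-match a misspelled tool name against the registry."""
    matches = difflib.get_close_matches(name, registered, n=1, cutoff=0.6)
    return matches[0] if matches else None

def parse_tool_arguments(raw: str) -> tuple[dict | None, str | None]:
    """Return (args, error); empty strings are treated as {}."""
    if not raw or not raw.strip():
        return {}, None
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # The caller injects this message as a recovery tool result so the
        # model can correct itself (up to 3 retries).
        return None, f"Invalid JSON arguments: {exc}"
```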
Step 6: Execute Tool Calls (lines 10595-10610)
```python
self._execute_tool_calls(assistant_message, messages, effective_task_id, api_call_count)
```

Dispatches to:
Concurrent execution (_execute_tool_calls_concurrent, line 7294):
- Thread pool for independent calls
- Read-only tools always parallel
- File I/O only parallel when target paths don't overlap
- Results collected in original order
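
A sketch of the concurrent path, assuming the calls passed in have already been classified as parallel-safe (read-only, or file I/O with non-overlapping target paths); `run_one` is a stand-in for the single-call executor:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_tool_calls_concurrent(tool_calls, run_one):
    """Run parallel-safe tool calls in a thread pool (illustrative).

    Results are returned in the original submission order regardless of
    which call finishes first.
    """
    if not tool_calls:
        return []
    with ThreadPoolExecutor(max_workers=min(8, len(tool_calls))) as pool:
        futures = [pool.submit(run_one, call) for call in tool_calls]
        # Iterating futures in submission order preserves the original order.
        return [f.result() for f in futures]
```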
Sequential execution (_execute_tool_calls_sequential, line 7532):
- Inline invocation with per-tool display handling
Tool invocation flow (_invoke_tool, line 7182):
1. Plugin pre-tool-call block check
2. Built-in tools: todo, session_search, memory, clarify, delegate_task
3. External memory provider tools
4. Registry-dispatched tools via handle_function_call()
Step 7: Post-Tool Processing (lines 10612-10704)
- Refund iteration budget if only `execute_code` was called (cheap RPC-style turn)
- Trigger context compression if approaching threshold
- Save session log incrementally for interruption recovery
Step 8: Final Response Delivery (lines 10706-10810)
When no tool calls remain:
- Validate response has actual content (not just `<think>` blocks)
- Partial stream recovery: use already-delivered streamed text if the connection dropped
- Prior-turn content fallback: use a housekeeping-tool-only turn's content
- Post-tool empty-response nudge: retry with a hint if the model goes silent after tools
Error Recovery Hierarchy
The agent implements a 10-level error recovery strategy, tried in order:
Level 1: Surrogate/encoding errors
→ Sanitize invalid UTF-8, retry (max 2 passes)
Level 2: Authentication (401)
→ Refresh Codex/Nous/Anthropic credentials, retry
Level 3: Thinking signature invalid (400 + reasoning_details)
→ Strip reasoning_details, retry
Level 4: Credential pool rotation (429, 403)
→ Rotate API keys from credential pool
Level 5: Context compression (413, context overflow)
→ Compress middle turns via auxiliary LLM, retry
Level 6: Output cap adjustment (context overflow with available_out)
→ Parse available output tokens from error, reduce max_tokens
Level 7: Primary transport recovery (transient errors)
→ Rebuild HTTP client once, retry
Level 8: Fallback provider activation
→ Switch to configured fallback model/provider
Level 9: Retry with backoff
→ Jittered exponential (base 5s, cap 120s)
Level 10: Abort with guidance
→ Provide actionable error message to user
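
A compressed sketch of how the first few levels might be selected from an error. The status codes follow the list above; the classification function and string heuristics are illustrative only:

```python
def pick_recovery(status: int | None, message: str) -> str:
    """Map an API error to a recovery strategy (simplified ordering)."""
    text = message.lower()
    if "surrogate" in text or "codec" in text:
        return "sanitize_and_retry"          # Level 1
    if status == 401:
        return "refresh_credentials"         # Level 2
    if status == 400 and "reasoning_details" in text:
        return "strip_reasoning_details"     # Level 3
    if status in (429, 403):
        return "rotate_credential_pool"      # Level 4
    if status == 413 or ("context" in text and "length" in text):
        return "compress_context"            # Level 5
    return "retry_with_backoff"              # Levels 7-9 fall through here
```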
Provider Adapters
Anthropic Native (agent/anthropic_adapter.py)
- OAuth tokens (`sk-ant-oat*`), API keys (`sk-ant-api*`), Claude Code credentials
- Adaptive thinking budget: `xhigh` → `max`, `high` → `high`, etc.
- Beta headers: `interleaved-thinking-2025-05-14`
- Output token limits per model: Claude 4.6 = 128K, Haiku 4.5 = 64K
- Auto-inject `cache_control` breakpoints for prompt caching
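
A minimal sketch of what a cache breakpoint looks like on the Anthropic Messages API; where the real adapter places breakpoints (system prompt, history boundaries) is more involved than shown here:

```python
def cacheable_system_prompt(system_prompt: str) -> list[dict]:
    """Mark the system prompt as a cacheable prefix (Anthropic Messages API)."""
    return [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},  # reusable prefix cache entry
    }]
```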
Chat Completions (Internal)
- Standard OpenAI-compatible API
- Works with: OpenRouter, Nous Portal, Gemini, Kimi, MiniMax, custom endpoints
- Tool calls normalized between OpenAI format and internal representation
Codex Responses (run_agent.py:4504-4941)
- ChatGPT Codex OAuth backend via `https://chatgpt.com/backend-api/codex`
- Wraps Responses API to look like chat.completions
- Preserves `call_id` and `response_item_id` for multi-turn continuity
AWS Bedrock (agent/bedrock_adapter.py)
- Bedrock Converse API for Claude models on AWS
- Auto-detected from `provider="bedrock"` or a URL containing `bedrock-runtime`
Auxiliary Client (agent/auxiliary_client.py)
Resolves side-task providers (compression, search, vision) via auto-detection:
Text: OpenRouter → Nous → Custom endpoint → Codex → Anthropic → API-key providers
Vision: Main (if capable) → OpenRouter → Nous → Codex → Anthropic → Custom
Payment fallback: auto-retries with next provider on HTTP 402 (insufficient credits).
Context Compression (trajectory_compressor.py, agent/context_compressor.py)
Strategy
```
Messages: [sys] [user1] [asst1] [tool1] ... [toolN] [userM] [asstM]
          ├── PROTECTED ───┤ ├── COMPRESSED ───┤ ├── PROTECTED ──┤
          first system,        middle turns        last N turns
          first human,         (summarized by      (recent context)
          first assistant      auxiliary LLM)
```
Configuration
- Trigger: When tokens >= 50% of model's context window (`threshold: 0.50`)
- Target: Keep 20% of threshold as recent context (`target_ratio: 0.20`)
- Protected: First system + first human + first assistant + last N turns
- Method: Summarize middle region via auxiliary LLM, replace with single human message
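
A sketch of the region split and trigger check using the ratios above. The `summarize` callable stands in for the auxiliary LLM, and the fixed `keep_last` tail is a simplification of the real sizing logic (which derives the tail from `target_ratio`):

```python
def maybe_compress(messages, used_tokens, context_window, summarize,
                   threshold=0.50, keep_last=6):
    """Compress middle turns once usage crosses `threshold` of the window."""
    if used_tokens < context_window * threshold:
        return messages                          # under the trigger: no-op
    protected_head = messages[:3]                # first system/human/assistant
    protected_tail = messages[-keep_last:]       # last N turns, kept verbatim
    middle = messages[3:-keep_last]
    if not middle:
        return messages
    summary = summarize(middle)                  # auxiliary LLM call
    return protected_head + [
        {"role": "user",
         "content": f"[Summary of earlier conversation]\n{summary}"}
    ] + protected_tail
```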
Context Probing
When context-overflow errors occur:
- Step down through probe tiers
- Cache discovered limits to disk (only confirmed values from error messages)
- Fall back to guessed tiers (in-memory only, not persisted)
Streaming & Callbacks
| Callback | Purpose |
|---|---|
| `stream_delta_callback` | Text chunks during streaming (feeds TTS pipeline) |
| `reasoning_callback` | Structured reasoning blocks |
| `thinking_callback` | Spinner animation status |
| `tool_progress_callback` | Tool execution progress |
| `interim_assistant_callback` | Intermediate messages before tools |
| `step_callback` | Per-iteration with prior tool results (gateway hooks) |
| `status_callback` | Status updates (rate limits, compression, etc.) |
Iteration Budget (run_agent.py:170-255)
```python
import threading

class IterationBudget:
    def __init__(self, total: int = 90):
        self.total = total
        self.remaining = total
        self._lock = threading.Lock()

    def consume(self) -> bool:
        with self._lock:
            if self.remaining <= 0:
                return False
            self.remaining -= 1
            return True
```

- Shared across parent agent and all subagents (via `delegate_task`)
- Each LLM turn consumes one unit
- Grace call: one extra attempt when budget hits zero (lets the model finish)
- Prevents runaway delegation chains from burning through API credits
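
Usage sketch: the parent agent and a subagent draw from the same object, so the pool is shared (the delegate wiring itself is not shown):

```python
budget = IterationBudget(total=90)

# The parent agent and every subagent spawned via delegate_task hold a
# reference to this SAME object, so all LLM turns draw from one pool.
if budget.consume():            # a parent turn
    print(budget.remaining)     # 89
if budget.consume():            # a subagent turn, same shared pool
    print(budget.remaining)     # 88
```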
Session Persistence (hermes_state.py)
SQLite Schema
```sql
CREATE TABLE sessions (
    session_id TEXT PRIMARY KEY,
    parent_session_id TEXT,   -- compression chains
    title TEXT,
    source TEXT,              -- 'cli', 'telegram', 'discord', etc.
    started_at TIMESTAMP,
    last_active TIMESTAMP,
    message_count INTEGER,
    prompt_tokens INTEGER,
    completion_tokens INTEGER,
    total_tokens INTEGER,
    estimated_cost REAL,
    actual_cost REAL
);

CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    session_id TEXT,
    role TEXT,                -- 'system', 'user', 'assistant', 'tool'
    content TEXT,
    tool_calls TEXT,          -- JSON
    tool_call_id TEXT,
    finish_reason TEXT,
    reasoning TEXT,
    created_at TIMESTAMP
);

CREATE VIRTUAL TABLE messages_fts USING fts5(content, content=messages);
```

Key Implementation Details
- WAL mode for concurrent readers + single writer (gateway multi-platform)
- `BEGIN IMMEDIATE` with jittered retry (20-150ms) to avoid SQLite convoy
- Checkpoint every 50 writes to manage WAL file growth
- FTS5 for full-text session search across all historical conversations
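
A sketch of this write path using the standard-library sqlite3 module. The retry bounds follow the 20-150ms figure above; the helper names and attempt count are illustrative:

```python
import random
import sqlite3
import time

def open_db(path: str) -> sqlite3.Connection:
    # isolation_level=None (autocommit) so explicit BEGIN IMMEDIATE works.
    conn = sqlite3.connect(path, timeout=5.0, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")   # concurrent readers, one writer
    return conn

def write_with_retry(conn: sqlite3.Connection, sql: str, params: tuple,
                     attempts: int = 5) -> None:
    """BEGIN IMMEDIATE with jittered retry to sidestep the SQLite write convoy."""
    for i in range(attempts):
        try:
            conn.execute("BEGIN IMMEDIATE")    # take the write lock up front
            conn.execute(sql, params)
            conn.commit()
            return
        except sqlite3.OperationalError:       # typically "database is locked"
            conn.rollback()
            if i == attempts - 1:
                raise
            time.sleep(random.uniform(0.020, 0.150))   # 20-150ms jitter
```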