Hermes Agent - LLM Usage Patterns & Guardrails

How LLMs Are Leveraged

Hermes Agent uses LLMs in four distinct roles, each with different requirements:

┌──────────────────────────────────────────────────────────────┐
│                     LLM USAGE ROLES                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. PRIMARY AGENT         2. AUXILIARY (SIDE TASKS)          │
│  ┌──────────────┐         ┌──────────────┐                  │
│  │ Tool calling │         │ Compression  │                  │
│  │ Reasoning    │         │ Summarization│                  │
│  │ User dialog  │         │ Vision       │                  │
│  │              │         │ Web extract  │                  │
│  │ Premium model│         │ Cheap model  │                  │
│  │ (user choice)│         │ (auto-detect)│                  │
│  └──────────────┘         └──────────────┘                  │
│                                                              │
│  3. SUBAGENT               4. RESEARCH (BATCH)              │
│  ┌──────────────┐         ┌──────────────┐                  │
│  │ Delegated    │         │ Trajectory   │                  │
│  │ tasks        │         │ generation   │                  │
│  │              │         │ RL training  │                  │
│  │ Same or      │         │ SWE bench    │                  │
│  │ different    │         │              │                  │
│  │ model        │         │ Configurable │                  │
│  └──────────────┘         └──────────────┘                  │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Provider Abstraction

Multi-Provider Architecture

Hermes supports any OpenAI-compatible endpoint plus native Anthropic, ChatGPT Codex OAuth, and AWS Bedrock:

# API modes (run_agent.py:559-657)
api_mode = "chat_completions"      # OpenAI-compatible (default)
api_mode = "anthropic_messages"    # Native Anthropic
api_mode = "codex_responses"       # ChatGPT Codex OAuth
api_mode = "bedrock_converse"      # AWS Bedrock

Provider Auto-Detection

# Simplified from run_agent.py constructor
if provider == "anthropic" or base_url.endswith("/anthropic"):
    api_mode = "anthropic_messages"
elif provider == "bedrock" or "bedrock-runtime" in base_url:
    api_mode = "bedrock_converse"
elif provider == "openai-codex":
    api_mode = "codex_responses"
else:
    api_mode = "chat_completions"

Auxiliary Client Resolution (agent/auxiliary_client.py)

Side tasks (compression, vision, search summarization) use a separate resolution chain to find the cheapest available provider:

Text tasks:   OpenRouter → Nous → Custom → Codex → Anthropic → API-key providers
Vision tasks: Main (if capable) → OpenRouter → Nous → Codex → Anthropic → Custom

Payment fallback: Auto-retries with next provider on HTTP 402 (insufficient credits).
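
To make the chain concrete, here is a minimal sketch of the resolve-then-fall-through pattern. The provider order comes from the text chain above; PaymentRequiredError, make_client(), and probe() are illustrative names, not the actual agent/auxiliary_client.py API.

# Hypothetical sketch of auxiliary resolution with 402 fallback
TEXT_CHAIN = ["openrouter", "nous", "custom", "codex", "anthropic"]

class PaymentRequiredError(Exception):
    """Stand-in for an HTTP 402 (insufficient credits) error."""

def resolve_auxiliary_client(chain, make_client):
    """Return the first provider in the chain that accepts a probe request."""
    for provider in chain:
        client = make_client(provider)
        try:
            client.probe()  # cheap request to verify keys/credits
            return client
        except PaymentRequiredError:
            continue        # auto-retry with the next provider
    raise RuntimeError("no auxiliary provider available")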

Why This Matters

The primary model might be expensive (Claude Opus, GPT-4). Context compression, web extraction, and session search summarization don't need a premium model. The auxiliary client routes these to cheaper alternatives (Gemini Flash, Haiku) automatically, reducing costs significantly.

Prompt Engineering Patterns

System Prompt Architecture (agent/prompt_builder.py)

The system prompt is a carefully structured document assembled from multiple sources:

1. Core Identity
   └── Hermes Agent identity, capabilities, behavioral guidelines

2. Memory Blocks (frozen snapshot)
   └── MEMORY.md + USER.md content (injected at session start)

3. Skills Index
   └── Compact listing of all available skills by category
   └── Two-layer cache: in-process LRU + disk snapshot

4. Platform Hints
   └── CLI/Telegram/Discord/WhatsApp formatting guidance

5. Context Files (optional)
   └── SOUL.md (personality), AGENTS.md, .cursorrules

6. Behavioral Nudges
   └── Memory save reminders (every ~10 turns)
   └── Skill creation prompts (after complex tasks)
   └── Session search guidance (for cross-session recall)

Prefix Cache Optimization

The critical insight: Anthropic's prompt caching bills previously cached prefix tokens at a steep discount, but only while the prompt prefix stays byte-identical across API calls.

API Call 1: [SYSTEM PROMPT] [user1]
API Call 2: [SYSTEM PROMPT] [user1] [asst1] [tool1] [user2]
API Call 3: [SYSTEM PROMPT] [user1] [asst1] [tool1] [user2] [asst2] [user3]
             ╰──── CACHED PREFIX (identical across calls) ────╯

Design decisions driven by this:

  1. System prompt built once per session, stored in SessionDB, reused for all turns
  2. Memory snapshot frozen at session start (writes don't change system prompt)
  3. Dynamic context injected into user message, not system prompt
  4. Skills index cached to disk with content-hash validation
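
As a sketch of how decision 1 pairs with the caching API: the cache_control block is the real Anthropic Messages API mechanism, while build_system_prompt(), session, conversation, and client are assumed names for illustration.

system_blocks = [
    {
        "type": "text",
        "text": build_system_prompt(session),    # built once per session (decision 1)
        "cache_control": {"type": "ephemeral"},  # cache everything up to this block
    }
]
response = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=4096,
    system=system_blocks,
    messages=conversation,  # only this part changes between calls
)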

Ephemeral Context Injection

Memory and plugin context are injected into the user message at API-call time:

# run_agent.py:8525-8641 (simplified)
effective_user_message = user_message
 
# Memory provider context (if any)
if memory_context:
    effective_user_message = f"<memory-context>{memory_context}</memory-context>\n{user_message}"
 
# Plugin hook context
if plugin_context:
    effective_user_message = f"<plugin-context>{plugin_context}</plugin-context>\n{effective_user_message}"

These are API-call-time only, never persisted to session DB.

Context Window Management

Proactive Compression

Context fills up:
    │
    ├── Threshold check: tokens >= 50% of context window?
    │   └── Configurable: config.yaml → compression.threshold: 0.50
    │
    ├── Before main loop (preflight):
    │   └── Compress if loaded history already exceeds threshold
    │
    └── After each tool execution:
        └── Check should_compress() and trigger if needed
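
A minimal version of the trigger check, assuming a token_count() helper and the 0.50 default from config.yaml; the real should_compress() takes more inputs.

def should_compress(messages, context_window, threshold=0.50):
    """Trigger compression once usage crosses the configured fraction."""
    return token_count(messages) >= context_window * threshold  # token_count() is an assumed helper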

Compression Strategy

Messages: [sys] [user1] [asst1] [tool1] ... [toolN] [userM] [asstM]
           ├── PROTECTED ──┤  ├── COMPRESSED ──┤  ├── PROTECTED ──┤

Protected:  First system + first human + first assistant + last N turns
Compressed: Middle turns summarized by auxiliary LLM
Target:     Keep 20% of threshold as recent context
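
The protect/compress split can be sketched as a simple slice, with summarize() standing in for the auxiliary-LLM call; the real boundaries are turn-aware rather than index-based.

def split_for_compression(messages, keep_last=4):
    protected_head = messages[:3]            # first system + human + assistant
    protected_tail = messages[-keep_last:]   # last N turns stay verbatim
    middle = messages[3:-keep_last]          # summarized by the auxiliary LLM
    summary = {"role": "assistant", "content": summarize(middle)}  # summarize() is illustrative
    return protected_head + [summary] + protected_tail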

Context Probing

When context-overflow errors occur:

Error 413 or context_overflow
    │
    ├── Parse available token count from error message
    │   └── If found: cache to disk (confirmed value)
    │
    ├── Step down probe tiers
    │   └── In-memory only (not persisted)
    │
    └── Trigger compression with adjusted limits
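
Parsing the advertised limit out of an overflow error might look like this; the regex targets the common "maximum context length" phrasing and will not match every provider, which is why the probe tiers exist as a fallback.

import re

def parse_context_limit(error_message):
    """Extract the confirmed token limit from a context-overflow error, if present."""
    m = re.search(r"maximum context length is (\d+)", error_message)
    return int(m.group(1)) if m else None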

Token Usage & Cost Tracking

Usage Normalization (run_agent.py:9219-9325)

def normalize_usage(response, api_mode):
    """Extract token counts across all providers into common format."""
    return {
        "prompt_tokens": ...,
        "completion_tokens": ...,
        "total_tokens": ...,
        "cache_read_tokens": ...,     # Anthropic/OpenRouter
        "cache_creation_tokens": ..., # Anthropic
        "reasoning_tokens": ...,      # O1/O3 models
    }
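
For illustration, here is how the extraction differs for two of the modes, using the public OpenAI and Anthropic usage field names on an already-extracted usage dict; the actual normalize_usage() covers all four modes plus more edge cases.

def normalize_usage_sketch(usage, api_mode):
    if api_mode == "anthropic_messages":
        return {
            "prompt_tokens": usage["input_tokens"],
            "completion_tokens": usage["output_tokens"],
            "cache_read_tokens": usage.get("cache_read_input_tokens", 0),
            "cache_creation_tokens": usage.get("cache_creation_input_tokens", 0),
        }
    # chat_completions (OpenAI-compatible)
    return {
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }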

Cache Hit Tracking

# Track cache performance for Anthropic prompt caching
cache_hit_pct = (cache_read_tokens / prompt_tokens * 100) if prompt_tokens else 0.0
# Reported in /usage and /insights

Cost Estimation

estimate_usage_cost(usage, model, provider)
# Per-model pricing tables for accurate cost tracking
# Persisted to SessionDB for /insights analytics
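
The per-model lookup reduces to a table join like the sketch below; the pricing values here are placeholders, not the agent's real tables.

# Hypothetical pricing table, USD per million tokens (placeholder values)
PRICING = {
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "gemini-flash":      {"input": 0.10, "output": 0.40},
}

def estimate_usage_cost_sketch(usage, model):
    p = PRICING[model]
    return (usage["prompt_tokens"] * p["input"]
            + usage["completion_tokens"] * p["output"]) / 1_000_000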

Tool Call Handling

Schema Format

Tools are provided to the LLM in OpenAI function-calling format:

{
    "type": "function",
    "function": {
        "name": "terminal",
        "description": "Execute a shell command",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string"},
                "timeout": {"type": "integer"}
            },
            "required": ["command"]
        }
    }
}

Tool Call Validation & Repair

The agent implements three layers of validation before executing tool calls:

1. Tool Name Repair (run_agent.py:10389-10436):

def _repair_tool_call(self, wrong_name):
    """Fuzzy match against registered tools.
    Returns best match or None (max 3 retries)."""
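
One plausible implementation of the fuzzy match uses difflib from the standard library; the real repair logic and cutoff may differ.

import difflib

def repair_tool_name(wrong_name, registered_tools):
    """Return the closest registered tool name, or None if nothing is close."""
    matches = difflib.get_close_matches(wrong_name, registered_tools, n=1, cutoff=0.6)
    return matches[0] if matches else None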

2. JSON Argument Validation (run_agent.py:10440-10527):

# Empty string → {} (common model quirk)
# Truncated JSON → detect and request continuation
# Malformed JSON → inject error for self-correction (max 3 retries)
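
A condensed sketch of those recovery rules (truncation detection omitted); returning None signals the caller to inject an error message for self-correction.

import json

def parse_tool_arguments(raw):
    if not raw.strip():
        return {}                    # empty string → {} (common model quirk)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None                  # malformed → inject error, let the model retry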

3. Behavioral Guardrails (run_agent.py:10532-10538):

# Cap delegate_task: one per turn
# Deduplicate identical tool calls in same turn

Parallel vs Sequential Execution

def _should_parallelize_tool_batch(self, tool_calls):
    # Single tool → sequential
    # Multiple read-only tools → concurrent
    # File I/O → concurrent only if paths don't overlap
    # Destructive commands (rm, mv) → always sequential
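
As a runnable sketch of those rules, with is_destructive(), is_read_only(), and paths_overlap() as assumed predicates:

def should_parallelize(tool_calls):
    if len(tool_calls) <= 1:
        return False                          # single tool → sequential
    if any(is_destructive(tc) for tc in tool_calls):
        return False                          # rm/mv etc. → always sequential
    if all(is_read_only(tc) for tc in tool_calls):
        return True                           # pure reads → concurrent
    return not paths_overlap(tool_calls)      # file I/O: disjoint paths only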

Reasoning / Thinking Support

Extended Thinking (Anthropic)

# agent/anthropic_adapter.py
# Adaptive thinking budget mapping:
#   xhigh → max
#   high  → high
#   medium → medium
#   low → low
#   minimal → none
 
# Beta headers for interleaved thinking:
headers = {"anthropic-beta": "interleaved-thinking-2025-05-14"}
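
Enabling extended thinking on the Messages API looks roughly like this; the budget value is illustrative (picked per the level mapping above), and max_tokens must exceed it. The client and conversation names are assumed.

response = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=8192,
    thinking={"type": "enabled", "budget_tokens": 4096},  # illustrative budget
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    messages=conversation,
)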

<think> Block Extraction

For models that use XML-style thinking (DeepSeek, Qwen):

# run_agent.py:_build_assistant_message (line 6723)
# Extract <think>...</think> blocks from response
# Store as reasoning_content (separate from visible response)
# Preserve reasoning_details for multi-turn continuity
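
A self-contained version of the extraction; the real _build_assistant_message likely handles additional edge cases such as streaming chunks and unclosed tags.

import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text):
    """Separate <think> blocks (reasoning_content) from the visible response."""
    reasoning = "\n".join(THINK_RE.findall(text))
    visible = THINK_RE.sub("", text).strip()
    return reasoning, visible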

Reasoning Details Preservation

For multi-turn conversations with thinking models, the reasoning details extracted above are stored alongside the assistant message and replayed on subsequent API calls, so the model retains its chain of thought across turns.
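A hedged sketch of the replay: the stored details are attached back onto the assistant message on the next call. The reasoning_details field name follows the OpenRouter-style message shape and may differ per provider.

assistant_msg = {
    "role": "assistant",
    "content": visible_text,                         # from split_thinking() above
    "reasoning_details": stored_reasoning_details,   # passed back verbatim
}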

Guardrails & Safety

LLM Output Guardrails

Guardrail                 Purpose                                    Location
────────────────────────  ─────────────────────────────────────────  ───────────────────────────────
Iteration budget          Prevent infinite tool loops                IterationBudget (line 170)
Delegation depth limit    Prevent recursive delegation               MAX_DEPTH=2 in delegate_tool.py
Tool call deduplication   Prevent redundant executions               line 10532
Empty response detection  Detect and retry silent completions        line 10769
Truncation handling       Detect finish_reason=length and continue   line 9042
Invalid tool repair       Auto-fix misspelled tool names             line 10389
JSON validation           Parse and recover malformed arguments      line 10440

Input Guardrails

Guardrail                 Purpose                                    Location
────────────────────────  ─────────────────────────────────────────  ──────────────────────
Surrogate stripping       Remove invalid UTF-8                       line 8145
Memory block stripping    Remove leaked <memory-context>             line 8155
Content scanning          Tirith security scanner                    tirith_security.py
SSRF prevention           Block private IPs in URLs                  url_safety.py
Memory injection defense  Block prompt injection in memory writes    memory_tool.py:65-81

Output Guardrails

Guardrail                   Purpose                                 Location
──────────────────────────  ──────────────────────────────────────  ────────────────
Tool result truncation      Per-tool max_result_size_chars          registry.py:315
Dangerous command approval  User confirmation for destructive ops   approval.py
Credential filtering        Strip API keys from subprocess env      local.py
Path security               Block writes to system paths            file_tools.py:94

Multi-Provider Fallback Strategy

Fallback Chain

Primary provider fails (429, 500, timeout)
    │
    ├── Step 1: Retry with same provider (up to 3x)
    │   └── Jittered exponential backoff (5s base, 120s cap)
    │
    ├── Step 2: Credential pool rotation
    │   └── Try next API key from pool (if configured)
    │
    ├── Step 3: Primary transport recovery
    │   └── Rebuild HTTP client once (connection issues)
    │
    └── Step 4: Activate fallback provider
        └── Switch to configured fallback model/provider
        └── Stay on fallback until primary recovers
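
The Step 1 backoff follows the standard full-jitter exponential form, using the base and cap values from the chain above:

import random

def backoff_delay(attempt, base=5.0, cap=120.0):
    """Full-jitter exponential backoff: 5s base, 120s cap."""
    return random.uniform(0, min(cap, base * 2 ** attempt))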

Eager Fallback

For rate-limited/empty responses, Hermes can switch to fallback immediately without exhausting retries. This minimizes user-visible latency.

Credential Pool Rotation

# config.yaml
openrouter:
  api_keys:
    - "sk-or-key1"
    - "sk-or-key2"
    - "sk-or-key3"

On 429/403, the next key in the pool is tried before escalating to backoff.
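
Rotation-before-backoff reduces to a loop over the pool; RateLimitError here stands in for whatever the HTTP client actually raises on 429/403.

class RateLimitError(Exception):
    """Stand-in for the client's 429/403 error type."""

def send_with_key_rotation(keys, send):
    last_exc = None
    for key in keys:
        try:
            return send(api_key=key)   # first key that succeeds wins
        except RateLimitError as exc:
            last_exc = exc             # 429/403 → rotate to the next key
    raise last_exc                     # pool exhausted → escalate to backoff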

Model-Specific Adaptations

Per-Model Output Limits

# agent/anthropic_adapter.py
OUTPUT_LIMITS = {
    "claude-opus-4.6":   128_000,
    "claude-sonnet-4.6": 128_000,
    "claude-haiku-4.5":   64_000,
    # ...
}

Per-Model Context Windows

Auto-detected from OpenRouter model metadata or provider error messages. Cached to disk for confirmed values.

Provider-Specific Headers

# OpenRouter: prompt caching, data policy, provider routing
# Anthropic: interleaved thinking beta, extended output
# Bedrock: model-specific content type
# Codex: OAuth token, session management

Batch & Research Patterns

Trajectory Generation (batch_runner.py)

# Parallel batch processing for training data
batch_runner.py --dataset_file=data.jsonl --batch_size=10
 
# Toolset distributions for variability
batch_runner.py --distribution=mixed_tasks

Each run captures complete trajectories in the Hermes trajectory format.

SWE Benchmark (mini_swe_runner.py)

Minimal agent with single terminal tool, outputting Hermes trajectory format. Compatible with standard SWE evaluation harnesses.

RL Training (rl_cli.py)

Integration with Atropos RL environments.