Hermes Agent - LLM Usage Patterns & Guardrails
How LLMs Are Leveraged
Hermes Agent uses LLMs in four distinct roles, each with different requirements:
┌──────────────────────────────────────────────────────────────┐
│                       LLM USAGE ROLES                        │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. PRIMARY AGENT           2. AUXILIARY (SIDE TASKS)        │
│  ┌──────────────┐           ┌──────────────┐                 │
│  │ Tool calling │           │ Compression  │                 │
│  │ Reasoning    │           │ Summarization│                 │
│  │ User dialog  │           │ Vision       │                 │
│  │              │           │ Web extract  │                 │
│  │ Premium model│           │ Cheap model  │                 │
│  │ (user choice)│           │ (auto-detect)│                 │
│  └──────────────┘           └──────────────┘                 │
│                                                              │
│  3. SUBAGENT                4. RESEARCH (BATCH)              │
│  ┌──────────────┐           ┌──────────────┐                 │
│  │ Delegated    │           │ Trajectory   │                 │
│  │ tasks        │           │ generation   │                 │
│  │              │           │ RL training  │                 │
│  │ Same or      │           │ SWE bench    │                 │
│  │ different    │           │              │                 │
│  │ model        │           │ Configurable │                 │
│  └──────────────┘           └──────────────┘                 │
│                                                              │
└──────────────────────────────────────────────────────────────┘
Provider Abstraction
Multi-Provider Architecture
Hermes supports any OpenAI-compatible endpoint plus native Anthropic and AWS Bedrock:
# API modes (run_agent.py:559-657)
api_mode = "chat_completions" # OpenAI-compatible (default)
api_mode = "anthropic_messages" # Native Anthropic
api_mode = "codex_responses" # ChatGPT Codex OAuth
api_mode = "bedrock_converse" # AWS BedrockProvider Auto-Detection
# Simplified from run_agent.py constructor
if provider == "anthropic" or base_url.endswith("/anthropic"):
    api_mode = "anthropic_messages"
elif provider == "bedrock" or "bedrock-runtime" in base_url:
    api_mode = "bedrock_converse"
elif provider == "openai-codex":
    api_mode = "codex_responses"
else:
    api_mode = "chat_completions"
Auxiliary Client Resolution (agent/auxiliary_client.py)
Side tasks (compression, vision, search summarization) use a separate resolution chain to find the cheapest available provider:
Text tasks: OpenRouter → Nous → Custom → Codex → Anthropic → API-key providers
Vision tasks: Main (if capable) → OpenRouter → Nous → Codex → Anthropic → Custom
Payment fallback: Auto-retries with next provider on HTTP 402 (insufficient credits).
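A minimal sketch of that resolution pattern, assuming a caller-supplied `call` function; the real chain in agent/auxiliary_client.py differs in detail:
import httpx

# Provider order for text side tasks (from the chain above).
TEXT_CHAIN = ["openrouter", "nous", "custom", "codex", "anthropic"]

def resolve_with_fallback(chain, call):
    """Try providers in order; on HTTP 402 (insufficient credits),
    fall through to the next provider instead of failing."""
    last = None
    for provider in chain:
        try:
            return call(provider)
        except httpx.HTTPStatusError as exc:
            if exc.response.status_code == 402:
                last = exc
                continue  # payment fallback: try the next provider
            raise
    raise RuntimeError("all auxiliary providers exhausted") from last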
Why This Matters
The primary model might be expensive (Claude Opus, GPT-4). Context compression, web extraction, and session search summarization don't need a premium model. The auxiliary client routes these to cheaper alternatives (Gemini Flash, Haiku) automatically, reducing costs significantly.
Prompt Engineering Patterns
System Prompt Architecture (agent/prompt_builder.py)
The system prompt is a carefully structured document assembled from multiple sources:
1. Core Identity
└── Hermes Agent identity, capabilities, behavioral guidelines
2. Memory Blocks (frozen snapshot)
└── MEMORY.md + USER.md content (injected at session start)
3. Skills Index
└── Compact listing of all available skills by category
└── Two-layer cache: in-process LRU + disk snapshot
4. Platform Hints
└── CLI/Telegram/Discord/WhatsApp formatting guidance
5. Context Files (optional)
└── SOUL.md (personality), AGENTS.md, .cursorrules
6. Behavioral Nudges
└── Memory save reminders (every ~10 turns)
└── Skill creation prompts (after complex tasks)
└── Session search guidance (for cross-session recall)
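A sketch of how these six layers might be concatenated; `build_system_prompt` and its parameters are illustrative, not the actual prompt_builder.py API:
# Hypothetical assembly following the layering above.
def build_system_prompt(identity, memory_md, user_md, skills_index,
                        platform_hint, context_files, nudges):
    parts = [
        identity,                                        # 1. core identity
        f"<memory>\n{memory_md}\n{user_md}\n</memory>",  # 2. frozen snapshot
        skills_index,                                    # 3. compact skills listing
        platform_hint,                                   # 4. per-platform formatting
        *context_files,                                  # 5. SOUL.md, AGENTS.md, ...
        *nudges,                                         # 6. behavioral nudges
    ]
    return "\n\n".join(p for p in parts if p)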
Prefix Cache Optimization
The critical insight: Anthropic's prompt caching only discounts input tokens when the prompt prefix is byte-identical across API calls.
API Call 1: [SYSTEM PROMPT] [user1]
API Call 2: [SYSTEM PROMPT] [user1] [asst1] [tool1] [user2]
API Call 3: [SYSTEM PROMPT] [user1] [asst1] [tool1] [user2] [asst2] [user3]
            ╰── CACHED PREFIX (identical across calls) ──╯
Design decisions driven by this:
- System prompt built once per session, stored in SessionDB, reused for all turns
- Memory snapshot frozen at session start (writes don't change system prompt)
- Dynamic context injected into user message, not system prompt
- Skills index cached to disk with content-hash validation
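For illustration, this is how a stable prefix can be marked with Anthropic's `cache_control` breakpoint (a sketch, not Hermes' actual call site; `system_prompt` and `history` stand in for session state):
import anthropic

client = anthropic.Anthropic()
system_prompt = "..."  # built once per session, byte-identical on every call
history = [{"role": "user", "content": "hello"}]  # grows each turn

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; any caching-capable model
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }],
    messages=history,  # the cached prefix is read back at a discounted rate
)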
Ephemeral Context Injection
Memory and plugin context are injected into the user message at API-call time:
# run_agent.py:8525-8641 (simplified)
effective_user_message = user_message

# Memory provider context (if any)
if memory_context:
    effective_user_message = f"<memory-context>{memory_context}</memory-context>\n{user_message}"

# Plugin hook context
if plugin_context:
    effective_user_message = f"<plugin-context>{plugin_context}</plugin-context>\n{effective_user_message}"
These are API-call-time only, never persisted to session DB.
Context Window Management
Proactive Compression
Context fills up:
│
├── Threshold check: tokens >= 50% of context window?
│ └── Configurable: config.yaml → compression.threshold: 0.50
│
├── Before main loop (preflight):
│ └── Compress if loaded history already exceeds threshold
│
└── After each tool execution:
└── Check should_compress() and trigger if needed
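A minimal sketch of that check, assuming token counts are already known (names illustrative):
def should_compress(used_tokens: int, context_window: int,
                    threshold: float = 0.50) -> bool:
    """True once the conversation occupies >= threshold of the window
    (threshold comes from config.yaml -> compression.threshold)."""
    return used_tokens >= context_window * threshold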
Compression Strategy
Messages: [sys] [user1] [asst1] [tool1] ... [toolN] [userM] [asstM]
          ├──── PROTECTED ────┤ ├── COMPRESSED ───┤ ├─ PROTECTED ─┤
Protected: First system + first human + first assistant + last N turns
Compressed: Middle turns summarized by auxiliary LLM
Target: Keep 20% of threshold as recent context
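An illustrative partition; the real boundaries are token-based (20% of threshold kept as recent context) rather than a fixed turn count:
def partition_for_compression(messages: list, keep_recent: int = 6):
    """Split into protected head, compressible middle, protected tail."""
    if len(messages) <= 3 + keep_recent:
        return messages, [], []           # nothing worth compressing
    head = messages[:3]                   # first system + human + assistant
    tail = messages[-keep_recent:]        # recent turns stay verbatim
    middle = messages[3:-keep_recent]     # summarized by the auxiliary LLM
    return head, middle, tail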
Context Probing
When context-overflow errors occur:
Error 413 or context_overflow
│
├── Parse available token count from error message
│ └── If found: cache to disk (confirmed value)
│
├── Step down probe tiers
│ └── In-memory only (not persisted)
│
└── Trigger compression with adjusted limits
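A sketch of the error-parsing step; the regex matches one common provider wording and is illustrative only:
import re

def parse_context_limit(error_message: str) -> int | None:
    m = re.search(r"maximum context length is (\d+) tokens", error_message)
    return int(m.group(1)) if m else None  # confirmed value -> cache to disk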
Token Usage & Cost Tracking
Usage Normalization (run_agent.py:9219-9325)
def normalize_usage(response, api_mode):
    """Extract token counts across all providers into common format."""
    return {
        "prompt_tokens": ...,
        "completion_tokens": ...,
        "total_tokens": ...,
        "cache_read_tokens": ...,      # Anthropic/OpenRouter
        "cache_creation_tokens": ...,  # Anthropic
        "reasoning_tokens": ...,       # O1/O3 models
    }
Cache Hit Tracking
# Track cache performance for Anthropic prompt caching
cache_hit_pct = cache_read_tokens / prompt_tokens * 100
# Reported in /usage and /insights
Cost Estimation
estimate_usage_cost(usage, model, provider)
# Per-model pricing tables for accurate cost tracking
# Persisted to SessionDB for /insights analytics
Tool Call Handling
Schema Format
Tools are provided to the LLM in OpenAI function-calling format:
{
  "type": "function",
  "function": {
    "name": "terminal",
    "description": "Execute a shell command",
    "parameters": {
      "type": "object",
      "properties": {
        "command": {"type": "string"},
        "timeout": {"type": "integer"}
      },
      "required": ["command"]
    }
  }
}
Tool Call Validation & Repair
The agent implements three layers of validation before executing tool calls:
1. Tool Name Repair (run_agent.py:10389-10436):
def _repair_tool_call(self, wrong_name):
    """Fuzzy match against registered tools.
    Returns best match or None (max 3 retries)."""

2. JSON Argument Validation (run_agent.py:10440-10527):
# Empty string → {} (common model quirk)
# Truncated JSON → detect and request continuation
# Malformed JSON → inject error for self-correction (max 3 retries)

3. Behavioral Guardrails (run_agent.py:10532-10538):
# Cap delegate_task: one per turn
# Deduplicate identical tool calls in same turn
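One way to implement the layer-1 fuzzy match, using difflib (illustrative; the actual matcher in _repair_tool_call may differ):
import difflib

def repair_tool_name(wrong_name: str, registered: list[str]) -> str | None:
    """Return the closest registered tool name, or None if nothing is close."""
    matches = difflib.get_close_matches(wrong_name, registered, n=1, cutoff=0.6)
    return matches[0] if matches else None

# repair_tool_name("terminl", ["terminal", "file_read"])  ->  "terminal"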
Parallel vs Sequential Execution
def _should_parallelize_tool_batch(self, tool_calls):
    # Single tool → sequential
    # Multiple read-only tools → concurrent
    # File I/O → concurrent only if paths don't overlap
    # Destructive commands (rm, mv) → always sequential
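A sketch of how that decision could drive execution, with a caller-supplied `execute` coroutine (illustrative, not Hermes' dispatcher):
import asyncio

async def run_tool_batch(tool_calls, execute, parallel_ok: bool):
    """Run read-only batches concurrently, everything else in order."""
    if parallel_ok and len(tool_calls) > 1:
        return await asyncio.gather(*(execute(tc) for tc in tool_calls))
    return [await execute(tc) for tc in tool_calls]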
Reasoning / Thinking Support
Extended Thinking (Anthropic)
# agent/anthropic_adapter.py
# Adaptive thinking budget mapping:
# xhigh → max
# high → high
# medium → medium
# low → low
# minimal → none
# Beta headers for interleaved thinking:
headers = {"anthropic-beta": "interleaved-thinking-2025-05-14"}<think> Block Extraction
For models that use XML-style thinking (DeepSeek, Qwen):
# run_agent.py:_build_assistant_message (line 6723)
# Extract <think>...</think> blocks from response
# Store as reasoning_content (separate from visible response)
# Preserve reasoning_details for multi-turn continuity
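An illustrative extraction routine (the real logic in _build_assistant_message may differ):
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Separate <think> blocks from the visible response."""
    reasoning = "\n".join(THINK_RE.findall(text)).strip()
    visible = THINK_RE.sub("", text).strip()
    return reasoning, visible  # stored as reasoning_content / response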
Reasoning Details Preservation
For multi-turn conversations with thinking models:
- reasoning_details from prior turns are carried forward
- Enables models to build on previous reasoning chains
- Provider-specific: Moonshot, Novita, OpenRouter pass reasoning_content
Guardrails & Safety
LLM Output Guardrails
| Guardrail | Purpose | Location |
|---|---|---|
| Iteration budget | Prevent infinite tool loops | IterationBudget (line 170) |
| Delegation depth limit | Prevent recursive delegation | MAX_DEPTH=2 in delegate_tool.py |
| Tool call deduplication | Prevent redundant executions | line 10532 |
| Empty response detection | Detect and retry silent completions | line 10769 |
| Truncation handling | Detect finish_reason=length and continue | line 9042 |
| Invalid tool repair | Auto-fix misspelled tool names | line 10389 |
| JSON validation | Parse and recover malformed arguments | line 10440 |
Input Guardrails
| Guardrail | Purpose | Location |
|---|---|---|
| Surrogate stripping | Remove invalid UTF-8 | line 8145 |
| Memory block stripping | Remove leaked <memory-context> | line 8155 |
| Content scanning | Tirith security scanner | tirith_security.py |
| SSRF prevention | Block private IPs in URLs | url_safety.py |
| Memory injection defense | Block prompt injection in memory writes | memory_tool.py:65-81 |
Output Guardrails
| Guardrail | Purpose | Location |
|---|---|---|
| Tool result truncation | Per-tool max_result_size_chars | registry.py:315 |
| Dangerous command approval | User confirmation for destructive ops | approval.py |
| Credential filtering | Strip API keys from subprocess env | local.py |
| Path security | Block writes to system paths | file_tools.py:94 |
Multi-Provider Fallback Strategy
Fallback Chain
Primary provider fails (429, 500, timeout)
│
├── Step 1: Retry with same provider (up to 3x)
│ └── Jittered exponential backoff (5s base, 120s cap)
│
├── Step 2: Credential pool rotation
│ └── Try next API key from pool (if configured)
│
├── Step 3: Primary transport recovery
│ └── Rebuild HTTP client once (connection issues)
│
└── Step 4: Activate fallback provider
└── Switch to configured fallback model/provider
└── Stay on fallback until primary recovers
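Step 1's backoff can be sketched as full-jittered exponential backoff with the stated constants (the helper name is illustrative):
import random

# Full-jitter exponential backoff: 5 s base, doubling per attempt, 120 s cap.
def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    return random.uniform(0, min(cap, base * (2 ** attempt)))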
Eager Fallback
For rate-limited/empty responses, Hermes can switch to fallback immediately without exhausting retries. This minimizes user-visible latency.
Credential Pool Rotation
# config.yaml
openrouter:
  api_keys:
    - "sk-or-key1"
    - "sk-or-key2"
    - "sk-or-key3"
On 429/403, the next key in the pool is tried before escalating to backoff.
Model-Specific Adaptations
Per-Model Output Limits
# agent/anthropic_adapter.py
OUTPUT_LIMITS = {
    "claude-opus-4.6": 128_000,
    "claude-sonnet-4.6": 128_000,
    "claude-haiku-4.5": 64_000,
    # ...
}
Per-Model Context Windows
Auto-detected from OpenRouter model metadata or provider error messages. Cached to disk for confirmed values.
Provider-Specific Headers
# OpenRouter: prompt caching, data policy, provider routing
# Anthropic: interleaved thinking beta, extended output
# Bedrock: model-specific content type
# Codex: OAuth token, session management
Batch & Research Patterns
Trajectory Generation (batch_runner.py)
# Parallel batch processing for training data
batch_runner.py --dataset_file=data.jsonl --batch_size=10
# Toolset distributions for variability
batch_runner.py --distribution=mixed_tasks
Each trajectory captures:
- System prompt, user messages, assistant responses
- Tool calls and their results
- Token usage and timing
- Finish reasons and error states
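A hypothetical record shape matching that list (field names illustrative):
trajectory = {
    "messages": [...],        # system prompt, user messages, assistant responses
    "tool_calls": [...],      # tool calls plus their results
    "usage": {"prompt_tokens": 0, "completion_tokens": 0},
    "timing": {"total_s": 0.0},
    "finish_reason": "stop",  # or "length", error states, etc.
}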
SWE Benchmark (mini_swe_runner.py)
A minimal agent with a single terminal tool that outputs the Hermes trajectory format; compatible with standard SWE evaluation harnesses.
RL Training (rl_cli.py)
Integration with Atropos RL environments:
- Extended timeouts for training
- RL-focused system prompts
- Wandb tracking
- Tinker integration