Hermes Agent - LLM Usage Patterns & Guardrails
How LLMs Are Leveraged
Hermes Agent uses LLMs in four distinct roles, each with different requirements:
┌──────────────────────────────────────────────────────────────┐
│                       LLM USAGE ROLES                        │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. PRIMARY AGENT           2. AUXILIARY (SIDE TASKS)        │
│  ┌──────────────┐           ┌──────────────┐                 │
│  │ Tool calling │           │ Compression  │                 │
│  │ Reasoning    │           │ Summarization│                 │
│  │ User dialog  │           │ Vision       │                 │
│  │              │           │ Web extract  │                 │
│  │ Premium model│           │ Cheap model  │                 │
│  │ (user choice)│           │ (auto-detect)│                 │
│  └──────────────┘           └──────────────┘                 │
│                                                              │
│  3. SUBAGENT                4. RESEARCH (BATCH)              │
│  ┌──────────────┐           ┌──────────────┐                 │
│  │ Delegated    │           │ Trajectory   │                 │
│  │ tasks        │           │ generation   │                 │
│  │              │           │ RL training  │                 │
│  │ Same or      │           │ SWE bench    │                 │
│  │ different    │           │              │                 │
│  │ model        │           │ Configurable │                 │
│  └──────────────┘           └──────────────┘                 │
│                                                              │
└──────────────────────────────────────────────────────────────┘
Provider Abstraction
Multi-Provider Architecture
Hermes supports any OpenAI-compatible endpoint plus native Anthropic and AWS Bedrock:
# API modes (run_agent.py:559-657)
api_mode = "chat_completions" # OpenAI-compatible (default)
api_mode = "anthropic_messages" # Native Anthropic
api_mode = "codex_responses" # ChatGPT Codex OAuth
api_mode = "bedrock_converse" # AWS BedrockProvider Auto-Detection
# Simplified from run_agent.py constructor
if provider == "anthropic" or base_url.endswith("/anthropic"):
    api_mode = "anthropic_messages"
elif provider == "bedrock" or "bedrock-runtime" in base_url:
    api_mode = "bedrock_converse"
elif provider == "openai-codex":
    api_mode = "codex_responses"
else:
    api_mode = "chat_completions"
Auxiliary Client Resolution (agent/auxiliary_client.py)
Side tasks (compression, vision, search summarization) use a separate resolution chain to find the cheapest available provider:
Text tasks: OpenRouter → Nous → Custom → Codex → Anthropic → API-key providers
Vision tasks: Main (if capable) → OpenRouter → Nous → Codex → Anthropic → Custom
Payment fallback: Auto-retries with next provider on HTTP 402 (insufficient credits).
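A minimal sketch of that resolution pattern, assuming a caller-supplied `call` function; the real chain in agent/auxiliary_client.py differs in detail:
import httpx

# Provider order for text side tasks (from the chain above).
TEXT_CHAIN = ["openrouter", "nous", "custom", "codex", "anthropic"]

def resolve_with_fallback(chain, call):
    """Try providers in order; on HTTP 402 (insufficient credits),
    fall through to the next provider instead of failing."""
    last = None
    for provider in chain:
        try:
            return call(provider)
        except httpx.HTTPStatusError as exc:
            if exc.response.status_code == 402:
                last = exc
                continue  # payment fallback: try the next provider
            raise
    raise RuntimeError("all auxiliary providers exhausted") from last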
Why This Matters
The primary model might be expensive (Claude Opus, GPT-4). Context compression, web extraction, and session search summarization don't need a premium model. The auxiliary client routes these to cheaper alternatives (Gemini Flash, Haiku) automatically, reducing costs significantly.
Prompt Engineering Patterns
System Prompt Architecture (agent/prompt_builder.py)
The system prompt is a carefully structured document assembled from multiple sources:
1. Core Identity
└── Hermes Agent identity, capabilities, behavioral guidelines
2. Memory Blocks (frozen snapshot)
└── MEMORY.md + USER.md content (injected at session start)
3. Skills Index
└── Compact listing of all available skills by category
└── Two-layer cache: in-process LRU + disk snapshot
4. Platform Hints
└── CLI/Telegram/Discord/WhatsApp formatting guidance
5. Context Files (optional)
└── SOUL.md (personality), AGENTS.md, .cursorrules
6. Behavioral Nudges
└── Memory save reminders (every ~10 turns)
└── Skill creation prompts (after complex tasks)
└── Session search guidance (for cross-session recall)
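A sketch of how these six layers might be concatenated; `build_system_prompt` and its parameters are illustrative, not the actual prompt_builder.py API:
# Hypothetical assembly following the layering above.
def build_system_prompt(identity, memory_md, user_md, skills_index,
                        platform_hint, context_files, nudges):
    parts = [
        identity,                                        # 1. core identity
        f"<memory>\n{memory_md}\n{user_md}\n</memory>",  # 2. frozen snapshot
        skills_index,                                    # 3. compact skills listing
        platform_hint,                                   # 4. per-platform formatting
        *context_files,                                  # 5. SOUL.md, AGENTS.md, ...
        *nudges,                                         # 6. behavioral nudges
    ]
    return "\n\n".join(p for p in parts if p)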
Prefix Cache Optimization
The critical insight: Anthropic's prompt caching only discounts input tokens when the prompt prefix is byte-identical across API calls.
API Call 1: [SYSTEM PROMPT] [user1]
API Call 2: [SYSTEM PROMPT] [user1] [asst1] [tool1] [user2]
API Call 3: [SYSTEM PROMPT] [user1] [asst1] [tool1] [user2] [asst2] [user3]
            ╰── CACHED PREFIX (identical across calls) ──╯
Design decisions driven by this:
- System prompt built once per session, stored in SessionDB, reused for all turns
- Memory snapshot frozen at session start (writes don't change system prompt)
- Dynamic context injected into user message, not system prompt
- Skills index cached to disk with content-hash validation
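For illustration, this is how a stable prefix can be marked with Anthropic's `cache_control` breakpoint (a sketch, not Hermes' actual call site; `system_prompt` and `history` stand in for session state):
import anthropic

client = anthropic.Anthropic()
system_prompt = "..."  # built once per session, byte-identical on every call
history = [{"role": "user", "content": "hello"}]  # grows each turn

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; any caching-capable model
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }],
    messages=history,  # the cached prefix is read back at a discounted rate
)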
Ephemeral Context Injection
Memory and plugin context are injected into the user message at API-call time:
# run_agent.py:8525-8641 (simplified)
effective_user_message = user_message

# Memory provider context (if any)
if memory_context:
    effective_user_message = f"<memory-context>{memory_context}</memory-context>\n{user_message}"

# Plugin hook context
if plugin_context:
    effective_user_message = f"<plugin-context>{plugin_context}</plugin-context>\n{effective_user_message}"
These are API-call-time only, never persisted to session DB.
Context Window Management
Proactive Compression
Context fills up:
│
├── Threshold check: tokens >= 50% of context window?
│ └── Configurable: config.yaml → compression.threshold: 0.50
│
├── Before main loop (preflight):
│ └── Compress if loaded history already exceeds threshold
│
└── After each tool execution:
└── Check should_compress() and trigger if needed
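A minimal sketch of that check, assuming token counts are already known (names illustrative):
def should_compress(used_tokens: int, context_window: int,
                    threshold: float = 0.50) -> bool:
    """True once the conversation occupies >= threshold of the window
    (threshold comes from config.yaml -> compression.threshold)."""
    return used_tokens >= context_window * threshold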
Compression Strategy
Messages: [sys] [user1] [asst1] [tool1] ... [toolN] [userM] [asstM]
          ├──── PROTECTED ────┤ ├── COMPRESSED ───┤ ├─ PROTECTED ─┤
Protected: First system + first human + first assistant + last N turns
Compressed: Middle turns summarized by auxiliary LLM
Target: Keep 20% of threshold as recent context
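An illustrative partition; the real boundaries are token-based (20% of threshold kept as recent context) rather than a fixed turn count:
def partition_for_compression(messages: list, keep_recent: int = 6):
    """Split into protected head, compressible middle, protected tail."""
    if len(messages) <= 3 + keep_recent:
        return messages, [], []           # nothing worth compressing
    head = messages[:3]                   # first system + human + assistant
    tail = messages[-keep_recent:]        # recent turns stay verbatim
    middle = messages[3:-keep_recent]     # summarized by the auxiliary LLM
    return head, middle, tail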
Context Probing
When context-overflow errors occur:
Error 413 or context_overflow
│
├── Parse available token count from error message
│ └── If found: cache to disk (confirmed value)
│
├── Step down probe tiers
│ └── In-memory only (not persisted)
│
└── Trigger compression with adjusted limits
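A sketch of the error-parsing step; the regex matches one common provider wording and is illustrative only:
import re

def parse_context_limit(error_message: str) -> int | None:
    m = re.search(r"maximum context length is (\d+) tokens", error_message)
    return int(m.group(1)) if m else None  # confirmed value -> cache to disk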
Token Usage & Cost Tracking
Usage Normalization (run_agent.py:9219-9325)
def normalize_usage(response, api_mode):
    """Extract token counts across all providers into common format."""
    return {
        "prompt_tokens": ...,
        "completion_tokens": ...,
        "total_tokens": ...,
        "cache_read_tokens": ...,      # Anthropic/OpenRouter
        "cache_creation_tokens": ...,  # Anthropic
        "reasoning_tokens": ...,       # O1/O3 models
    }
Cache Hit Tracking
# Track cache performance for Anthropic prompt caching
cache_hit_pct = cache_read_tokens / prompt_tokens * 100
# Reported in /usage and /insights
Cost Estimation
estimate_usage_cost(usage, model, provider)
# Per-model pricing tables for accurate cost tracking
# Persisted to SessionDB for /insights analytics
Tool Call Handling
Schema Format
Tools are provided to the LLM in OpenAI function-calling format:
{
  "type": "function",
  "function": {
    "name": "terminal",
    "description": "Execute a shell command",
    "parameters": {
      "type": "object",
      "properties": {
        "command": {"type": "string"},
        "timeout": {"type": "integer"}
      },
      "required": ["command"]
    }
  }
}
Tool Call Validation & Repair
The agent implements three layers of validation before executing tool calls:
1. Tool Name Repair (run_agent.py:10389-10436):
def _repair_tool_call(self, wrong_name):
    """Fuzzy match against registered tools.
    Returns best match or None (max 3 retries)."""

2. JSON Argument Validation (run_agent.py:10440-10527):
# Empty string → {} (common model quirk)
# Truncated JSON → detect and request continuation
# Malformed JSON → inject error for self-correction (max 3 retries)

3. Behavioral Guardrails (run_agent.py:10532-10538):
# Cap delegate_task: one per turn
# Deduplicate identical tool calls in same turn
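One way to implement the layer-1 fuzzy match, using difflib (illustrative; the actual matcher in _repair_tool_call may differ):
import difflib

def repair_tool_name(wrong_name: str, registered: list[str]) -> str | None:
    """Return the closest registered tool name, or None if nothing is close."""
    matches = difflib.get_close_matches(wrong_name, registered, n=1, cutoff=0.6)
    return matches[0] if matches else None

# repair_tool_name("terminl", ["terminal", "file_read"])  ->  "terminal"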
Parallel vs Sequential Execution
def _should_parallelize_tool_batch(self, tool_calls):
    # Single tool → sequential
    # Multiple read-only tools → concurrent
    # File I/O → concurrent only if paths don't overlap
    # Destructive commands (rm, mv) → always sequential
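A sketch of how that decision could drive execution, with a caller-supplied `execute` coroutine (illustrative, not Hermes' dispatcher):
import asyncio

async def run_tool_batch(tool_calls, execute, parallel_ok: bool):
    """Run read-only batches concurrently, everything else in order."""
    if parallel_ok and len(tool_calls) > 1:
        return await asyncio.gather(*(execute(tc) for tc in tool_calls))
    return [await execute(tc) for tc in tool_calls]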
Reasoning / Thinking Support
Extended Thinking (Anthropic)
# agent/anthropic_adapter.py
# Adaptive thinking budget mapping:
# xhigh → max
# high → high
# medium → medium
# low → low
# minimal → none
# Beta headers for interleaved thinking:
headers = {"anthropic-beta": "interleaved-thinking-2025-05-14"}<think> Block Extraction
For models that use XML-style thinking (DeepSeek, Qwen):
# run_agent.py:_build_assistant_message (line 6723)
# Extract <think>...</think> blocks from response
# Store as reasoning_content (separate from visible response)
# Preserve reasoning_details for multi-turn continuity
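An illustrative extraction routine (the real logic in _build_assistant_message may differ):
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Separate <think> blocks from the visible response."""
    reasoning = "\n".join(THINK_RE.findall(text)).strip()
    visible = THINK_RE.sub("", text).strip()
    return reasoning, visible  # stored as reasoning_content / response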
Reasoning Details Preservation
For multi-turn conversations with thinking models:
- reasoning_details from prior turns are carried forward
- Enables models to build on previous reasoning chains
- Provider-specific: Moonshot, Novita, OpenRouter pass reasoning_content
Guardrails & Safety
LLM Output Guardrails
| Guardrail | Purpose | Location |
|---|---|---|
| Iteration budget | Prevent infinite tool loops | IterationBudget (line 170) |
| Delegation depth limit | Prevent recursive delegation | MAX_DEPTH=2 in delegate_tool.py |
| Tool call deduplication | Prevent redundant executions | line 10532 |
| Empty response detection | Detect and retry silent completions | line 10769 |
| Truncation handling | Detect finish_reason=length and continue | line 9042 |
| Invalid tool repair | Auto-fix misspelled tool names | line 10389 |
| JSON validation | Parse and recover malformed arguments | line 10440 |
Input Guardrails
| Guardrail | Purpose | Location |
|---|---|---|
| Surrogate stripping | Remove invalid UTF-8 | line 8145 |
| Memory block stripping | Remove leaked <memory-context> | line 8155 |
| Content scanning | Tirith security scanner | tirith_security.py |
| SSRF prevention | Block private IPs in URLs | url_safety.py |
| Memory injection defense | Block prompt injection in memory writes | memory_tool.py:65-81 |
Output Guardrails
| Guardrail | Purpose | Location |
|---|---|---|
| Tool result truncation | Per-tool max_result_size_chars | registry.py:315 |
| Dangerous command approval | User confirmation for destructive ops | approval.py |
| Credential filtering | Strip API keys from subprocess env | local.py |
| Path security | Block writes to system paths | file_tools.py:94 |
Multi-Provider Fallback Strategy
Fallback Chain
Primary provider fails (429, 500, timeout)
│
├── Step 1: Retry with same provider (up to 3x)
│ └── Jittered exponential backoff (5s base, 120s cap)
│
├── Step 2: Credential pool rotation
│ └── Try next API key from pool (if configured)
│
├── Step 3: Primary transport recovery
│ └── Rebuild HTTP client once (connection issues)
│
└── Step 4: Activate fallback provider
└── Switch to configured fallback model/provider
└── Stay on fallback until primary recovers
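Step 1's backoff can be sketched as full-jittered exponential backoff with the stated constants (the helper name is illustrative):
import random

# Full-jitter exponential backoff: 5 s base, doubling per attempt, 120 s cap.
def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    return random.uniform(0, min(cap, base * (2 ** attempt)))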
Eager Fallback
For rate-limited/empty responses, Hermes can switch to fallback immediately without exhausting retries. This minimizes user-visible latency.
Credential Pool Rotation
# config.yaml
openrouter:
  api_keys:
    - "sk-or-key1"
    - "sk-or-key2"
    - "sk-or-key3"
On 429/403, the next key in the pool is tried before escalating to backoff.
Model-Specific Adaptations
Per-Model Output Limits
# agent/anthropic_adapter.py
OUTPUT_LIMITS = {
    "claude-opus-4.6": 128_000,
    "claude-sonnet-4.6": 128_000,
    "claude-haiku-4.5": 64_000,
    # ...
}
Per-Model Context Windows
Auto-detected from OpenRouter model metadata or provider error messages. Cached to disk for confirmed values.
Provider-Specific Headers
# OpenRouter: prompt caching, data policy, provider routing
# Anthropic: interleaved thinking beta, extended output
# Bedrock: model-specific content type
# Codex: OAuth token, session management
Batch & Research Patterns
Trajectory Generation (batch_runner.py)
# Parallel batch processing for training data
batch_runner.py --dataset_file=data.jsonl --batch_size=10
# Toolset distributions for variability
batch_runner.py --distribution=mixed_tasks
Each trajectory captures:
- System prompt, user messages, assistant responses
- Tool calls and their results
- Token usage and timing
- Finish reasons and error states
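A hypothetical record shape matching that list (field names illustrative):
trajectory = {
    "messages": [...],        # system prompt, user messages, assistant responses
    "tool_calls": [...],      # tool calls plus their results
    "usage": {"prompt_tokens": 0, "completion_tokens": 0},
    "timing": {"total_s": 0.0},
    "finish_reason": "stop",  # or "length", error states, etc.
}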
SWE Benchmark (mini_swe_runner.py)
A minimal agent with a single terminal tool that outputs the Hermes trajectory format; compatible with standard SWE evaluation harnesses.
RL Training (rl_cli.py)
Integration with Atropos RL environments:
- Extended timeouts for training
- RL-focused system prompts
- Wandb tracking
- Tinker integration