Hermes Agent - Design Patterns & Tradeoffs
Architectural Patterns
1. Self-Registering Modules (Plugin Pattern)
Where: Tool registration (tools/registry.py, model_tools.py)
```python
# Each tool module registers itself at import time:
# tools/file_tools.py
from tools.registry import registry

registry.register(
    name="read_file",
    toolset="file",
    schema=READ_FILE_SCHEMA,
    handler=_handle_read_file,
    check_fn=_check_file_reqs,
)
```
Discovery:
```python
# model_tools.py uses AST inspection to find modules with register() calls,
# then imports them, triggering registration:
discover_builtin_tools()
```
Why it's clever: Adding a new tool requires zero configuration. Drop a .py file in tools/, call registry.register(), and AST-based discovery finds it automatically. No import lists, no config files, no registration boilerplate.
Tradeoff: Import-time side effects can make debugging harder. Module load order matters if tools depend on other tools.
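A minimal sketch of the discovery step, assuming a flat tools/ package (the real model_tools.py logic may differ):
```python
import ast
import importlib
from pathlib import Path

def discover_builtin_tools(package: str = "tools") -> None:
    """Import every module under tools/ whose source contains a
    registry.register(...) call, triggering self-registration."""
    for path in Path(package).glob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            # Match calls of the form registry.register(...)
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and node.func.attr == "register"
                    and isinstance(node.func.value, ast.Name)
                    and node.func.value.id == "registry"):
                importlib.import_module(f"{package}.{path.stem}")
                break  # one hit is enough to import the module
```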
2. Frozen Snapshot Pattern
Where: Memory system (tools/memory_tool.py)
```
Session Start → Load MEMORY.md → Create frozen snapshot → Inject into system prompt
      │
During Session: memory writes → disk (durable); prompt unchanged
      │
Next Session → Load updated MEMORY.md → New snapshot
```
Why it's clever: This is the key to making Anthropic's prompt caching work. The system prompt is the cache key. If memory writes changed the system prompt mid-session, every memory_tool(add) call would invalidate the cache, increasing costs 4-10x. By freezing the snapshot, the prefix stays identical across all API calls in a session.
Tradeoff: The agent doesn't see its own memory writes during the current session. This is acceptable because:
- The agent already knows what it wrote (it just wrote it)
- Memory is for cross-session persistence, not intra-session state
- The todo tool handles intra-session task tracking
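The pattern reduces to a few lines; this is an illustrative sketch, not the actual memory_tool.py:
```python
class MemoryStore:
    def __init__(self, path: str = "MEMORY.md"):
        self.path = path
        # Read once at session start; never refreshed mid-session, so the
        # system prompt built from it stays byte-identical (stable cache key).
        with open(path, encoding="utf-8") as f:
            self.snapshot = f.read()

    def prompt_block(self) -> str:
        return f"<memory>\n{self.snapshot}\n</memory>"

    def add(self, entry: str) -> None:
        # Durable immediately, but only visible to the *next* session.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(entry + "\n")
```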
3. Adapter Pattern (Provider Abstraction)
Where: LLM providers (agent/anthropic_adapter.py, agent/bedrock_adapter.py), platform adapters (gateway/platforms/)
```
              Common Interface
                      │
       ┌──────────────┼──────────────┐
       │              │              │
 ┌─────┴─────┐  ┌─────┴─────┐  ┌─────┴─────┐
 │ Anthropic │  │  OpenAI-  │  │  Bedrock  │
 │ Messages  │  │Compatible │  │ Converse  │
 └───────────┘  └───────────┘  └───────────┘
```
For the gateway:
```
            BasePlatformAdapter
                     │
    ┌────────────────┼────────────────┐
    │                │                │
Telegram          Discord           Slack
Adapter           Adapter          Adapter
```
Why: A single run_conversation() method works across all LLM providers. A single GatewayRunner._handle_message() works across all platforms. New providers/platforms are added by implementing the adapter interface.
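A sketch of the gateway side of the interface (method names are illustrative, not the actual BasePlatformAdapter API):
```python
from abc import ABC, abstractmethod
from typing import Callable

class BasePlatformAdapter(ABC):
    @abstractmethod
    def send_message(self, chat_id: str, text: str) -> None:
        """Deliver an agent reply to this platform."""

    @abstractmethod
    def run(self, on_message: Callable[[str, str], None]) -> None:
        """Start the receive loop; call on_message(chat_id, text) per inbound message."""

# GatewayRunner._handle_message() only ever sees this interface, so
# Telegram, Discord, and Slack adapters are interchangeable.
```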
4. Thread Pool Concurrency (Parallel Tool Execution)
Where: run_agent.py:_execute_tool_calls_concurrent (line 7294)
```python
# Decide parallelization:
# - Single tool → sequential
# - Multiple read-only tools → concurrent
# - File I/O → concurrent only if paths don't overlap
# - Destructive commands (rm, mv) → always sequential
with ThreadPoolExecutor(max_workers=N) as executor:
    futures = [executor.submit(_invoke_tool, tc) for tc in batch]
    results = [f.result() for f in futures]  # preserve order
```
Why it's clever: The agent can make multiple independent API calls, file reads, or web searches simultaneously. The path overlap detection (_paths_overlap(), line 328) prevents race conditions on file operations without being overly conservative.
Tradeoff: Thread pool is used instead of asyncio because many tool handlers are synchronous. The async bridge (_run_async()) handles async tools by running them in the thread pool's event loop.
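A plausible shape for the overlap check, assuming it operates on resolved paths (the real _paths_overlap() may differ):
```python
import os

def _paths_overlap(a: str, b: str) -> bool:
    # Two file operations conflict if one path equals, contains, or is
    # contained by the other after resolving symlinks and "..".
    a, b = os.path.realpath(a), os.path.realpath(b)
    return a == b or a.startswith(b + os.sep) or b.startswith(a + os.sep)
```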
5. Progressive Disclosure (Skills)
Where: Skills system (tools/skills_tool.py)
Tier 1: skills_list() → name + description (10 tokens per skill)
Tier 2: skill_view(name) → full SKILL.md (100-1000 tokens)
Tier 3: skill_view(name, file) → linked files (variable)
Why: A skill library with 100 skills would consume 10,000-100,000 tokens if fully loaded. Progressive disclosure lets the agent scan the index cheaply and only load what it needs.
Tradeoff: The agent needs two tool calls to fully load a skill (list → view). This adds one LLM turn but saves significant context budget.
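Roughly, assuming a skills/<name>/SKILL.md layout (the real skills_tool.py differs in its details):
```python
from pathlib import Path

SKILLS_ROOT = Path("skills")  # assumed layout: skills/<name>/SKILL.md

def skills_list() -> list[dict]:
    # Tier 1: cheap index of name plus a one-line description per skill.
    return [{"name": md.parent.name,
             "description": md.read_text().splitlines()[0]}
            for md in sorted(SKILLS_ROOT.glob("*/SKILL.md"))]

def skill_view(name: str, file: str = "SKILL.md") -> str:
    # Tier 2: the full SKILL.md; Tier 3: a linked file inside the skill dir.
    return (SKILLS_ROOT / name / file).read_text()
```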
6. Circuit Breaker (MCP Tool Handler)
Where: tools/mcp_tool.py
```python
# Auto-reconnection with exponential backoff
# Up to 5 retries
# Fail-open after max retries (tool becomes unavailable)
# Dynamic tool discovery: listen for tools/list_changed notifications
```
Why: MCP servers are external processes that can crash, hang, or become unreachable. The circuit breaker prevents the agent from blocking on a dead server.
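In sketch form (names and transport hypothetical; the actual mcp_tool.py handles reconnection inside the MCP client):
```python
import time

MAX_RETRIES = 5

def call_mcp(server, request):
    for attempt in range(MAX_RETRIES):
        try:
            return server.call(request)
        except ConnectionError:
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
            server.reconnect()
    # Fail open after max retries: disable the tool rather than block the agent.
    server.mark_unavailable()
    return {"error": "MCP server unreachable; tool disabled"}
```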
7. Iteration Budget (Shared Resource Limiter)
Where: run_agent.py:170-255
```python
class IterationBudget:
    """Shared across parent + all subagents. Thread-safe."""
    total: int = 90
    remaining: int           # decremented by each LLM turn
    _lock: threading.Lock
```
Why it's clever: Without a shared budget, a delegation chain (parent → child → child's child) could burn through unlimited API calls. The budget is shared: if the parent uses 20 turns and delegates with a budget of 90, the child only has 70 turns remaining.
Grace call: One extra attempt when budget hits zero, so the model can produce a final summary instead of being cut off mid-thought.
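The core of the pattern fits in a few lines (a sketch; method names are assumptions):
```python
import threading

class IterationBudget:
    def __init__(self, total: int = 90):
        self.remaining = total
        self._lock = threading.Lock()
        self._grace_used = False

    def consume(self) -> bool:
        """Return True if one more LLM turn may run."""
        with self._lock:
            if self.remaining > 0:
                self.remaining -= 1
                return True
            if not self._grace_used:
                self._grace_used = True  # one grace call for a final summary
                return True
            return False

# Parent and subagents share one instance, so a delegation chain
# draws every turn from the same pool.
```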
8. WAL Mode with Convoy Prevention (SQLite)
Where: hermes_state.py:164-196
```python
import random
import sqlite3
import time

def _execute_write(self, sql, params, max_retries=5):
    """BEGIN IMMEDIATE with jittered retry to avoid the SQLite convoy."""
    for attempt in range(max_retries):
        try:
            with self._conn:  # commits on success, rolls back on error
                self._conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
                self._conn.execute(sql, params)
            break
        except sqlite3.OperationalError as e:
            if "database is locked" not in str(e) or attempt == max_retries - 1:
                raise
            time.sleep(random.uniform(0.02, 0.15))  # jittered 20-150 ms retry
```
Why: The gateway serves multiple platforms concurrently. WAL mode allows concurrent readers with a single writer. The jittered retry prevents the SQLite convoy problem, where multiple writers synchronize on the same retry interval.
Checkpoint: Every 50 writes to manage WAL file growth.
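The surrounding setup might look like this (pragmas are illustrative; hermes_state.py's actual configuration may differ):
```python
import sqlite3

conn = sqlite3.connect("hermes_state.db", check_same_thread=False)
conn.execute("PRAGMA journal_mode=WAL")      # readers never block the writer
conn.execute("PRAGMA wal_autocheckpoint=0")  # checkpoint manually instead

def maybe_checkpoint(write_count: int) -> None:
    # Every 50 writes, fold the WAL back into the main database file
    # so the -wal sidecar doesn't grow without bound.
    if write_count % 50 == 0:
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
```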
Notable Design Tradeoffs
1. Giant Files vs. Module Decomposition
run_agent.py is 11,500 lines. cli.py is 10,000 lines. gateway/run.py is 9,800 lines.
Why they're monolithic: These files are the core orchestrators. Breaking them up would:
- Introduce import cycles (they reference each other's internals)
- Make the execution flow harder to trace (grep works well in one file)
- Add abstraction layers that don't earn their keep
Tradeoff: IDE navigation is harder. New contributors face a wall of code. But the alternative (dozens of small files with tangled imports) would be worse for a project of this complexity.
2. OpenAI SDK as Universal Client
Every provider is accessed through the OpenAI Python SDK's openai.OpenAI(base_url=...) pattern, with Anthropic and Bedrock as special cases.
Pro: One code path handles 200+ models. New providers "just work" if they're OpenAI-compatible.
Con: Provider-specific features (Anthropic thinking, Gemini thought signatures) require adapter-level special cases. The abstraction leaks at the edges.
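The pattern in miniature (endpoint and model id are illustrative):
```python
import os
from openai import OpenAI

# Any OpenAI-compatible provider is just a different base_url.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
resp = client.chat.completions.create(
    model="some/model-id",  # placeholder
    messages=[{"role": "user", "content": "ping"}],
)
```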
3. Subprocess-Based Environments
Docker, SSH, and Singularity environments are driven via subprocess calls rather than client libraries.
Pro: Zero dependency overhead. Works identically on any system with the CLI installed. No API version mismatches.
Con: Parsing subprocess output is fragile. Error handling relies on exit codes and stderr patterns. No structured data from the environment.
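For example, a Docker-backed environment can reduce to subprocess calls like this (a sketch, not the project's actual wrapper):
```python
import subprocess

def docker_exec(container: str, cmd: list[str], timeout: int = 120) -> str:
    proc = subprocess.run(
        ["docker", "exec", container, *cmd],
        capture_output=True, text=True, timeout=timeout,
    )
    # All error handling rests on exit codes and stderr patterns.
    if proc.returncode != 0:
        raise RuntimeError(f"docker exec failed ({proc.returncode}): {proc.stderr.strip()}")
    return proc.stdout
```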
4. SQLite for Everything
Sessions, messages, FTS search, token tracking, cost accounting - all in SQLite.
Pro: Zero-dependency, file-based, portable. FTS5 enables cross-session search without Elasticsearch. WAL mode handles gateway concurrency.
Con: Single-writer bottleneck under high concurrency. No built-in replication. Large session databases can grow to hundreds of MB.
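For instance, cross-session search needs only an FTS5 virtual table (schema hypothetical):
```python
import sqlite3

conn = sqlite3.connect("hermes_state.db")
conn.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS messages_fts
    USING fts5(session_id, role, content)
""")
# snippet(...) returns a highlighted excerpt around the match.
hits = conn.execute(
    "SELECT session_id, snippet(messages_fts, 2, '[', ']', '...', 8) "
    "FROM messages_fts WHERE messages_fts MATCH ?",
    ("prompt caching",),
).fetchall()
```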
5. Memory in Markdown Files
MEMORY.md and USER.md are plain markdown with § delimiters.
Pro: Human-readable. Editable with any text editor. No database dependency.
Con: Character limits (2,200 / 1,375) are small. No semantic search - the entire contents must fit in the context window. The § delimiter is unconventional.
6. Skills as Filesystem
Skills are directories with markdown files, not database records.
Pro: Git-friendly. Shareable as repos. Editable with any editor. No migration scripts.
Con: No atomic multi-file updates. Race conditions possible if two sessions create skills simultaneously. Discovery requires filesystem scanning (mitigated by caching).
What's Unusual or Clever
1. RPC Stub Generation for Code Execution
code_execution_tool.py generates a hermes_tools.py file that scripts can import:
```python
# Generated hermes_tools.py
def web_search(query):
    """Calls back to parent agent via UDS or file-based RPC."""
    return _rpc_call("web_search", {"query": query})

def read_file(path):
    return _rpc_call("read_file", {"file_path": path})
```
This gives Python scripts zero-context-cost tool access. The script runs in a subprocess with RPC back to the parent. Tool results never pollute the conversation context.
2. Credential Pool Rotation
```yaml
openrouter:
  api_keys: ["key1", "key2", "key3"]
```
On rate limit (429), the next key is tried before escalating to backoff. This multiplies effective rate limits by the number of keys.
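The rotation logic is a small loop (sketch; names hypothetical):
```python
def call_with_rotation(keys: list[str], request_fn, backoff_fn):
    for key in keys:  # try each key in the pool before backing off
        resp = request_fn(api_key=key)
        if resp.status_code != 429:
            return resp
    return backoff_fn()  # every key is rate-limited: escalate to backoff
```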
3. Stale-Stream Health Checking
Even when streaming isn't needed for display, the agent uses streaming for health checking:
- 90-second stale-stream detection (no chunks → assume dead)
- 60-second per-chunk read timeout
This catches hung connections that non-streaming requests would wait on indefinitely.
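One way to implement the stale-stream side is a resettable watchdog timer (a sketch under that assumption; the per-chunk read timeout is enforced by the HTTP client):
```python
import threading

def watch_stream(stream, abort_request, stale_after: float = 90.0):
    """Yield chunks, aborting the request if none arrive for stale_after seconds."""
    timer = None

    def arm():
        nonlocal timer
        if timer is not None:
            timer.cancel()
        timer = threading.Timer(stale_after, abort_request)
        timer.start()

    arm()
    try:
        for chunk in stream:
            arm()  # progress: reset the watchdog
            yield chunk
    finally:
        timer.cancel()
```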
4. Orphaned Tool Result Stripping
If the conversation history has tool results without matching tool calls (from a crashed prior turn), they're stripped before the API call. This prevents confusing the model with phantom tool outputs.
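In OpenAI-style message dicts, the rule looks roughly like this (a sketch):
```python
def strip_orphaned_results(messages: list[dict]) -> list[dict]:
    # Collect the ids of every tool call an assistant message actually issued.
    issued = {tc["id"]
              for m in messages if m.get("role") == "assistant"
              for tc in m.get("tool_calls") or []}
    # Keep a tool result only if its call id was issued in this history.
    return [m for m in messages
            if m.get("role") != "tool" or m.get("tool_call_id") in issued]
```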
5. Pairing Code Authorization
For gateway access, unknown users receive a pairing code. They send it back to prove they have physical access to the admin's device. No OAuth flow, no database - just a short-lived code exchange.
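The whole exchange fits in a few lines (code length and TTL are assumptions):
```python
import secrets
import time

_pending: dict[str, tuple[str, float]] = {}  # user_id -> (code, expiry)

def issue_code(user_id: str, ttl: float = 300.0) -> str:
    code = secrets.token_hex(3)  # short code shown on the admin's device
    _pending[user_id] = (code, time.time() + ttl)
    return code

def verify(user_id: str, code: str) -> bool:
    entry = _pending.pop(user_id, None)  # single-use
    return entry is not None and entry[0] == code and time.time() < entry[1]
```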
6. Dual-Cache Skills Index
```python
from functools import lru_cache

# Layer 1: In-process LRU cache (instant, lost on restart)
@lru_cache(maxsize=1)
def _cached_skills_index():
    ...

# Layer 2: Disk snapshot (.skills_prompt_snapshot.json)
# Survives restarts, validated by content hash
```
The skills index needs to be built from filesystem scanning (walking directories, parsing YAML frontmatter). This dual cache ensures the first API call of a session is fast.
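The disk layer's validation step might look like this (field names assumed):
```python
import json
from pathlib import Path

SNAPSHOT = Path(".skills_prompt_snapshot.json")

def load_snapshot(content_hash: str):
    # Reuse the snapshot only if the skills tree hasn't changed.
    if SNAPSHOT.exists():
        snap = json.loads(SNAPSHOT.read_text())
        if snap.get("hash") == content_hash:
            return snap.get("index")
    return None  # stale or missing: rebuild by scanning the filesystem
```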
Anti-Patterns Avoided
1. No ORM
SQLite is accessed directly with raw SQL. For a single-table schema with FTS5, an ORM would add complexity without value.
2. No Microservices
The gateway is a single process managing all platforms. This avoids:
- Inter-service communication overhead
- Deployment complexity
- Distributed state management
3. No Abstract Factory for Tools
Tools register themselves with concrete handlers. There's no abstract Tool base class or ToolFactory. This keeps the tool system simple - each tool is just a function with a JSON schema.
4. No Event Bus
Components communicate via direct method calls and callbacks, not an event bus. This makes the execution flow easy to trace (stack traces are meaningful).
Potential Pitfalls
1. Single-File Scale
Files exceeding 10K lines become hard to navigate and review. As the codebase grows, decomposition may become necessary despite the tradeoffs noted above.
2. Import-Time Registration
Tool modules execute code at import time. A bug in any tool module can prevent the entire agent from starting. This makes error isolation harder.
3. Thread Safety Surface Area
The tool registry, approval state, memory store, and session DB all have their own locking strategies. A deadlock between any two would freeze the agent. The current design avoids this by keeping lock scopes narrow, but the surface area is large.
4. Context Window Dependence
The memory system (2,200 + 1,375 = 3,575 chars) and skills index must fit in the context window alongside the conversation. Models with smaller windows may not leave enough room for useful conversation after system prompt injection.