Hermes Agent - Design Patterns & Tradeoffs
Architectural Patterns
1. Self-Registering Modules (Plugin Pattern)
Where: Tool registration (tools/registry.py, model_tools.py)
```python
# Each tool module registers itself at import time:
# tools/file_tools.py
from tools.registry import registry

registry.register(
    name="read_file",
    toolset="file",
    schema=READ_FILE_SCHEMA,
    handler=_handle_read_file,
    check_fn=_check_file_reqs,
)
```
Discovery:
```python
# model_tools.py uses AST inspection to find modules with register() calls,
# then imports them, triggering registration:
discover_builtin_tools()
```
Why it's clever: Adding a new tool requires zero configuration. Drop a .py file in tools/, call registry.register(), and AST-based discovery finds it automatically. No import lists, no config files, no registration boilerplate.
Tradeoff: Import-time side effects can make debugging harder. Module load order matters if tools depend on other tools.
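A minimal sketch of the discovery step, assuming a flat tools/ package (the real model_tools.py logic may differ):
```python
import ast
import importlib
from pathlib import Path

def discover_builtin_tools(package: str = "tools") -> None:
    """Import every module under tools/ whose source contains a
    registry.register(...) call, triggering self-registration."""
    for path in Path(package).glob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            # Match calls of the form registry.register(...)
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and node.func.attr == "register"
                    and isinstance(node.func.value, ast.Name)
                    and node.func.value.id == "registry"):
                importlib.import_module(f"{package}.{path.stem}")
                break  # one hit is enough to import the module
```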
2. Frozen Snapshot Pattern
Where: Memory system (tools/memory_tool.py)
```
Session Start → Load MEMORY.md → Create frozen snapshot → Inject into system prompt
      │
During Session: memory writes → disk (durable); prompt unchanged
      │
Next Session → Load updated MEMORY.md → New snapshot
```
Why it's clever: This is the key to making Anthropic's prompt caching work. The system prompt is the cache key. If memory writes changed the system prompt mid-session, every memory_tool(add) call would invalidate the cache, increasing costs 4-10x. By freezing the snapshot, the prefix stays identical across all API calls in a session.
Tradeoff: The agent doesn't see its own memory writes during the current session. This is acceptable because:
- The agent already knows what it wrote (it just wrote it)
- Memory is for cross-session persistence, not intra-session state
- The todo tool handles intra-session task tracking
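The pattern reduces to a few lines; this is an illustrative sketch, not the actual memory_tool.py:
```python
class MemoryStore:
    def __init__(self, path: str = "MEMORY.md"):
        self.path = path
        # Read once at session start; never refreshed mid-session, so the
        # system prompt built from it stays byte-identical (stable cache key).
        with open(path, encoding="utf-8") as f:
            self.snapshot = f.read()

    def prompt_block(self) -> str:
        return f"<memory>\n{self.snapshot}\n</memory>"

    def add(self, entry: str) -> None:
        # Durable immediately, but only visible to the *next* session.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(entry + "\n")
```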
3. Adapter Pattern (Provider Abstraction)
Where: LLM providers (agent/anthropic_adapter.py, agent/bedrock_adapter.py), platform adapters (gateway/platforms/)
```
              Common Interface
                      │
       ┌──────────────┼──────────────┐
       │              │              │
 ┌─────┴─────┐  ┌─────┴─────┐  ┌─────┴─────┐
 │ Anthropic │  │  OpenAI-  │  │  Bedrock  │
 │ Messages  │  │Compatible │  │ Converse  │
 └───────────┘  └───────────┘  └───────────┘
```
For the gateway:
```
            BasePlatformAdapter
                     │
    ┌────────────────┼────────────────┐
    │                │                │
Telegram          Discord           Slack
Adapter           Adapter          Adapter
```
Why: A single run_conversation() method works across all LLM providers. A single GatewayRunner._handle_message() works across all platforms. New providers/platforms are added by implementing the adapter interface.
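A sketch of the gateway side of the interface (method names are illustrative, not the actual BasePlatformAdapter API):
```python
from abc import ABC, abstractmethod
from typing import Callable

class BasePlatformAdapter(ABC):
    @abstractmethod
    def send_message(self, chat_id: str, text: str) -> None:
        """Deliver an agent reply to this platform."""

    @abstractmethod
    def run(self, on_message: Callable[[str, str], None]) -> None:
        """Start the receive loop; call on_message(chat_id, text) per inbound message."""

# GatewayRunner._handle_message() only ever sees this interface, so
# Telegram, Discord, and Slack adapters are interchangeable.
```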
4. Thread Pool Concurrency (Parallel Tool Execution)
Where: run_agent.py:_execute_tool_calls_concurrent (line 7294)
```python
# Decide parallelization:
# - Single tool → sequential
# - Multiple read-only tools → concurrent
# - File I/O → concurrent only if paths don't overlap
# - Destructive commands (rm, mv) → always sequential
with ThreadPoolExecutor(max_workers=N) as executor:
    futures = [executor.submit(_invoke_tool, tc) for tc in batch]
    results = [f.result() for f in futures]  # preserve order
```
Why it's clever: The agent can make multiple independent API calls, file reads, or web searches simultaneously. The path overlap detection (_paths_overlap(), line 328) prevents race conditions on file operations without being overly conservative.
Tradeoff: Thread pool is used instead of asyncio because many tool handlers are synchronous. The async bridge (_run_async()) handles async tools by running them in the thread pool's event loop.
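A plausible shape for the overlap check, assuming it operates on resolved paths (the real _paths_overlap() may differ):
```python
import os

def _paths_overlap(a: str, b: str) -> bool:
    # Two file operations conflict if one path equals, contains, or is
    # contained by the other after resolving symlinks and "..".
    a, b = os.path.realpath(a), os.path.realpath(b)
    return a == b or a.startswith(b + os.sep) or b.startswith(a + os.sep)
```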
5. Progressive Disclosure (Skills)
Where: Skills system (tools/skills_tool.py)
Tier 1: skills_list() → name + description (10 tokens per skill)
Tier 2: skill_view(name) → full SKILL.md (100-1000 tokens)
Tier 3: skill_view(name, file) → linked files (variable)
Why: A skill library with 100 skills would consume 10,000-100,000 tokens if fully loaded. Progressive disclosure lets the agent scan the index cheaply and only load what it needs.
Tradeoff: The agent needs two tool calls to fully load a skill (list → view). This adds one LLM turn but saves significant context budget.
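Roughly, assuming a skills/<name>/SKILL.md layout (the real skills_tool.py differs in its details):
```python
from pathlib import Path

SKILLS_ROOT = Path("skills")  # assumed layout: skills/<name>/SKILL.md

def skills_list() -> list[dict]:
    # Tier 1: cheap index of name plus a one-line description per skill.
    return [{"name": md.parent.name,
             "description": md.read_text().splitlines()[0]}
            for md in sorted(SKILLS_ROOT.glob("*/SKILL.md"))]

def skill_view(name: str, file: str = "SKILL.md") -> str:
    # Tier 2: the full SKILL.md; Tier 3: a linked file inside the skill dir.
    return (SKILLS_ROOT / name / file).read_text()
```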
6. Circuit Breaker (MCP Tool Handler)
Where: tools/mcp_tool.py
```python
# Auto-reconnection with exponential backoff
# Up to 5 retries
# Fail-open after max retries (tool becomes unavailable)
# Dynamic tool discovery: listen for tools/list_changed notifications
```
Why: MCP servers are external processes that can crash, hang, or become unreachable. The circuit breaker prevents the agent from blocking on a dead server.
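In sketch form (names and transport hypothetical; the actual mcp_tool.py handles reconnection inside the MCP client):
```python
import time

MAX_RETRIES = 5

def call_mcp(server, request):
    for attempt in range(MAX_RETRIES):
        try:
            return server.call(request)
        except ConnectionError:
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
            server.reconnect()
    # Fail open after max retries: disable the tool rather than block the agent.
    server.mark_unavailable()
    return {"error": "MCP server unreachable; tool disabled"}
```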
7. Iteration Budget (Shared Resource Limiter)
Where: run_agent.py:170-255
```python
class IterationBudget:
    """Shared across parent + all subagents. Thread-safe."""
    total: int = 90
    remaining: int           # decremented by each LLM turn
    _lock: threading.Lock
```
Why it's clever: Without a shared budget, a delegation chain (parent → child → child's child) could burn through unlimited API calls. The budget is shared: if the parent uses 20 turns and delegates with a budget of 90, the child only has 70 turns remaining.
Grace call: One extra attempt when budget hits zero, so the model can produce a final summary instead of being cut off mid-thought.
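The core of the pattern fits in a few lines (a sketch; method names are assumptions):
```python
import threading

class IterationBudget:
    def __init__(self, total: int = 90):
        self.remaining = total
        self._lock = threading.Lock()
        self._grace_used = False

    def consume(self) -> bool:
        """Return True if one more LLM turn may run."""
        with self._lock:
            if self.remaining > 0:
                self.remaining -= 1
                return True
            if not self._grace_used:
                self._grace_used = True  # one grace call for a final summary
                return True
            return False

# Parent and subagents share one instance, so a delegation chain
# draws every turn from the same pool.
```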
8. WAL Mode with Convoy Prevention (SQLite)
Where: hermes_state.py:164-196
```python
import random
import sqlite3
import time

def _execute_write(self, sql, params, max_retries=5):
    """BEGIN IMMEDIATE with jittered retry to avoid the SQLite convoy."""
    for attempt in range(max_retries):
        try:
            with self._conn:  # commits on success, rolls back on error
                self._conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
                self._conn.execute(sql, params)
            break
        except sqlite3.OperationalError as e:
            if "database is locked" not in str(e) or attempt == max_retries - 1:
                raise
            time.sleep(random.uniform(0.02, 0.15))  # jittered 20-150 ms retry
```
Why: The gateway serves multiple platforms concurrently. WAL mode allows concurrent readers with a single writer. The jittered retry prevents the SQLite convoy problem, where multiple writers synchronize on the same retry interval.
Checkpoint: Every 50 writes to manage WAL file growth.
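The surrounding setup might look like this (pragmas are illustrative; hermes_state.py's actual configuration may differ):
```python
import sqlite3

conn = sqlite3.connect("hermes_state.db", check_same_thread=False)
conn.execute("PRAGMA journal_mode=WAL")      # readers never block the writer
conn.execute("PRAGMA wal_autocheckpoint=0")  # checkpoint manually instead

def maybe_checkpoint(write_count: int) -> None:
    # Every 50 writes, fold the WAL back into the main database file
    # so the -wal sidecar doesn't grow without bound.
    if write_count % 50 == 0:
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
```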
Notable Design Tradeoffs
1. Giant Files vs. Module Decomposition
run_agent.py is 11,500 lines. cli.py is 10,000 lines. gateway/run.py is 9,800 lines.
Why they're monolithic: These files are the core orchestrators. Breaking them up would:
- Introduce import cycles (they reference each other's internals)
- Make the execution flow harder to trace (grep works well in one file)
- Add abstraction layers that don't earn their keep
Tradeoff: IDE navigation is harder. New contributors face a wall of code. But the alternative (dozens of small files with tangled imports) would be worse for a project of this complexity.
2. OpenAI SDK as Universal Client
Every provider is accessed through the OpenAI Python SDK's openai.OpenAI(base_url=...) pattern, with Anthropic and Bedrock as special cases.
Pro: One code path handles 200+ models. New providers "just work" if they're OpenAI-compatible.
Con: Provider-specific features (Anthropic thinking, Gemini thought signatures) require adapter-level special cases. The abstraction leaks at the edges.
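The pattern in miniature (endpoint and model id are illustrative):
```python
import os
from openai import OpenAI

# Any OpenAI-compatible provider is just a different base_url.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
resp = client.chat.completions.create(
    model="some/model-id",  # placeholder
    messages=[{"role": "user", "content": "ping"}],
)
```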
3. Subprocess-Based Environments
Docker, SSH, and Singularity environments are driven via subprocess calls rather than client libraries.
Pro: Zero dependency overhead. Works identically on any system with the CLI installed. No API version mismatches.
Con: Parsing subprocess output is fragile. Error handling relies on exit codes and stderr patterns. No structured data from the environment.
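For example, a Docker-backed environment can reduce to subprocess calls like this (a sketch, not the project's actual wrapper):
```python
import subprocess

def docker_exec(container: str, cmd: list[str], timeout: int = 120) -> str:
    proc = subprocess.run(
        ["docker", "exec", container, *cmd],
        capture_output=True, text=True, timeout=timeout,
    )
    # All error handling rests on exit codes and stderr patterns.
    if proc.returncode != 0:
        raise RuntimeError(f"docker exec failed ({proc.returncode}): {proc.stderr.strip()}")
    return proc.stdout
```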
4. SQLite for Everything
Sessions, messages, FTS search, token tracking, cost accounting - all in SQLite.
Pro: Zero-dependency, file-based, portable. FTS5 enables cross-session search without Elasticsearch. WAL mode handles gateway concurrency.
Con: Single-writer bottleneck under high concurrency. No built-in replication. Large session databases can grow to hundreds of MB.
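For instance, cross-session search needs only an FTS5 virtual table (schema hypothetical):
```python
import sqlite3

conn = sqlite3.connect("hermes_state.db")
conn.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS messages_fts
    USING fts5(session_id, role, content)
""")
# snippet(...) returns a highlighted excerpt around the match.
hits = conn.execute(
    "SELECT session_id, snippet(messages_fts, 2, '[', ']', '...', 8) "
    "FROM messages_fts WHERE messages_fts MATCH ?",
    ("prompt caching",),
).fetchall()
```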
5. Memory in Markdown Files
MEMORY.md and USER.md are plain markdown with § delimiters.
Pro: Human-readable. Editable with any text editor. No database dependency.
Con: Character limits (2,200 / 1,375) are small. No semantic search - the entire contents must fit in the context window. The § delimiter is unconventional.
6. Skills as Filesystem
Skills are directories with markdown files, not database records.
Pro: Git-friendly. Shareable as repos. Editable with any editor. No migration scripts.
Con: No atomic multi-file updates. Race conditions possible if two sessions create skills simultaneously. Discovery requires filesystem scanning (mitigated by caching).
What's Unusual or Clever
1. RPC Stub Generation for Code Execution
code_execution_tool.py generates a hermes_tools.py file that scripts can import:
```python
# Generated hermes_tools.py
def web_search(query):
    """Calls back to parent agent via UDS or file-based RPC."""
    return _rpc_call("web_search", {"query": query})

def read_file(path):
    return _rpc_call("read_file", {"file_path": path})
```
This gives Python scripts zero-context-cost tool access. The script runs in a subprocess with RPC back to the parent. Tool results never pollute the conversation context.
2. Credential Pool Rotation
```yaml
openrouter:
  api_keys: ["key1", "key2", "key3"]
```
On rate limit (429), the next key is tried before escalating to backoff. This multiplies effective rate limits by the number of keys.
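The rotation logic is a small loop (sketch; names hypothetical):
```python
def call_with_rotation(keys: list[str], request_fn, backoff_fn):
    for key in keys:  # try each key in the pool before backing off
        resp = request_fn(api_key=key)
        if resp.status_code != 429:
            return resp
    return backoff_fn()  # every key is rate-limited: escalate to backoff
```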
3. Stale-Stream Health Checking
Even when streaming isn't needed for display, the agent uses streaming for health checking:
- 90-second stale-stream detection (no chunks → assume dead)
- 60-second per-chunk read timeout
This catches hung connections that non-streaming requests would wait on indefinitely.
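One way to implement the stale-stream side is a resettable watchdog timer (a sketch under that assumption; the per-chunk read timeout is enforced by the HTTP client):
```python
import threading

def watch_stream(stream, abort_request, stale_after: float = 90.0):
    """Yield chunks, aborting the request if none arrive for stale_after seconds."""
    timer = None

    def arm():
        nonlocal timer
        if timer is not None:
            timer.cancel()
        timer = threading.Timer(stale_after, abort_request)
        timer.start()

    arm()
    try:
        for chunk in stream:
            arm()  # progress: reset the watchdog
            yield chunk
    finally:
        timer.cancel()
```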
4. Orphaned Tool Result Stripping
If the conversation history has tool results without matching tool calls (from a crashed prior turn), they're stripped before the API call. This prevents confusing the model with phantom tool outputs.
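In OpenAI-style message dicts, the rule looks roughly like this (a sketch):
```python
def strip_orphaned_results(messages: list[dict]) -> list[dict]:
    # Collect the ids of every tool call an assistant message actually issued.
    issued = {tc["id"]
              for m in messages if m.get("role") == "assistant"
              for tc in m.get("tool_calls") or []}
    # Keep a tool result only if its call id was issued in this history.
    return [m for m in messages
            if m.get("role") != "tool" or m.get("tool_call_id") in issued]
```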
5. Pairing Code Authorization
For gateway access, unknown users receive a pairing code. They send it back to prove they have physical access to the admin's device. No OAuth flow, no database - just a short-lived code exchange.
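The whole exchange fits in a few lines (code length and TTL are assumptions):
```python
import secrets
import time

_pending: dict[str, tuple[str, float]] = {}  # user_id -> (code, expiry)

def issue_code(user_id: str, ttl: float = 300.0) -> str:
    code = secrets.token_hex(3)  # short code shown on the admin's device
    _pending[user_id] = (code, time.time() + ttl)
    return code

def verify(user_id: str, code: str) -> bool:
    entry = _pending.pop(user_id, None)  # single-use
    return entry is not None and entry[0] == code and time.time() < entry[1]
```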
6. Dual-Cache Skills Index
```python
from functools import lru_cache

# Layer 1: In-process LRU cache (instant, lost on restart)
@lru_cache(maxsize=1)
def _cached_skills_index():
    ...

# Layer 2: Disk snapshot (.skills_prompt_snapshot.json)
# Survives restarts, validated by content hash
```
The skills index needs to be built from filesystem scanning (walking directories, parsing YAML frontmatter). This dual cache ensures the first API call of a session is fast.
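The disk layer's validation step might look like this (field names assumed):
```python
import json
from pathlib import Path

SNAPSHOT = Path(".skills_prompt_snapshot.json")

def load_snapshot(content_hash: str):
    # Reuse the snapshot only if the skills tree hasn't changed.
    if SNAPSHOT.exists():
        snap = json.loads(SNAPSHOT.read_text())
        if snap.get("hash") == content_hash:
            return snap.get("index")
    return None  # stale or missing: rebuild by scanning the filesystem
```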
Anti-Patterns Avoided
1. No ORM
SQLite is accessed directly with raw SQL. For a single-table schema with FTS5, an ORM would add complexity without value.
2. No Microservices
The gateway is a single process managing all platforms. This avoids:
- Inter-service communication overhead
- Deployment complexity
- Distributed state management
3. No Abstract Factory for Tools
Tools register themselves with concrete handlers. There's no abstract Tool base class or ToolFactory. This keeps the tool system simple - each tool is just a function with a JSON schema.
4. No Event Bus
Components communicate via direct method calls and callbacks, not an event bus. This makes the execution flow easy to trace (stack traces are meaningful).
Potential Pitfalls
1. Single-File Scale
Files exceeding 10K lines become hard to navigate and review. As the codebase grows, decomposition may become necessary despite the tradeoffs noted above.
2. Import-Time Registration
Tool modules execute code at import time. A bug in any tool module can prevent the entire agent from starting. This makes error isolation harder.
3. Thread Safety Surface Area
The tool registry, approval state, memory store, and session DB all have their own locking strategies. A deadlock between any two would freeze the agent. The current design avoids this by keeping lock scopes narrow, but the surface area is large.
4. Context Window Dependence
The memory system (2,200 + 1,375 = 3,575 chars) and skills index must fit in the context window alongside the conversation. Models with smaller windows may not leave enough room for useful conversation after system prompt injection.