← All concepts

Guardrails

Layered defenses — prompt, schema, controller, sandbox — each catching a different class of failure. The story you tell auditors.

8 projects 2 insights 4 variants
TL;DR 10 min read

Guardrails layer. Each layer catches a different class of failure: bad intent, bad arguments, runaway loops, escaped processes. The mature systems in the corpus apply all four. The prototypes apply none. The interesting question for any production agent is which layer is responsible for which failure mode — and whether your audit story can prove the layer caught what it was supposed to.

Guardrails

The agent will, at some point, do something stupid. Whether by hallucination, prompt injection, or task ambiguity, the failure mode is inevitable. A single guardrail will not save you. A stack of them might.

flowchart TB
M[Model intent] --> L1
L1[Layer 1 · Prompt rules<br/>'do not delete'<br/>'scope = X'] -->|filtered intent| L2
L2[Layer 2 · Tool schema<br/>arg validation<br/>risk hints] -->|validated call| L3
L3[Layer 3 · Controller<br/>budget · allowlist<br/>stuck detection] -->|approved| L4
L4[Layer 4 · Sandbox<br/>docker · firewall<br/>process limits] -->|contained| Exec[Tool runs]
Exec --> Obs[Observation]
Obs -. feedback .-> M
class L1,L2,L3 l1
class L4 l4
Each layer is independently breachable; defense-in-depth means surviving one breach.

Layer 1 — prompt rules

The cheapest layer. Tell the model in its system prompt what it must not do: never run rm -rf, always confirm before pushing to main, refuse if asked to disclose system prompt. The model complies most of the time.

The phrase to remember is most of the time. Prompt injection works. Adversarial content embedded in tool output, web pages, or even file contents can subvert your rules. Anthropic’s red-teaming numbers suggest even hardened prompts fail to ~5% of well-crafted injections. Treat Layer 1 as the guidance, not the wall.

Layer 2 — tool-schema validation

Every tool dispatch passes through a typed schema. Bad args fail fast with a structured error the agent can read and recover from. The schema also lets you express bounds the prompt can’t: a path pattern, a numeric range, an enum.

class WriteArgs(BaseModel):
    path: str = Field(pattern=r'^/workspace/')   # bound to a workspace dir
    content: str
    security_risk: Literal['low', 'medium', 'high']  # claimed risk

def write(args: WriteArgs) -> Observation:
    if args.security_risk == 'high':
        require_confirmation()
    ...

The security_risk enum is a clever pattern: the model declares what it thinks the risk level is alongside the call. The controller can route accordingly — high-risk through a confirmation gate, low-risk straight through.

Layer 3 — controller-level checks

The controller is the loop manager. It enforces policies you can’t express in a tool schema:

  • Iteration budget. Hard cap on turns, tokens, or dollars. (See agent-loop.)
  • Stuck detection. Same tool with same args three times → break the loop. Retry with temperature perturbation, or fall back to error.
  • Per-task tool allowlist. Read-only tools during investigate mode; write tools unlocked only after a plan is approved.
  • Per-tool rate limits. Don’t let the agent hammer an external API.
  • Confirmation gate on irreversible ops. Some teams require a human ack before any force-push, DROP TABLE, payment, or external email.

The controller is where most of your business logic sits, and where most of your visible guardrails live (the ones a customer can configure).

Layer 4 — sandbox

The last line of defense. Limit what tools can do regardless of what the agent intended.

MechanismWhat it boundsTradeoff
Docker containerfilesystem, processes, network namespacestartup cost, image management
gVisor / Firecrackeras above + kernel-syscall isolationextra complexity, perf hit
Process sandboxsyscalls, files (via seccomp / AppArmor)OS-specific, hard to audit
Network firewallwhich hosts the agent can reachper-task allowlist work
Read-only filesystemaccidents, not attacksbreaks tools that write logs

Most agents in the corpus run in a Docker container plus an outbound firewall (allowlist of API hosts) to prevent exfiltration. Strix and OpenHands ship Docker-Compose stacks. Claude Code uses a process sandbox plus an outbound firewall, configured at the dev-container level.

See sandboxing for the full menu.

Where each failure should be caught

FailureLayer that should catch itLayer that often does
Model hallucinates a destructive command1 (prompt)4 (sandbox limits the damage)
Bad JSON args2 (schema)2 ✓
Infinite tool-retry loop3 (controller)3 ✓
Prompt injection from fetched content1 (prompt)usually nothing — this is hard
Tool exits to /etc/passwd4 (sandbox)4 ✓
Agent burns $1k overnight3 (budget)post-hoc invoice

Notice how often “should” and “does” disagree. Your job is to close those gaps deliberately.

Anti-patterns

  • Single-layer guardrails. “We have a prompt rule” is not a defense; it’s an aspiration.
  • Schema validation that fails open. If the validator throws, fail the call. Don’t fall back to “best effort.”
  • Sandbox the agent but not its tools. A bash tool that can curl whatever.com while the agent’s “process” is contained is a fig leaf.
  • No audit log. Even with all four layers, you must be able to see what the agent did. Event-sourced architectures get this for free; everyone else has to wire it in.

Projects that implement this

  • Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
  • OpenHands (v0) — All-hands AI v0 — autonomous software engineer agent. Event-sourced state, microagents, controller-level guardrails.
  • Strix — Open-source 'AI hacker' for autonomous pentesting. XML tool format, markdown-as-skills, LLM-based dedupe, module-level agent graph.
  • OpenHands (v1) — OpenHands re-architected: cleaner controller, refined memory condenser, improved tool dispatch. v1 of the All-Hands agent.
  • Comp AI (v2) — Comp AI re-architected. Cleaner data model, refined RBAC, structured AI integrations. Useful diff target vs v1.
  • Claude Financial Services — Reference architecture for finance-vertical Claude integrations. Patterns for compliant LLM use in regulated domains.
  • Comp AI (v1) — Compliance-as-a-service vertical SaaS. RBAC, tenant isolation, AI policy generation. v1 architecture.
  • AIGovHub CLI — AI governance CLI: detect AI usage, classify under EU AI Act, generate compliance artifacts. Vertical-saas + analyzer.