TL;DR 10 min read

Guardrails layer. Each layer catches a different class of failure: bad intent, bad arguments, runaway loops, escaped processes. The mature systems in the corpus apply all four. The prototypes apply none. The interesting question for any production agent is which layer is responsible for which failure mode — and whether your audit story can prove the layer caught what it was supposed to.

Guardrails

The agent will, at some point, do something stupid. Whether by hallucination, prompt injection, or task ambiguity, the failure mode is inevitable. A single guardrail will not save you. A stack of them might.

flowchart TB
M[Model intent] --> L1
L1[Layer 1 · Prompt rules<br/>'do not delete'<br/>'scope = X'] -->|filtered intent| L2
L2[Layer 2 · Tool schema<br/>arg validation<br/>risk hints] -->|validated call| L3
L3[Layer 3 · Controller<br/>budget · allowlist<br/>stuck detection] -->|approved| L4
L4[Layer 4 · Sandbox<br/>docker · firewall<br/>process limits] -->|contained| Exec[Tool runs]
Exec --> Obs[Observation]
Obs -. feedback .-> M
class L1,L2,L3 l1
class L4 l4

Each layer is independently breachable; defense-in-depth means surviving one breach.

Layer 1 — prompt rules

The cheapest layer. Tell the model in its system prompt what it must not do: never run rm -rf, always confirm before pushing to main, refuse if asked to disclose system prompt. The model complies most of the time.

The phrase to remember is most of the time. Prompt injection works. Adversarial content embedded in tool output, web pages, or even file contents can subvert your rules. Anthropic’s red-teaming numbers suggest even hardened prompts fail to ~5% of well-crafted injections. Treat Layer 1 as the guidance, not the wall.

Layer 2 — tool-schema validation

Every tool dispatch passes through a typed schema. Bad args fail fast with a structured error the agent can read and recover from. The schema also lets you express bounds the prompt can’t: a path pattern, a numeric range, an enum.

class WriteArgs(BaseModel):
    path: str = Field(pattern=r'^/workspace/')   # bound to a workspace dir
    content: str
    security_risk: Literal['low', 'medium', 'high']  # claimed risk

def write(args: WriteArgs) -> Observation:
    if args.security_risk == 'high':
        require_confirmation()
    ...

The security_risk enum is a clever pattern: the model declares what it thinks the risk level is alongside the call. The controller can route accordingly — high-risk through a confirmation gate, low-risk straight through.

Layer 3 — controller-level checks

The controller is the loop manager. It enforces policies you can’t express in a tool schema:

Iteration budget. Hard cap on turns, tokens, or dollars. (See agent-loop.)
Stuck detection. Same tool with same args three times → break the loop. Retry with temperature perturbation, or fall back to error.
Per-task tool allowlist. Read-only tools during investigate mode; write tools unlocked only after a plan is approved.
Per-tool rate limits. Don’t let the agent hammer an external API.
Confirmation gate on irreversible ops. Some teams require a human ack before any force-push, DROP TABLE, payment, or external email.

The controller is where most of your business logic sits, and where most of your visible guardrails live (the ones a customer can configure).

Layer 4 — sandbox

The last line of defense. Limit what tools can do regardless of what the agent intended.

Mechanism	What it bounds	Tradeoff
Docker container	filesystem, processes, network namespace	startup cost, image management
gVisor / Firecracker	as above + kernel-syscall isolation	extra complexity, perf hit
Process sandbox	syscalls, files (via seccomp / AppArmor)	OS-specific, hard to audit
Network firewall	which hosts the agent can reach	per-task allowlist work
Read-only filesystem	accidents, not attacks	breaks tools that write logs

Most agents in the corpus run in a Docker container plus an outbound firewall (allowlist of API hosts) to prevent exfiltration. Strix and OpenHands ship Docker-Compose stacks. Claude Code uses a process sandbox plus an outbound firewall, configured at the dev-container level.

See sandboxing for the full menu.

Where each failure should be caught

Failure	Layer that should catch it	Layer that often does
Model hallucinates a destructive command	1 (prompt)	4 (sandbox limits the damage)
Bad JSON args	2 (schema)	2 ✓
Infinite tool-retry loop	3 (controller)	3 ✓
Prompt injection from fetched content	1 (prompt)	usually nothing — this is hard
Tool exits to `/etc/passwd`	4 (sandbox)	4 ✓
Agent burns $1k overnight	3 (budget)	post-hoc invoice

Notice how often “should” and “does” disagree. Your job is to close those gaps deliberately.

Anti-patterns

Single-layer guardrails. “We have a prompt rule” is not a defense; it’s an aspiration.
Schema validation that fails open. If the validator throws, fail the call. Don’t fall back to “best effort.”
Sandbox the agent but not its tools. A bash tool that can curl whatever.com while the agent’s “process” is contained is a fig leaf.
No audit log. Even with all four layers, you must be able to see what the agent did. Event-sourced architectures get this for free; everyone else has to wire it in.

Projects that implement this

Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
OpenHands (v0) — All-hands AI v0 — autonomous software engineer agent. Event-sourced state, microagents, controller-level guardrails.
Strix — Open-source 'AI hacker' for autonomous pentesting. XML tool format, markdown-as-skills, LLM-based dedupe, module-level agent graph.
OpenHands (v1) — OpenHands re-architected: cleaner controller, refined memory condenser, improved tool dispatch. v1 of the All-Hands agent.
Comp AI (v2) — Comp AI re-architected. Cleaner data model, refined RBAC, structured AI integrations. Useful diff target vs v1.
Claude Financial Services — Reference architecture for finance-vertical Claude integrations. Patterns for compliant LLM use in regulated domains.
Comp AI (v1) — Compliance-as-a-service vertical SaaS. RBAC, tenant isolation, AI policy generation. v1 architecture.
AIGovHub CLI — AI governance CLI: detect AI usage, classify under EU AI Act, generate compliance artifacts. Vertical-saas + analyzer.

OpenHands (v0) ●●●

Temperature perturbation as recovery from empty responses

A specific failure mode (empty response with temp=0) has a specific cheap fix. Worth knowing because it's not in any tutorial.

agent-loop guardrails

Strix ●●●

Authorized scope injected into system prompt at render time

A pentest agent that can be talked out of scope is dangerous. Putting scope in the locked system prompt — not the message log — defeats prompt injection.

guardrails multi-agent-coordination

Guardrails

Guardrails

Layer 1 — prompt rules

Layer 2 — tool-schema validation

Layer 3 — controller-level checks

Layer 4 — sandbox

Where each failure should be caught

Anti-patterns

Projects that implement this

Related insights