Guardrails layer. Each layer catches a different class of failure: bad intent, bad arguments, runaway loops, escaped processes. The mature systems in the corpus apply all four. The prototypes apply none. The interesting question for any production agent is which layer is responsible for which failure mode — and whether your audit story can prove the layer caught what it was supposed to.
Guardrails
The agent will, at some point, do something stupid. Whether by hallucination, prompt injection, or task ambiguity, the failure mode is inevitable. A single guardrail will not save you. A stack of them might.
flowchart TB M[Model intent] --> L1 L1[Layer 1 · Prompt rules<br/>'do not delete'<br/>'scope = X'] -->|filtered intent| L2 L2[Layer 2 · Tool schema<br/>arg validation<br/>risk hints] -->|validated call| L3 L3[Layer 3 · Controller<br/>budget · allowlist<br/>stuck detection] -->|approved| L4 L4[Layer 4 · Sandbox<br/>docker · firewall<br/>process limits] -->|contained| Exec[Tool runs] Exec --> Obs[Observation] Obs -. feedback .-> M class L1,L2,L3 l1 class L4 l4
Layer 1 — prompt rules
The cheapest layer. Tell the model in its system prompt what it must not do: never run rm -rf, always confirm before pushing to main, refuse if asked to disclose system prompt. The model complies most of the time.
The phrase to remember is most of the time. Prompt injection works. Adversarial content embedded in tool output, web pages, or even file contents can subvert your rules. Anthropic’s red-teaming numbers suggest even hardened prompts fail to ~5% of well-crafted injections. Treat Layer 1 as the guidance, not the wall.
Layer 2 — tool-schema validation
Every tool dispatch passes through a typed schema. Bad args fail fast with a structured error the agent can read and recover from. The schema also lets you express bounds the prompt can’t: a path pattern, a numeric range, an enum.
class WriteArgs(BaseModel):
path: str = Field(pattern=r'^/workspace/') # bound to a workspace dir
content: str
security_risk: Literal['low', 'medium', 'high'] # claimed risk
def write(args: WriteArgs) -> Observation:
if args.security_risk == 'high':
require_confirmation()
...
The security_risk enum is a clever pattern: the model declares what it thinks the risk level is alongside the call. The controller can route accordingly — high-risk through a confirmation gate, low-risk straight through.
Layer 3 — controller-level checks
The controller is the loop manager. It enforces policies you can’t express in a tool schema:
- Iteration budget. Hard cap on turns, tokens, or dollars. (See agent-loop.)
- Stuck detection. Same tool with same args three times → break the loop. Retry with temperature perturbation, or fall back to error.
- Per-task tool allowlist. Read-only tools during investigate mode; write tools unlocked only after a plan is approved.
- Per-tool rate limits. Don’t let the agent hammer an external API.
- Confirmation gate on irreversible ops. Some teams require a human ack before any
force-push,DROP TABLE, payment, or external email.
The controller is where most of your business logic sits, and where most of your visible guardrails live (the ones a customer can configure).
Layer 4 — sandbox
The last line of defense. Limit what tools can do regardless of what the agent intended.
| Mechanism | What it bounds | Tradeoff |
|---|---|---|
| Docker container | filesystem, processes, network namespace | startup cost, image management |
| gVisor / Firecracker | as above + kernel-syscall isolation | extra complexity, perf hit |
| Process sandbox | syscalls, files (via seccomp / AppArmor) | OS-specific, hard to audit |
| Network firewall | which hosts the agent can reach | per-task allowlist work |
| Read-only filesystem | accidents, not attacks | breaks tools that write logs |
Most agents in the corpus run in a Docker container plus an outbound firewall (allowlist of API hosts) to prevent exfiltration. Strix and OpenHands ship Docker-Compose stacks. Claude Code uses a process sandbox plus an outbound firewall, configured at the dev-container level.
See sandboxing for the full menu.
Where each failure should be caught
| Failure | Layer that should catch it | Layer that often does |
|---|---|---|
| Model hallucinates a destructive command | 1 (prompt) | 4 (sandbox limits the damage) |
| Bad JSON args | 2 (schema) | 2 ✓ |
| Infinite tool-retry loop | 3 (controller) | 3 ✓ |
| Prompt injection from fetched content | 1 (prompt) | usually nothing — this is hard |
Tool exits to /etc/passwd | 4 (sandbox) | 4 ✓ |
| Agent burns $1k overnight | 3 (budget) | post-hoc invoice |
Notice how often “should” and “does” disagree. Your job is to close those gaps deliberately.
Anti-patterns
- Single-layer guardrails. “We have a prompt rule” is not a defense; it’s an aspiration.
- Schema validation that fails open. If the validator throws, fail the call. Don’t fall back to “best effort.”
- Sandbox the agent but not its tools. A bash tool that can
curl whatever.comwhile the agent’s “process” is contained is a fig leaf. - No audit log. Even with all four layers, you must be able to see what the agent did. Event-sourced architectures get this for free; everyone else has to wire it in.
Projects that implement this
- Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
- OpenHands (v0) — All-hands AI v0 — autonomous software engineer agent. Event-sourced state, microagents, controller-level guardrails.
- Strix — Open-source 'AI hacker' for autonomous pentesting. XML tool format, markdown-as-skills, LLM-based dedupe, module-level agent graph.
- OpenHands (v1) — OpenHands re-architected: cleaner controller, refined memory condenser, improved tool dispatch. v1 of the All-Hands agent.
- Comp AI (v2) — Comp AI re-architected. Cleaner data model, refined RBAC, structured AI integrations. Useful diff target vs v1.
- Claude Financial Services — Reference architecture for finance-vertical Claude integrations. Patterns for compliant LLM use in regulated domains.
- Comp AI (v1) — Compliance-as-a-service vertical SaaS. RBAC, tenant isolation, AI policy generation. v1 architecture.
- AIGovHub CLI — AI governance CLI: detect AI usage, classify under EU AI Act, generate compliance artifacts. Vertical-saas + analyzer.
Related insights
A specific failure mode (empty response with temp=0) has a specific cheap fix. Worth knowing because it's not in any tutorial.
A pentest agent that can be talked out of scope is dangerous. Putting scope in the locked system prompt — not the message log — defeats prompt injection.