Skills as markdown

Move agent expertise out of code and into versionable markdown. Domain experts contribute via PR; the loop almost never changes.


A small architectural bet pays back over years: store the agent’s expertise — methodologies, playbooks, attack patterns, code-review checklists — as markdown files in a skills/ directory, not as code. Each file has a name, a trigger, and a body. The agent loads matched skills into the system prompt at task start. Domain experts contribute via PR. The loop almost never changes; the library evolves weekly. This is the moat.

Skills as markdown

The pattern looks pedestrian at first. A skills/ folder. Inside, a few dozen .md files. Each has frontmatter (a name, a trigger), a body of guidance, and maybe some examples. The agent’s loop matches relevant skills to the current task and concatenates their bodies into the system prompt.

That’s it. The reason it’s an architectural decision worth naming is that it changes who can extend the agent.

Anatomy of a skill file

---
name: SQL injection testing
trigger: "user asks about SQL injection or database security"
priority: 1
---

# SQL Injection Testing Methodology

## When to use this skill
You're investigating a web application's database interaction surface.

## Approach
1. Identify entry points (form fields, URL params, headers).
2. Test for error-based injection first — most signal.
3. Move to UNION-based if errors are silent.
4. ...

The agent’s loop:

on task arrival:
  1. compute trigger matches against this skill library
  2. concatenate matched skill bodies into the system prompt
  3. start the loop
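A minimal runnable sketch of that loop in Python, assuming the simple key: value frontmatter shown above (a real implementation would use a YAML parser and a smarter matcher; load_skills, match_skills, and system_prompt_for are illustrative names):

import re
from pathlib import Path

# Trigger prose contains filler words; skip them when extracting keywords.
STOPWORDS = {"user", "asks", "about", "or", "and", "the", "a", "if"}

def load_skills(skills_dir="skills"):
    """Parse every skills/*.md file into a {name, trigger, body} dict."""
    skills = []
    for path in sorted(Path(skills_dir).glob("*.md")):
        _, frontmatter, body = path.read_text().split("---", 2)
        meta = dict(line.split(":", 1) for line in frontmatter.strip().splitlines())
        skills.append({
            "name": meta["name"].strip(),
            "trigger": meta["trigger"].strip().strip('"'),
            "body": body.strip(),
        })
    return skills

def match_skills(task, skills):
    """Naive trigger match: fire when any content word of the trigger appears."""
    task_lower = task.lower()
    matched = []
    for skill in skills:
        keywords = [w for w in re.findall(r"\w+", skill["trigger"].lower())
                    if w not in STOPWORDS]
        if any(w in task_lower for w in keywords):
            matched.append(skill)
    return matched

def system_prompt_for(task, base_prompt, skills):
    """Concatenate matched skill bodies into the system prompt, then start the loop."""
    matched = match_skills(task, skills)
    return base_prompt + "\n\n" + "\n\n".join(s["body"] for s in matched)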

Why this is a strong pattern

flowchart LR
  Author[Domain expert] -->|writes .md| Repo[skills/ dir]
  Repo -->|PR review| CI
  CI -->|merge| Repo
  Repo -->|matched triggers| Agent
  Agent -->|enriched context| LLM

Authoring lives in markdown PRs; the agent merely runs the loaded skill at the right moment.
  • Reviewability. A pentester reviews a .md PR. They never have to learn Python.
  • Versionability. git log skills/sql-injection.md shows the methodology’s evolution. No archaeology through prompt builders.
  • Locality. Skill, trigger, and rationale all live in one file. Code-based skills scatter across modules.
  • Forkability. A new project lifts a curated subset as a starting library.

Strix as the maximalist case

Strix takes this to ~30 markdown files in strix/skills/ covering web, network, mobile, and crypto attack methodologies. New methodologies arrive as PRs from researchers who never touch the agent’s Python code.

Claude Code’s skill system

Claude Code ships with built-in skills (e.g. init, review, security-review, claude-api) — markdown bundles a user invokes with /skill-name. User-defined skills live alongside in ~/.claude/skills/. Same shape, two scopes (vendor + user).
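A hypothetical user-defined skill in the same shape, say ~/.claude/skills/release-notes.md; the path, name, and frontmatter fields here mirror the anatomy above rather than Claude Code's exact schema, so treat them as illustrative:

---
name: release-notes
trigger: "user invokes /release-notes"
---

# Release Notes Drafting

1. Collect merged PR titles since the last tag.
2. Group by area; call out breaking changes first.
3. Draft in the project's existing changelog voice.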

Why this took until ~2024 to appear

Three preconditions had to land:

  1. Long enough context windows that loading a dozen skills fits.
  2. Prompt caching that makes repeated loading cheap.
  3. Agent loops stable enough that there’s somewhere for skills to plug in.

Pre-2023, the windows were too small. Pre-late-2023, caching was too immature. The pattern depended on infrastructure shifts.

Skill-loading strategies

Strategy | When | Cost
--- | --- | ---
Always load all | small library (<10 skills, <20K tok) | constant cache hit
Trigger-matched | medium library, fast triggers | cheap regex / keyword match per task
LLM-routed | large library, fuzzy triggers | extra LLM call to pick skills
User-invoked | skills with overhead per call | none until invoked (/skill-name)

Most production agents combine: cheap match for common tasks, LLM router for ambiguous ones, user invocation for skills the user knows are relevant.
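One way to wire that combination, reusing the sketch above; llm_pick_skills is a hypothetical helper that asks a small model to choose skill names from the library:

def select_skills(task, skills, invoked=None):
    """Cheap match first, LLM router for ambiguous tasks, /skill-name override."""
    if invoked:
        # User invocation: no inference needed, load exactly what was asked for.
        return [s for s in skills if s["name"] == invoked]
    matched = match_skills(task, skills)
    if matched:
        return matched
    # Nothing matched cheaply: spend one extra LLM call to route.
    chosen = llm_pick_skills(task, [s["name"] for s in skills])
    return [s for s in skills if s["name"] in chosen]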

Triggers — the underrated piece

A skill is only as useful as its trigger. Bad triggers either miss the right calls or fire too often, bloating context.

  • String match (cheap, brittle): “if user message contains ‘sql’ or ‘database’.”
  • Tag-based: skills declare tags; the agent’s planner emits a tag during planning (see the sketch after this list).
  • LLM-routed: ask a small auxiliary LLM “given this task, pick from this list.”
  • Manual: /security-review, no inference needed.
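A sketch of the tag-based variant, assuming skills declare a tags list in their frontmatter (an extension of the anatomy above) and reusing the skills list from load_skills:

def match_by_tags(plan_tags, skills):
    """A skill fires when it shares at least one tag with the planner's tags."""
    return [s for s in skills if set(s.get("tags", [])) & set(plan_tags)]

# e.g. a planning step emits ["web", "sqli"]; a skill tagged
# [web, sqli, recon] would be loaded.
matched = match_by_tags(["web", "sqli"], skills)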

Pick a skill strategy

How many skills, and how confident is your trigger?
  • Library is small + skills are general → Always load all. Predictable cache.
  • Library is moderate with clear keywords → String/tag-matched triggers.
  • Library is large with fuzzy applicability → LLM-routed triggers.
  • Skills are situational and the user knows when they apply → Manual invocation only.

Recommended default: Start with always-load-all. Only complicate when the library is too big to load comfortably.
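The always-load-all default is a one-liner against the same skill list, with prompt caching absorbing the repeated tokens:

def always_load_all(base_prompt, skills):
    """Concatenate every skill body; rely on prompt caching for cost."""
    return base_prompt + "\n\n" + "\n\n".join(s["body"] for s in skills)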

When markdown skills don’t help

  • Tiny agents. With five hardcoded prompts, the indirection isn't worth it.
  • Highly dynamic skills. If skill content depends on runtime data (DB schema, user config), it needs templating; markdown is the wrong format.
  • High-stakes determinism. If you need each invocation to be byte-identical (legal, compliance copy you must reproduce verbatim), code is more auditable than markdown.

Projects that implement this

  • Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
  • Strix — Open-source 'AI hacker' for autonomous pentesting. XML tool format, markdown-as-skills, LLM-based dedupe, module-level agent graph.
  • OpenClaw — Open-source Claude-Code-style agent reproduction. Bigger than NanoClaw, it reveals which patterns scale and which stay minimal.