Tools Subsystem
Tools are the LLM's effector system. Strix ships ~15 tool modules covering shells, browsers, HTTP proxy, Python runtime, file editing, notes, reporting, coordination, thinking, finish, todo, skill loading, and web search. This doc covers the common plumbing and then each module in turn.
1. Core Plumbing
1.1 Registry (strix/tools/registry.py)
Tools register at import time via a decorator:

```python
@register_tool(sandbox_execution=True)
async def terminal_execute(agent_state, command, timeout=60, ...):
    ...
```

The decorator (registry.py:190-251) does four things:
- Capability gating (:175-187). Skips registration if the tool requires a browser and `STRIX_DISABLE_BROWSER` is set, if the tool is `web_search` and `PERPLEXITY_API_KEY` is missing, or if running in "sandbox mode" (the FastAPI side) vs. host mode changes what's available.
- Schema loading — for `foo_actions.py` it reads `foo_actions_schema.xml` from the same directory (`_load_xml_schema`, :47-88).
- Parameter parsing — extracts `<parameter name="..." required="...">` entries for runtime validation (`_parse_param_schema`, :90-115).
- Indexing — appends to the global `tools` list and two lookup dicts: `_tools_by_name`, `_tool_param_schemas` (:239-240).
A ContextVar-based current-agent tracker lives at
strix/tools/context.py:1-13 for tools that need to find "which agent
called me" across asyncio.to_thread boundaries. Most tools don't use it
— they receive agent_state as a parameter when the executor detects the
parameter in their signature (registry.py:265-270).
1.2 Executor (strix/tools/executor.py)
The entry point for every tool call. execute_tool() (:29-115):

- Route decision — `should_execute_in_sandbox(tool_name)` (:273-277). If the tool's `sandbox_execution=True` and we're not already inside the sandbox (`STRIX_SANDBOX_MODE` env flag), go remote.
- Local path (:101-115):
  - Look up the function from the registry.
  - `convert_arguments(fn, kwargs)` — type-coerce strings from the LLM into the declared Python types (handles `Union`, `Optional`, JSON for lists/dicts, literal fallbacks). argument_parser.py:15-47.
  - Inject `agent_state` if the function signature asks for it (`needs_agent_state`, registry.py:265-270).
  - Await if coroutine, call otherwise.
- Remote path (:39-98):
  - httpx.AsyncClient POST to `{sandbox_url}/execute` with JSON `{"agent_id", "tool_name", "kwargs"}`. Authorization: `Bearer {agent_state.sandbox_token}`.
  - 150s total timeout (120s server timeout + 30s buffer), 10s connect timeout.
  - On `{"error": ...}`, raise `RuntimeError`.
- Result formatting — `_format_tool_result` (:227-256):
  - If the result is `{"screenshot": "<base64>", …}`, extract the image into a vision content block and strip it from the text result.
  - Truncate results >10KB to the first 4KB + ellipsis + last 4KB.
  - Wrap in `<tool_result><tool_name>X</tool_name><result>Y</result></tool_result>`.
- History update — append to `conversation_history` as a user-role message; if images were extracted, the message is multi-part (:313-342).
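The result-formatting step can be sketched as a single function. This is a simplified illustration of the behavior described above, not the real `_format_tool_result`:

```python
from typing import Any

def format_tool_result(tool_name: str, result: Any,
                       max_len: int = 10_000, keep: int = 4_000) -> tuple[str, list[str]]:
    """Illustrative sketch: pull screenshots out, truncate, wrap in XML."""
    images: list[str] = []
    # A base64 screenshot becomes a vision content block, not text.
    if isinstance(result, dict) and "screenshot" in result:
        images.append(result.pop("screenshot"))
    text = str(result)
    # Keep head and tail of oversized results so context isn't blown.
    if len(text) > max_len:
        text = text[:keep] + "\n…[truncated]…\n" + text[-keep:]
    wrapped = (f"<tool_result><tool_name>{tool_name}</tool_name>"
               f"<result>{text}</result></tool_result>")
    return wrapped, images
```

The head+tail truncation is a deliberate choice: the start of a tool's output usually carries the command echo and headers, while the end carries the final status or error.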
1.3 Argument Parser (strix/tools/argument_parser.py:15-47)
Strings from the LLM are coerced to the declared types:

- `int`/`float` — `int(v)` / `float(v)`.
- `bool` — `v.lower() in {"true","1","yes"}`.
- `list`/`dict` — `json.loads(v)` first, then `ast.literal_eval` fallback.
- `Union[str, int]` — tries str first, falls back to int parsing.
- `Optional[X]` — empty string → `None`.
Not a full schema validator — it relies on the LLM following the XML
schema + its own system-prompt instructions. The validation that does
exist is at executor.py:130-162: checks required params are present
and no unknown ones, returning human-readable error messages with schema
hints back to the LLM.
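The coercion rules above can be sketched in one recursive function. This is an assumption-laden illustration of the described behavior, not the actual `argument_parser.py` code:

```python
import ast
import json
from typing import Any, Optional, Union, get_args, get_origin

def coerce(value: str, annotation: Any) -> Any:
    """Illustrative sketch of the coercion rules described above."""
    if get_origin(annotation) is Union:  # covers Optional[X] too
        if type(None) in get_args(annotation) and value == "":
            return None  # Optional[X]: empty string means "not provided"
        for arg in (a for a in get_args(annotation) if a is not type(None)):
            try:
                return coerce(value, arg)  # e.g. Union[str, int] tries str first
            except (ValueError, json.JSONDecodeError, SyntaxError):
                continue
        raise ValueError(f"cannot coerce {value!r} to {annotation}")
    if annotation is bool:
        return value.lower() in {"true", "1", "yes"}
    if annotation in (int, float):
        return annotation(value)
    if annotation in (list, dict) or get_origin(annotation) in (list, dict):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return ast.literal_eval(value)  # tolerant fallback for Python-ish literals
    return value  # str and unknown types pass through unchanged
```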
1.4 The Schema Contract
Every tool module foo_actions.py has a sibling foo_actions_schema.xml.
Shape:

```xml
<tools>
  <tool name="terminal_execute">
    <description>Execute a shell command in a persistent tmux session.</description>
    <parameters>
      <parameter name="command" type="string" required="true">
        <description>The shell command to run.</description>
      </parameter>
      <parameter name="timeout" type="integer" required="false">
        <description>Max seconds to wait before returning.</description>
      </parameter>
    </parameters>
    <returns type="Dict[str, Any]">…</returns>
    <examples>…</examples>
  </tool>
</tools>
```

The registry assembles these into the tools prompt via get_tools_prompt() (registry.py:280-300):
- Groups by module (`agents_graph`, `browser`, `terminal`, `proxy`, ...).
- Wraps each module's tools in a module tag: `<agents_graph_tools>…</agents_graph_tools>`.
- Concatenates everything, injected into the system prompt via the jinja `get_tools_prompt` callback in llm.py:100-106.
This means the LLM sees the full XML spec of every available tool on
every turn (modulo prompt caching — Anthropic ephemeral blocks let
providers reuse the cached system prompt).
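The module-wrapping step can be sketched like this. It is a minimal illustration of the grouping described above, not the real `get_tools_prompt`:

```python
import xml.etree.ElementTree as ET

# A toy schema standing in for a real foo_actions_schema.xml file.
SCHEMA = """<tools>
  <tool name="terminal_execute">
    <description>Execute a shell command.</description>
  </tool>
</tools>"""

def build_module_prompt(module: str, schema_xml: str) -> str:
    """Wrap a module's raw <tool> specs in a <module>_tools tag."""
    root = ET.fromstring(schema_xml)
    body = "".join(ET.tostring(tool, encoding="unicode") for tool in root)
    return f"<{module}_tools>{body}</{module}_tools>"

prompt = build_module_prompt("terminal", SCHEMA)
```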
2. Tool Modules
Each module has `*_actions.py` (implementation) and `*_actions_schema.xml` (the LLM-facing spec). The `sandbox_execution` flag in the decorator determines routing.
2.1 agents_graph — Multi-agent coordination (local)
strix/tools/agents_graph/agents_graph_actions.py. Local-only — needs
direct access to the module-level agent graph.
| Action | Purpose |
|---|---|
| `create_agent(task, name, inherit_context, skills)` | Spawn a subagent in a background thread with a focused task and up to 5 skills. Inherits the parent's sandbox handle. (:384-492) |
| `agent_finish(result_summary, findings, …)` | Subagent signals completion; the result propagates to the parent via an inter-agent message. (:567-685) |
| `send_message_to_agent(to, content, priority)` | Push a message into another agent's mailbox. |
| `wait_for_message(timeout)` | Idle until the mailbox has something (interactive-mode idle). |
| `view_agent_graph()` | Dump the full tree with statuses — used by root agents to decide when to close out. |
2.2 browser — Playwright automation (sandbox)
strix/tools/browser/. A single tool browser_action(action, url, ...) with ~22 sub-actions: launch, goto, click, type, fill, scroll, execute_js, view_source, screenshot, save_pdf, wait_for, new_tab, switch_tab, close_tab, evaluate, intercept_requests, …

- A persistent multi-tab browser instance is kept across calls (browser_instance.py, tab_manager.py).
- Every action captures a screenshot into the result — the executor then promotes it to a vision message.
- Runs Chromium pre-installed in the image; NSS certs from Caido are injected at entrypoint so HTTPS is MITM-intercepted by default (docker-entrypoint.sh:149-152).
2.3 terminal — tmux sessions (sandbox)
strix/tools/terminal/. Tool: terminal_execute(command, is_input, timeout, terminal_id).

- Backed by tmux; state (CWD, env, running jobs) persists across calls.
- `is_input=true` sends text into a running foreground process (interacting with sqlmap prompts, etc.).
- Special key syntax (`C-c`, `C-d`, `Enter`, `F1`) is handled without the `is_input` flag.
- `timeout` goes up to 60s; the command keeps running in the background if it exceeds the timeout, so the agent can poll again.
- Terminal output is ANSI-parsed server-side by pyte (a dependency) to feed the TUI a clean replay.
2.4 proxy — Caido HTTP proxy (sandbox)
strix/tools/proxy/. Interacts with Caido's GraphQL API (port 48080).
| Tool | Purpose |
|---|---|
| `list_requests` | HttpQL-filtered request log with pagination |
| `view_request(id, part)` | View a captured request or response (`part` = "request" or "response") |
| `send_request` / `repeat_request` | Craft or replay requests |
| `scope_rules` | Manage Caido scope for noise reduction |
All system traffic (curl, httpx, browser) flows through Caido because the
entrypoint sets http_proxy/https_proxy system-wide
(docker-entrypoint.sh:115-144).
2.5 python — persistent IPython REPLs (sandbox)
strix/tools/python/. python_action(action, code, session_id):

- `new_session` → fresh IPython kernel
- `execute` → run code in that kernel (state persists: variables, imports)
- `close` → kill the kernel

Pre-imports proxy helpers so agents can analyze/replay captured traffic from inside Python.
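The session lifecycle can be sketched with the stdlib `code.InteractiveInterpreter` standing in for the real IPython kernels (an assumption made purely to keep the sketch dependency-free):

```python
import code
import contextlib
import io

# session_id -> interpreter; state (variables, imports) persists per session.
_sessions: dict[str, code.InteractiveInterpreter] = {}

def python_action(action: str, code_str: str = "", session_id: str = "default") -> str:
    if action == "new_session":
        _sessions[session_id] = code.InteractiveInterpreter()
        return f"session {session_id} created"
    if action == "execute":
        interp = _sessions[session_id]
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            interp.runsource(code_str)  # executes in the session's namespace
        return buf.getvalue()
    if action == "close":
        _sessions.pop(session_id, None)
        return f"session {session_id} closed"
    raise ValueError(f"unknown action: {action}")
```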
2.6 file_edit — OpenHands ACI (sandbox)
strix/tools/file_edit/. Three tools:
| Tool | Purpose |
|---|---|
| `str_replace_editor(command, path, ...)` | view, create, str_replace, insert, undo_edit |
| `list_files(path, recursive)` | Directory listing |
| `search_files(path, regex, file_pattern)` | ripgrep-backed search |
Reuses the editor primitives from the OpenHands project (openhands-aci, a sandbox-only dependency).
2.7 notes — agent scratchpad (sandbox)
strix/tools/notes/. CRUD on categorized notes, persisted to a JSONL in
the run directory. Categories: general, findings, methodology,
questions, plan, wiki.
The wiki category is the shared repo memory that the
source_aware_whitebox skill mandates — a single note per repository
that every subagent reads-then-updates to share architecture/routing/sink
maps.
2.8 reporting — Vulnerability reports (sandbox)
strix/tools/reporting/. create_vulnerability_report with title,
severity, CVSS, endpoint, PoC code, remediation steps, code locations.
Uses the cvss dependency to compute scores. Routed through
llm/dedupe.py before being appended to the run's findings list.
2.9 finish — Scan completion (local)
strix/tools/finish/finish_actions.py. finish_scan(executive_summary, methodology, technical_analysis, recommendations):

- Only callable by the root agent.
- Validates that all subagents are `completed` before accepting — forces the root to clean up its tree.
- Writes the final report and flips the tracer to the completed state.
Subagents use agent_finish (the agents_graph module) instead.
2.10 thinking — Chain-of-thought scratchpad (local)
strix/tools/thinking/. think(thought) — a no-op tool whose only
purpose is to record the agent's reasoning step without it counting as a
substantive action. Encourages explicit planning.
2.11 todo — Structured task list (sandbox)
strix/tools/todo/. Create/update/complete todo items. The system
prompt instructs the root agent to maintain a todo list as part of
orchestration.
2.12 load_skill — Dynamic skill loading (local)
strix/tools/load_skill/load_skill_actions.py. The agent can pull additional markdown playbooks into its context mid-run:

- Validates that the requested skills exist.
- Caps total loaded skills at 5 (skills/__init__.py:63-78).
- Rebuilds the system prompt with the new skill set (`llm.add_skills` → `_load_system_prompt`).
- Updates `state.context["loaded_skills"]` for observability.
2.13 web_search — Perplexity (local)
strix/tools/web_search/. web_search(query) hits Perplexity's
sonar-reasoning-pro model. Registration is gated on
PERPLEXITY_API_KEY being set (registry.py:175-187). Useful for
"what's the latest CVE for Foo 1.2?"-style runtime queries.
3. Routing Summary: Host vs. Sandbox
| Tool | Routing | Why |
|---|---|---|
| `agents_graph.*` | Host | Needs direct access to `_agent_graph` globals |
| `thinking.think` | Host | Introspective only |
| `finish.finish_scan` | Host | Synchronous subagent validation |
| `load_skill` | Host | Swaps the in-process system prompt |
| `web_search` | Host | External HTTP call, no sandbox dep |
| `terminal.*`, `python.*`, `browser.*`, `proxy.*`, `file_edit.*`, `notes.*`, `reporting.*`, `todo.*` | Sandbox | Need filesystem/process isolation and the proxied network |
Note: the same Python implementation is reused on both sides — the
tool_server imports strix.tools and dispatches to the exact function
the executor would have called locally. The routing decision happens at
the caller, not the tool.
4. Transport Details
HTTP POST /execute request body:

```json
{
  "agent_id": "agent_abc123",
  "tool_name": "terminal_execute",
  "kwargs": {"command": "nmap -sV target.tld", "timeout": 60}
}
```

Response:

```json
{"result": {"stdout": "…", "exit_code": 0}, "error": null}
```

Per-agent cancellation (runtime/tool_server.py:94-97): if the same agent_id submits a new call while a previous one is in-flight, the server cancels the prior task. This stops long-running tools from bleeding into the next iteration when the user interrupts.
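The cancellation pattern can be sketched as follows. This is a self-contained illustration of the idea, not the tool_server's actual code:

```python
import asyncio

# agent_id -> in-flight task; a new call for the same agent cancels the old one.
_inflight: dict[str, asyncio.Task] = {}

async def execute_for_agent(agent_id: str, coro) -> object:
    prior = _inflight.get(agent_id)
    if prior is not None and not prior.done():
        prior.cancel()  # kill the previous in-flight call for this agent
    task = asyncio.create_task(coro)
    _inflight[agent_id] = task
    try:
        return await task
    finally:
        _inflight.pop(agent_id, None)

async def demo() -> tuple[str, bool]:
    async def slow() -> str:
        await asyncio.sleep(10)
        return "slow"

    async def fast() -> str:
        return "fast"

    first = asyncio.create_task(execute_for_agent("a1", slow()))
    await asyncio.sleep(0.01)  # let the slow call start
    second = await execute_for_agent("a1", fast())  # cancels the slow call
    try:
        await first
        cancelled = False
    except asyncio.CancelledError:
        cancelled = True
    return second, cancelled

second, first_cancelled = asyncio.run(demo())
```

Keying the in-flight table by agent_id is what makes this safe under multi-agent concurrency: one agent's new call never cancels a sibling's work.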
5. Design Observations
Good ideas:
- Schemas as data, not docstrings. XML files are the single source of truth for what the LLM sees; the Python functions can be refactored without touching the LLM contract.
- Same code, two routes. Tools are routing-agnostic — they don't know if they're running locally or over HTTP, which keeps the implementation simple.
- Per-agent cancellation in the server. Solves the "kill in-flight when user hits Esc" problem without needing a side-channel.
- Screenshot-as-result. Cleanest multimodal integration — the agent asks for a click, gets back text + an image, and can reason over what appeared on screen without any extra tool.
- Tool result truncation + XML wrap. Prevents huge tool outputs (a wordlist fuzz, a semgrep run) from blowing the context window while keeping machine-parseable structure.
Potential pitfalls:
- Only the first tool call per message is honored. If the LLM emits two (which happens with some providers under pressure), the second is silently dropped. The parser at least stops early on `</function>`, which minimizes waste, but no warning is emitted to the LLM.
- Argument type coercion is generous. Passing a string where an int is expected sometimes succeeds, sometimes fails with a bare `ValueError`. A stricter pre-check with a helpful error message would probably improve the LLM's self-correction.
- No circuit breaker on repeated tool errors. If a tool keeps failing (wrong path, bad syntax), the agent can burn iterations without the framework intervening.
- Tool schemas live as XML alongside the code but are not checked against the function signature at import time, so drift between XML and Python is possible. A lightweight `schema == signature` test would catch this.
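Such a drift check could look like the following sketch (the schema snippet and helper are hypothetical; the real test would iterate over every registered tool):

```python
import inspect
import xml.etree.ElementTree as ET

SCHEMA = """<tools>
  <tool name="terminal_execute">
    <parameters>
      <parameter name="command" type="string" required="true"/>
      <parameter name="timeout" type="integer" required="false"/>
    </parameters>
  </tool>
</tools>"""

def terminal_execute(agent_state, command, timeout=60):
    ...

def schema_matches_signature(fn, schema_xml: str, tool_name: str) -> bool:
    """XML parameter names must equal the signature's LLM-facing parameters."""
    root = ET.fromstring(schema_xml)
    tool = root.find(f"./tool[@name='{tool_name}']")
    xml_params = {p.get("name") for p in tool.findall("./parameters/parameter")}
    sig_params = {
        name for name in inspect.signature(fn).parameters
        if name != "agent_state"  # injected by the executor, not LLM-facing
    }
    return xml_params == sig_params

ok = schema_matches_signature(terminal_execute, SCHEMA, "terminal_execute")
```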