Agent Loop & LLM Layer
The "brain" of Strix: how one iteration of the ReAct loop actually runs, how state is carried between iterations, how the LLM is wrapped, and the tricks played on top (memory compression, LLM-based dedupe, streaming tool-call parsing, prompt caching).
1. The Agent Loop
Implemented in BaseAgent.agent_loop() at
strix/agents/base_agent.py:152-260. It's an async def driven by
streaming LLM output:
```python
while True:
    if stop_requested: break                   # base_agent.py:163
    process_incoming_messages()                # :168 — other agents can poke us
    if waiting_for_input: await message()      # :170-172 — idle in interactive
    if should_stop(): break                    # :174-178 — completed / max_iter
    state.iteration += 1
    warn_if_nearing_budget()                   # :186-211 — "3 iters left!"
    ok = await _process_iteration(tracer)      # :214-217 wrapped in Task
    if ok: break
```
One `_process_iteration` does roughly:

- Call `llm.generate(state.messages)` as an async stream.
- Feed each chunk into the tracer so the TUI can render live.
- Stop streaming early once `</function>` is seen (≤5 chunks of slack, llm/llm.py:184-197) — saves tokens because only the first tool call is honored anyway.
- Build the completed response (`stream_chunk_builder`, llm/llm.py:201), extract token stats and `thinking_blocks`.
- Parse XML tool calls via `parse_tool_invocations` (llm/utils.py:80-107).
- `process_tool_invocations(actions, history, state)` — dispatch each action through the executor, append the `<tool_result>` to `state.messages`.
- Return `should_finish` (set by `finish_scan`/`agent_finish`).
Stopping conditions

- `state.stop_requested` — user pressed Esc / Ctrl+C (base_agent.py:615-623)
- `state.completed` — set by the `finish_scan` tool
- `state.iteration >= state.max_iterations` (default 300, agents/state.py:22)
- Interactive-mode idle + message timeout (default 600s, state.py:28, 119-135)
Error handling
- `LLMRequestFailedError` → `_handle_llm_error()` (base_agent.py:568-601). In non-interactive mode: mark incomplete and raise. In interactive mode: enter a waiting state with `llm_failed=True` and let the human recover.
- `asyncio.CancelledError` is caught (base_agent.py:232) — a cancelled in-flight tool call still updates state gracefully.
- `RuntimeError`/`ValueError`/`TypeError` funnel into `_handle_iteration_error` (base_agent.py:603-613).
Empty-response corrective
If the LLM returns whitespace or no tool call
(base_agent.py:379-393), the loop injects a synthetic user turn
reminding the agent it must issue a tool call. This is a cheap guardrail
against the model saying "Sure, I'll do that now" and stopping.
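A minimal sketch of that guardrail, with hypothetical helper name and nudge wording (the real strings live in base_agent.py:379-393):

```python
# Hypothetical names; a sketch of the corrective, not the actual code.
NUDGE = (
    "Your previous message contained no tool call. Respond with "
    "exactly one <function=...> invocation."
)

def apply_empty_response_corrective(state, response_text: str, invocations: list) -> bool:
    """Inject a synthetic user turn when the model produced no actionable output."""
    if not response_text.strip() or not invocations:
        state.messages.append({"role": "user", "content": NUDGE})
        return True  # corrective injected; the loop runs another iteration
    return False
```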
2. Agent State (strix/agents/state.py)
AgentState is a pydantic.BaseModel carrying the entire live state —
not just the message log.
| Field | Purpose |
|---|---|
| `agent_id`, `agent_name`, `parent_id` | Identity + position in the agent graph (lines 13-15) |
| `sandbox_id`, `sandbox_token`, `sandbox_info` | Handle the executor uses to reach the FastAPI tool server (lines 16-18) |
| `task`, `iteration`, `max_iterations` | Plan + budget (lines 20-22) |
| `completed`, `stop_requested`, `waiting_for_input`, `llm_failed` | Control flags (lines 23-26) |
| `waiting_start_time`, `waiting_timeout` | Idle-timeout tracking (lines 27-28) |
| `final_result: dict` | Payload `agent_finish` writes back up (line 29) |
| `messages: list[dict]` | Full conversation history — role, content, and optional `thinking_blocks` (line 32) |
| `actions_taken`, `observations`, `errors` | Structured audit trail per iteration (lines 38-76) |
| `context: dict` | Free-form scratchpad (loaded skills, shared keys) (line 33) |
Serialized with model_dump() and stashed in the graph registry so parents
can introspect children. Subagent state is handed in at construction via
the agent_state kwarg (base_agent.py:67-74).
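A condensed sketch of the model, assuming pydantic v2; field names follow the table above, defaults are illustrative:

```python
from pydantic import BaseModel, Field

class AgentState(BaseModel):
    # Condensed sketch of the fields in the table above; defaults illustrative.
    agent_id: str
    agent_name: str
    parent_id: str | None = None
    sandbox_id: str | None = None
    task: str = ""
    iteration: int = 0
    max_iterations: int = 300
    completed: bool = False
    stop_requested: bool = False
    waiting_for_input: bool = False
    llm_failed: bool = False
    final_result: dict = Field(default_factory=dict)
    messages: list[dict] = Field(default_factory=list)
    context: dict = Field(default_factory=dict)

# Parents introspect children via serialized dumps:
snapshot = AgentState(agent_id="a1", agent_name="root").model_dump()
```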
3. LLM Wrapper (strix/llm/llm.py)
Non-trivial glue over litellm.acompletion. Responsibilities:
Provider resolution
resolve_strix_model() in llm/utils.py:47-61 checks
STRIX_MODEL_MAP (keys like strix/claude, strix/gpt-5.4) and rewrites
the litellm model string + api_base. Unknown names pass through to litellm
directly, so the universe of usable providers is "whatever litellm
supports" (OpenAI, Anthropic, Vertex, Bedrock, Azure, Ollama, OpenRouter,
custom OpenAI-compatible endpoints).
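A sketch of the resolution shape; the map entries here are invented placeholders, and only the alias-rewrite plus pass-through behavior is taken from the source:

```python
# Illustrative entries only; the real STRIX_MODEL_MAP differs.
STRIX_MODEL_MAP: dict[str, tuple[str, str | None]] = {
    "strix/claude": ("anthropic/claude-3-5-sonnet-latest", None),
}

def resolve_strix_model(model: str, api_base: str | None) -> tuple[str, str | None]:
    # Known strix/* aliases are rewritten to a litellm model string
    # (plus optional api_base); unknown names pass through untouched,
    # so litellm's whole provider universe stays available.
    if model in STRIX_MODEL_MAP:
        resolved_model, resolved_base = STRIX_MODEL_MAP[model]
        return resolved_model, resolved_base or api_base
    return model, api_base
```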
Streaming + early stop
LLM.generate() (llm/llm.py:173-209) yields LLMResponse objects
asynchronously. After each chunk, it concatenates the accumulated text and
looks for </function>. Once seen, it stops streaming a few chunks later
(buffer for closing tags). The rationale: the system prompt forbids
multiple tool calls per message, so anything after the first </function>
is throwaway.
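A sketch of the early-stop loop, assuming litellm-style streaming chunks (the slack constant and names paraphrase llm/llm.py:173-209):

```python
SLACK_CHUNKS = 5  # let a few more chunks arrive so closing tags aren't clipped

async def stream_until_first_function(stream) -> str:
    text, chunks_after_close = "", 0
    async for chunk in stream:
        text += chunk.choices[0].delta.content or ""
        if "</function>" in text:
            chunks_after_close += 1
            if chunks_after_close >= SLACK_CHUNKS:
                break  # only the first tool call is honored; the rest is throwaway
    return text
```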
Final response construction
- `stream_chunk_builder` from litellm reassembles a full response object so cost/usage stats are accurate (llm.py:201).
- Tool calls are parsed via `parse_tool_invocations` (utils.py:80-107) with regex: `<function=([^>]+)>` for the name, `<parameter=([^>]+)>(.*?)</parameter>` for the params (a sketch follows this list).
- Legacy-format shims (utils.py:12-31) convert old `<invoke name="X">...<parameter name="Y">` blocks emitted by older Claude versions into the canonical `<function=X>`/`<parameter=X>` format.
- Only the first `<function>...</function>` is kept (utils.py:64-77) — drops malformed / multi-call messages.
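A sketch assembled from the two regexes quoted above; the wrapper function itself is hypothetical:

```python
import html
import re

FUNC_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)

def parse_first_invocation(text: str) -> dict | None:
    # Only the first <function> block is honored; parameter values are
    # HTML-entity decoded so escaped payloads survive intact.
    match = FUNC_RE.search(text)
    if not match:
        return None
    name, body = match.groups()
    params = {k: html.unescape(v) for k, v in PARAM_RE.findall(body)}
    return {"name": name, "params": params}
```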
System prompt assembly
_load_system_prompt (llm/llm.py:84-142) uses Jinja2:
```python
result = env.get_template("system_prompt.jinja").render(
    get_tools_prompt=get_tools_prompt,  # function callback
    loaded_skill_names=list(skill_content.keys()),
    interactive=self.config.interactive,
    system_prompt_context=self._system_prompt_context,
    **skill_content,  # skill md as template vars
)
```

The skill set is computed by `_get_skills_to_load()` (llm.py:111-125):
```python
ordered_skills = [*self._active_skills]
ordered_skills.append(f"scan_modes/{self.config.scan_mode}")
if self.config.is_whitebox:
    ordered_skills.append("coordination/source_aware_whitebox")
    ordered_skills.append("custom/source_aware_sast")
```

i.e. user-requested skills → scan-mode skill → whitebox coordination skills, deduplicated, with a max of 5 per agent (enforced at `load_skill` time, skills/__init__.py:63-78).
Message construction
Before sending (llm.py:211-239):
- System message with rendered prompt.
- Identity block (agent metadata — id, name, parent, sandbox_id) as a hidden marker the model can introspect if needed.
- Run `MemoryCompressor` on the message list (see §4).
- If the provider supports it (Anthropic), attach `cache_control: {"type": "ephemeral"}` to the system message (wire format shown below). That makes the giant jinja-rendered system prompt cacheable between turns — a big cost win since it can run to several MB.
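What the cached system message looks like on the wire, per Anthropic's prompt-caching format (litellm passes `cache_control` through; the variable name here is illustrative):

```python
rendered_jinja_prompt = "..."  # the multi-skill system prompt from above

system_message = {
    "role": "system",
    "content": [
        {
            "type": "text",
            "text": rendered_jinja_prompt,           # big, static between turns
            "cache_control": {"type": "ephemeral"},  # provider caches this block
        }
    ],
}
```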
Token & cost accounting
RequestStats dataclass (llm.py:44-58) accumulates input_tokens,
output_tokens, cached_tokens, and dollar cost. Extracted from
response.usage (regular + prompt_tokens_details.cached_tokens) and
litellm.completion_cost() at llm.py:278-315.
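A sketch of the accumulator; the token field names are from the source, while the cost field name and `record()` helper are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RequestStats:
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    cost_usd: float = 0.0  # hypothetical field name

    def record(self, usage, cost: float) -> None:
        # usage is a litellm/OpenAI-style Usage object.
        self.input_tokens += usage.prompt_tokens
        self.output_tokens += usage.completion_tokens
        details = getattr(usage, "prompt_tokens_details", None)
        self.cached_tokens += getattr(details, "cached_tokens", 0) or 0
        self.cost_usd += cost
```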
Retry strategy
At llm.py:156-172: exponential backoff min(90, 2 * (2**attempt)),
default max 5 retries (STRIX_LLM_MAX_RETRIES env), only on statuses
that litellm._should_retry() considers retryable.
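A sketch of the described policy; the exception type and wiring are assumptions, while the backoff formula and the `litellm._should_retry()` gate are from the source:

```python
import asyncio

import litellm

async def call_with_retry(max_retries: int = 5, **completion_kwargs):
    for attempt in range(max_retries + 1):
        try:
            return await litellm.acompletion(**completion_kwargs)
        except litellm.exceptions.APIError as exc:  # assumed exception class
            status = getattr(exc, "status_code", None)
            if attempt == max_retries or not litellm._should_retry(status):
                raise
            await asyncio.sleep(min(90, 2 * (2 ** attempt)))
```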
Reasoning effort
Three-tier resolution (llm.py:74-82):
1. `Config.get("strix_reasoning_effort")` — explicit env var.
2. `LLMConfig.reasoning_effort` — programmatic override.
3. Default by scan mode: quick → medium, else high.
Only applied if the provider advertises reasoning via
supports_reasoning() (llm capability probe, llm.py:340-344).
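The fallthrough as a sketch; the env var name and function shape are assumptions paraphrasing llm.py:74-82:

```python
import os

def resolve_reasoning_effort(override: str | None, scan_mode: str) -> str:
    # 1. explicit env var  2. programmatic override  3. scan-mode default
    if from_env := os.environ.get("STRIX_REASONING_EFFORT"):  # assumed env var name
        return from_env
    if override:
        return override
    return "medium" if scan_mode == "quick" else "high"
```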
4. Memory Compression (strix/llm/memory_compressor.py)
The compressor runs every turn before the request is sent, and short-circuits when the total is under 90k tokens.
- Budget constants: `MAX_TOTAL_TOKENS = 100_000`, trigger at `0.9 * MAX = 90_000`; always keep at least 15 recent messages; keep the 3 most recent images (memory_compressor.py:12-13, 198, 208). The overall pass is sketched after this list.
- Older messages are chunked (10 at a time, :212) and summarized by an LLM call using `SUMMARY_PROMPT_TEMPLATE` (:15-43). The prompt explicitly enumerates what must survive: vulnerabilities, credentials, architecture, tool outputs, failed attempts, URLs, payloads, versions, error messages.
- Image handling (:134-149): older images are replaced with a text placeholder to avoid blowing up vision tokens.
- Timeout: 120s default (memory_compressor.py:161) so a slow compressor can't stall the loop indefinitely.
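The shape of the pass, using the constants above; the helper names and message roles are hypothetical:

```python
MAX_TOTAL_TOKENS = 100_000
TRIGGER_TOKENS = int(0.9 * MAX_TOTAL_TOKENS)  # 90k
KEEP_RECENT = 15
CHUNK_SIZE = 10

async def maybe_compress(messages: list[dict], count_tokens, summarize) -> list[dict]:
    if count_tokens(messages) < TRIGGER_TOKENS:
        return messages  # short-circuit: no meta LLM spend
    head, tail = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summaries = [
        {"role": "user", "content": await summarize(head[i : i + CHUNK_SIZE])}
        for i in range(0, len(head), CHUNK_SIZE)
    ]
    return summaries + tail
```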
This is one of the few places Strix spends tokens on "meta" LLM work — accepted trade-off vs. rule-based truncation that would inevitably drop something important on a multi-hour scan.
5. LLM-based Deduplication (strix/llm/dedupe.py)
When a subagent emits a create_vulnerability_report, dedupe decides
whether it matches an existing finding:
- Comparison criteria (dedupe.py:20-31):
  - Same root cause (not just the same class — "missing input validation" is the root, not "SQL injection")
  - Same affected component/endpoint/file
  - Same exploitation method
  - Would be fixed by the same code change
- Negative cases — keep separate: different endpoints, different parameters, different root causes (stored vs. reflected XSS), different auth contexts.
- The LLM call returns XML: `<is_duplicate>`, `<duplicate_id>`, `<confidence>`, `<reason>` (parsed at :111-139; a parsing sketch follows this list).
- A `DEDUPE_SYSTEM_PROMPT` (:14) tells the classifier model to be conservative — prefer splits over merges.
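A sketch of reading the verdict; the tag names are from the source, the parsing approach is an assumption:

```python
import re

def parse_dedupe_verdict(xml: str) -> dict:
    def tag(name: str) -> str | None:
        m = re.search(rf"<{name}>(.*?)</{name}>", xml, re.DOTALL)
        return m.group(1).strip() if m else None

    return {
        "is_duplicate": tag("is_duplicate") == "true",
        "duplicate_id": tag("duplicate_id"),
        "confidence": tag("confidence"),
        "reason": tag("reason"),
    }
```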
Why LLM-based rather than string hashing? Reports vary by wording, exploit payload, line number. A literal hash can't tell that two reports both describe the same missing auth check in different words.
6. Tool-Call Format
Canonical format emitted by the LLM and parsed by utils.py:80-107:
```xml
<function=terminal_execute>
<parameter=command>sqlmap -u "https://target/item?id=1" -p id --batch</parameter>
<parameter=timeout>60</parameter>
</function>
```

Design notes:
- Human-readable; every provider can emit it as plain text.
- Streams well — the TUI's streaming_parser.py renders parameters as they arrive.
- HTML-entity decoded after parsing so `&lt;`-escaped payloads pass through intact (utils.py:102).
- Strict one-call-per-message: the parser drops everything after the first `</function>` (utils.py:64-77), and the system prompt repeats the rule four times (system_prompt.jinja:376-402). This keeps the loop linearizable.
The inverse direction — tool → LLM — uses:
```xml
<tool_result>
<tool_name>terminal_execute</tool_name>
<result>… stdout …</result>
</tool_result>
```

If a tool result dict contains a `screenshot` key, the base64 image is
hoisted into a vision message attached to the tool_result
(tools/executor.py:227-256). Long outputs are truncated to first 4KB +
last 4KB (:246-249).
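A sketch of the head+tail truncation under the assumption of a simple character budget (the real cut points live at executor.py:246-249):

```python
def truncate_middle(output: str, keep: int = 4096) -> str:
    # Keep the first and last 4KB; long tool output rarely hides its
    # signal in the middle.
    if len(output) <= 2 * keep:
        return output
    omitted = len(output) - 2 * keep
    return f"{output[:keep]}\n… [{omitted} bytes truncated] …\n{output[-keep:]}"
```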
7. Multi-Agent Coordination
The agent graph lives in module-level dicts on BaseAgent:
- `_agent_graph` — parent→children adjacency
- `_agent_instances` — live `BaseAgent` objects
- `_agent_states` — serialized `AgentState` dumps
- `_agent_messages[agent_id] = list[Message]` — inter-agent mailbox

(base_agent.py:119-150, 456)
Spawning (via create_agent tool)
agents_graph_actions.create_agent (tools/agents_graph/agents_graph_actions.py:384-492) spawns a new StrixAgent in a
background thread, passing parent_id + optionally inherited conversation
history + a focused skill set (1–5 skills per the rules in root_agent.md).
Subagents inherit the parent's sandbox_info — same container, same tool
server, same bearer token.
Messaging
send_message_to_agent and wait_for_message form the IPC. Messages are
wrapped in an <inter_agent_message> XML block
(base_agent.py:491-514) so the agent can syntactically tell them
apart from tool results. Fields: from, content, message_type,
priority, timestamp.
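An illustrative rendering of the envelope; the real template lives at base_agent.py:491-514 and may differ in detail:

```python
from datetime import datetime, timezone

def wrap_inter_agent_message(sender: str, content: str,
                             message_type: str = "info",
                             priority: str = "normal") -> str:
    # Builds the XML wrapper agents use to distinguish peer messages
    # from tool results.
    ts = datetime.now(timezone.utc).isoformat()
    return (
        "<inter_agent_message>\n"
        f"  <from>{sender}</from>\n"
        f"  <message_type>{message_type}</message_type>\n"
        f"  <priority>{priority}</priority>\n"
        f"  <timestamp>{ts}</timestamp>\n"
        f"  <content>{content}</content>\n"
        "</inter_agent_message>"
    )
```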
Completion
- Subagents call `agent_finish` — records `result_summary` and findings, bubbles messages up to the parent (agents_graph_actions.py:567-685).
- The root agent calls `finish_scan` — synchronously verifies all subagents are completed before flushing the final report (tools/finish/finish_actions.py).
Why share a container?
The system prompt makes this explicit
(system_prompt.jinja:233-238):
> All agents run in the same shared Docker container for efficiency. Each agent has its own browser/terminal sessions. All agents share /workspace and proxy history.
Trade-off: cheap, discoverable coordination (see your sibling's traffic in Caido); no per-agent network isolation.
8. Guardrails (code-level)
| Guardrail | Where |
|---|---|
| Iteration cap (default 300) with warnings at 85% and last 3 iters | base_agent.py:186-211, state.py:22 |
| Waiting timeout in interactive mode | base_agent.py:261-285, state.py:119-135 |
| One tool call per message (parser truncates, prompt repeats the rule) | utils.py:64-77, system_prompt.jinja:376-402 |
| Empty-response corrective | base_agent.py:379-393 |
| LLM retry with exponential backoff | llm.py:156-172 |
| Max 5 skills per agent | skills/__init__.py:63-78 |
| 120s per-tool timeout in sandbox | runtime/tool_server.py:100-110, STRIX_SANDBOX_EXECUTION_TIMEOUT |
| Memory compression at 90k tokens | memory_compressor.py:12-13, 208 |
| Response-size truncation for tool results (>10KB) | tools/executor.py:246-249 |
| PII scrubbing on telemetry payloads | telemetry/utils.py (scrubadub + regex) |
| STRIX_LLM required / API key optional (supports IAM-based providers) | interface/main.py:52-255 |
| Screenshot key redaction (key+value) before telemetry export | telemetry/utils.py |
9. Things To Learn From / Pitfalls
Good ideas:
- Streaming the tool call with an early stop on `</function>` is a neat cost optimization that few other frameworks bother with.
- Prompt-caching the entire rendered jinja is a big win given how long Strix's system prompts get (especially with 5 skills loaded).
- LLM-driven compression with explicit preservation instructions is more reliable than sliding-window truncation for security contexts.
- The identity block stamped into every message (llm.py:215-226) lets the agent answer "who am I and who's my parent" without needing extra tools.
- XML tool calls are provider-agnostic — no vendor lock-in.
Potential pitfalls:
- Shared container across agents = shared browser cookies, shared proxy history. In a multi-target scan, one agent can inadvertently pollute another's state. Strix's answer is per-agent browser sessions, but tmux and Caido are global.
- LLM dedupe is stochastic — a run where dedupe decides "not duplicate" on a close call bloats the report, and vice versa.
- The default cap of 300 iterations is generous, but an agent can still burn its whole budget without progress if the LLM stalls. A shallow-termination detector (consecutive no-progress iterations) would help.
- Global module-level dicts for the agent graph make it hard to run two scans in one process. Fine for CLI, awkward for a library user.