Agent Loop & LLM Layer
The "brain" of Strix: how one iteration of the ReAct loop actually runs, how state is carried between iterations, how the LLM is wrapped, and the tricks played on top (memory compression, LLM-based dedupe, streaming tool-call parsing, prompt caching).
1. The Agent Loop
Implemented in BaseAgent.agent_loop() at
strix/agents/base_agent.py:152-260. It's an async def driven by
streaming LLM output:
```python
while True:
    if stop_requested: break                   # base_agent.py:163
    process_incoming_messages()                # :168 — other agents can poke us
    if waiting_for_input: await message()      # :170-172 — idle in interactive
    if should_stop(): break                    # :174-178 — completed / max_iter
    state.iteration += 1
    warn_if_nearing_budget()                   # :186-211 — "3 iters left!"
    ok = await _process_iteration(tracer)      # :214-217 wrapped in Task
    if ok: break
```
One `_process_iteration` does roughly:

- Call `llm.generate(state.messages)` as an async stream.
- Feed each chunk into the tracer so the TUI can render live.
- Stop streaming early once `</function>` is seen (≤5 chunks of slack, llm/llm.py:184-197) — saves tokens because only the first tool call is honored anyway.
- Build the completed response (`stream_chunk_builder`, llm/llm.py:201), extract token stats and `thinking_blocks`.
- Parse XML tool calls via `parse_tool_invocations` (llm/utils.py:80-107).
- `process_tool_invocations(actions, history, state)` — dispatch each action through the executor, append the `<tool_result>` to `state.messages`.
- Return `should_finish` (set by `finish_scan`/`agent_finish`).
Stopping conditions

- `state.stop_requested` — user pressed Esc / Ctrl+C (base_agent.py:615-623)
- `state.completed` — set by the `finish_scan` tool
- `state.iteration >= state.max_iterations` (default 300, agents/state.py:22)
- Interactive-mode idle + message timeout (default 600s, state.py:28, 119-135)
Error handling
- `LLMRequestFailedError` → `_handle_llm_error()` (base_agent.py:568-601). In non-interactive mode: mark incomplete and raise. In interactive mode: enter a waiting state with `llm_failed=True` and let the human recover.
- `asyncio.CancelledError` is caught (base_agent.py:232) — a cancelled in-flight tool call still updates state gracefully.
- `RuntimeError`/`ValueError`/`TypeError` funnel into `_handle_iteration_error` (base_agent.py:603-613).
Empty-response corrective
If the LLM returns whitespace or no tool call
(base_agent.py:379-393), the loop injects a synthetic user turn
reminding the agent it must issue a tool call. This is a cheap guardrail
against the model saying "Sure, I'll do that now" and stopping.
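A minimal sketch of that guardrail, with hypothetical helper name and nudge wording (the real strings live in base_agent.py:379-393):

```python
# Hypothetical names; a sketch of the corrective, not the actual code.
NUDGE = (
    "Your previous message contained no tool call. Respond with "
    "exactly one <function=...> invocation."
)

def apply_empty_response_corrective(state, response_text: str, invocations: list) -> bool:
    """Inject a synthetic user turn when the model produced no actionable output."""
    if not response_text.strip() or not invocations:
        state.messages.append({"role": "user", "content": NUDGE})
        return True  # corrective injected; the loop runs another iteration
    return False
```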
2. Agent State (strix/agents/state.py)
AgentState is a pydantic.BaseModel carrying the entire live state —
not just the message log.
| Field | Purpose |
|---|---|
| `agent_id`, `agent_name`, `parent_id` | Identity + position in the agent graph (lines 13-15) |
| `sandbox_id`, `sandbox_token`, `sandbox_info` | Handle the executor uses to reach the FastAPI tool server (lines 16-18) |
| `task`, `iteration`, `max_iterations` | Plan + budget (lines 20-22) |
| `completed`, `stop_requested`, `waiting_for_input`, `llm_failed` | Control flags (lines 23-26) |
| `waiting_start_time`, `waiting_timeout` | Idle-timeout tracking (lines 27-28) |
| `final_result: dict` | Payload `agent_finish` writes back up (line 29) |
| `messages: list[dict]` | Full conversation history — role, content, and optional `thinking_blocks` (line 32) |
| `actions_taken`, `observations`, `errors` | Structured audit trail per iteration (lines 38-76) |
| `context: dict` | Free-form scratchpad (loaded skills, shared keys) (line 33) |
Serialized with model_dump() and stashed in the graph registry so parents
can introspect children. Subagent state is handed in at construction via
the agent_state kwarg (base_agent.py:67-74).
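A condensed sketch of the model, assuming pydantic v2; field names follow the table above, defaults are illustrative:

```python
from pydantic import BaseModel, Field

class AgentState(BaseModel):
    # Condensed sketch of the fields in the table above; defaults illustrative.
    agent_id: str
    agent_name: str
    parent_id: str | None = None
    sandbox_id: str | None = None
    task: str = ""
    iteration: int = 0
    max_iterations: int = 300
    completed: bool = False
    stop_requested: bool = False
    waiting_for_input: bool = False
    llm_failed: bool = False
    final_result: dict = Field(default_factory=dict)
    messages: list[dict] = Field(default_factory=list)
    context: dict = Field(default_factory=dict)

# Parents introspect children via serialized dumps:
snapshot = AgentState(agent_id="a1", agent_name="root").model_dump()
```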
3. LLM Wrapper (strix/llm/llm.py)
Non-trivial glue over litellm.acompletion. Responsibilities:
Provider resolution
resolve_strix_model() in llm/utils.py:47-61 checks
STRIX_MODEL_MAP (keys like strix/claude, strix/gpt-5.4) and rewrites
the litellm model string + api_base. Unknown names pass through to litellm
directly, so the universe of usable providers is "whatever litellm
supports" (OpenAI, Anthropic, Vertex, Bedrock, Azure, Ollama, OpenRouter,
custom OpenAI-compatible endpoints).
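A sketch of the resolution shape; the map entries here are invented placeholders, and only the alias-rewrite plus pass-through behavior is taken from the source:

```python
# Illustrative entries only; the real STRIX_MODEL_MAP differs.
STRIX_MODEL_MAP: dict[str, tuple[str, str | None]] = {
    "strix/claude": ("anthropic/claude-3-5-sonnet-latest", None),
}

def resolve_strix_model(model: str, api_base: str | None) -> tuple[str, str | None]:
    # Known strix/* aliases are rewritten to a litellm model string
    # (plus optional api_base); unknown names pass through untouched,
    # so litellm's whole provider universe stays available.
    if model in STRIX_MODEL_MAP:
        resolved_model, resolved_base = STRIX_MODEL_MAP[model]
        return resolved_model, resolved_base or api_base
    return model, api_base
```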
Streaming + early stop
LLM.generate() (llm/llm.py:173-209) yields LLMResponse objects
asynchronously. After each chunk, it concatenates the accumulated text and
looks for </function>. Once seen, it stops streaming a few chunks later
(buffer for closing tags). The rationale: the system prompt forbids
multiple tool calls per message, so anything after the first </function>
is throwaway.
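A sketch of the early-stop loop, assuming litellm-style streaming chunks (the slack constant and names paraphrase llm/llm.py:173-209):

```python
SLACK_CHUNKS = 5  # let a few more chunks arrive so closing tags aren't clipped

async def stream_until_first_function(stream) -> str:
    text, chunks_after_close = "", 0
    async for chunk in stream:
        text += chunk.choices[0].delta.content or ""
        if "</function>" in text:
            chunks_after_close += 1
            if chunks_after_close >= SLACK_CHUNKS:
                break  # only the first tool call is honored; the rest is throwaway
    return text
```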
Final response construction
- `stream_chunk_builder` from litellm reassembles a full response object so cost/usage stats are accurate (llm.py:201).
- Tool calls are parsed via `parse_tool_invocations` (utils.py:80-107) with regex: `<function=([^>]+)>` for the name, `<parameter=([^>]+)>(.*?)</parameter>` for the params (a sketch follows this list).
- Legacy-format shims (utils.py:12-31) convert old `<invoke name="X">...<parameter name="Y">` blocks emitted by older Claude versions into the canonical `<function=X>`/`<parameter=X>` format.
- Only the first `<function>...</function>` is kept (utils.py:64-77) — drops malformed / multi-call messages.
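A sketch assembled from the two regexes quoted above; the wrapper function itself is hypothetical:

```python
import html
import re

FUNC_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)

def parse_first_invocation(text: str) -> dict | None:
    # Only the first <function> block is honored; parameter values are
    # HTML-entity decoded so escaped payloads survive intact.
    match = FUNC_RE.search(text)
    if not match:
        return None
    name, body = match.groups()
    params = {k: html.unescape(v) for k, v in PARAM_RE.findall(body)}
    return {"name": name, "params": params}
```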
System prompt assembly
_load_system_prompt (llm/llm.py:84-142) uses Jinja2:
```python
result = env.get_template("system_prompt.jinja").render(
    get_tools_prompt=get_tools_prompt,  # function callback
    loaded_skill_names=list(skill_content.keys()),
    interactive=self.config.interactive,
    system_prompt_context=self._system_prompt_context,
    **skill_content,  # skill md as template vars
)
```

The skill set is computed by `_get_skills_to_load()` (llm.py:111-125):
```python
ordered_skills = [*self._active_skills]
ordered_skills.append(f"scan_modes/{self.config.scan_mode}")
if self.config.is_whitebox:
    ordered_skills.append("coordination/source_aware_whitebox")
    ordered_skills.append("custom/source_aware_sast")
```

i.e. user-requested skills → scan-mode skill → whitebox coordination skills, deduplicated, with a max of 5 per agent (enforced at `load_skill` time, skills/__init__.py:63-78).
Message construction
Before sending (llm.py:211-239):
- System message with rendered prompt.
- Identity block (agent metadata — id, name, parent, sandbox_id) as a hidden marker the model can introspect if needed.
- Run `MemoryCompressor` on the message list (see §4).
- If the provider supports it (Anthropic), attach `cache_control: {"type": "ephemeral"}` to the system message (wire format shown below). That makes the giant jinja-rendered system prompt cacheable between turns — a big cost win since it can run to several MB.
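What the cached system message looks like on the wire, per Anthropic's prompt-caching format (litellm passes `cache_control` through; the variable name here is illustrative):

```python
rendered_jinja_prompt = "..."  # the multi-skill system prompt from above

system_message = {
    "role": "system",
    "content": [
        {
            "type": "text",
            "text": rendered_jinja_prompt,           # big, static between turns
            "cache_control": {"type": "ephemeral"},  # provider caches this block
        }
    ],
}
```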
Token & cost accounting
RequestStats dataclass (llm.py:44-58) accumulates input_tokens,
output_tokens, cached_tokens, and dollar cost. Extracted from
response.usage (regular + prompt_tokens_details.cached_tokens) and
litellm.completion_cost() at llm.py:278-315.
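A sketch of the accumulator; the token field names are from the source, while the cost field name and `record()` helper are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RequestStats:
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    cost_usd: float = 0.0  # hypothetical field name

    def record(self, usage, cost: float) -> None:
        # usage is a litellm/OpenAI-style Usage object.
        self.input_tokens += usage.prompt_tokens
        self.output_tokens += usage.completion_tokens
        details = getattr(usage, "prompt_tokens_details", None)
        self.cached_tokens += getattr(details, "cached_tokens", 0) or 0
        self.cost_usd += cost
```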
Retry strategy
At llm.py:156-172: exponential backoff min(90, 2 * (2**attempt)),
default max 5 retries (STRIX_LLM_MAX_RETRIES env), only on statuses
that litellm._should_retry() considers retryable.
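A sketch of the described policy; the exception type and wiring are assumptions, while the backoff formula and the `litellm._should_retry()` gate are from the source:

```python
import asyncio

import litellm

async def call_with_retry(max_retries: int = 5, **completion_kwargs):
    for attempt in range(max_retries + 1):
        try:
            return await litellm.acompletion(**completion_kwargs)
        except litellm.exceptions.APIError as exc:  # assumed exception class
            status = getattr(exc, "status_code", None)
            if attempt == max_retries or not litellm._should_retry(status):
                raise
            await asyncio.sleep(min(90, 2 * (2 ** attempt)))
```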
Reasoning effort
Three-tier resolution (llm.py:74-82):
1. `Config.get("strix_reasoning_effort")` — explicit env var.
2. `LLMConfig.reasoning_effort` — programmatic override.
3. Default by scan mode: quick → medium, else high.
Only applied if the provider advertises reasoning via
supports_reasoning() (llm capability probe, llm.py:340-344).
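The fallthrough as a sketch; the env var name and function shape are assumptions paraphrasing llm.py:74-82:

```python
import os

def resolve_reasoning_effort(override: str | None, scan_mode: str) -> str:
    # 1. explicit env var  2. programmatic override  3. scan-mode default
    if from_env := os.environ.get("STRIX_REASONING_EFFORT"):  # assumed env var name
        return from_env
    if override:
        return override
    return "medium" if scan_mode == "quick" else "high"
```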
4. Memory Compression (strix/llm/memory_compressor.py)
The compressor runs every turn before the request is sent, and short-circuits when the total is under 90k tokens.
- Budget constants: `MAX_TOTAL_TOKENS = 100_000`, trigger at `0.9 * MAX = 90_000`; always keep at least 15 recent messages; keep the 3 most recent images (memory_compressor.py:12-13, 198, 208). The overall pass is sketched after this list.
- Older messages are chunked (10 at a time, :212) and summarized by an LLM call using `SUMMARY_PROMPT_TEMPLATE` (:15-43). The prompt explicitly enumerates what must survive: vulnerabilities, credentials, architecture, tool outputs, failed attempts, URLs, payloads, versions, error messages.
- Image handling (:134-149): older images are replaced with a text placeholder to avoid blowing up vision tokens.
- Timeout: 120s default (memory_compressor.py:161) so a slow compressor can't stall the loop indefinitely.
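The shape of the pass, using the constants above; the helper names and message roles are hypothetical:

```python
MAX_TOTAL_TOKENS = 100_000
TRIGGER_TOKENS = int(0.9 * MAX_TOTAL_TOKENS)  # 90k
KEEP_RECENT = 15
CHUNK_SIZE = 10

async def maybe_compress(messages: list[dict], count_tokens, summarize) -> list[dict]:
    if count_tokens(messages) < TRIGGER_TOKENS:
        return messages  # short-circuit: no meta LLM spend
    head, tail = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summaries = [
        {"role": "user", "content": await summarize(head[i : i + CHUNK_SIZE])}
        for i in range(0, len(head), CHUNK_SIZE)
    ]
    return summaries + tail
```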
This is one of the few places Strix spends tokens on "meta" LLM work — accepted trade-off vs. rule-based truncation that would inevitably drop something important on a multi-hour scan.
5. LLM-based Deduplication (strix/llm/dedupe.py)
When a subagent emits a create_vulnerability_report, dedupe decides
whether it matches an existing finding:
- Comparison criteria (dedupe.py:20-31):
  - Same root cause (not just the same class — "missing input validation" is the root, not "SQL injection")
  - Same affected component/endpoint/file
  - Same exploitation method
  - Would be fixed by the same code change
- Negative cases — keep separate: different endpoints, different parameters, different root causes (stored vs. reflected XSS), different auth contexts.
- The LLM call returns XML: `<is_duplicate>`, `<duplicate_id>`, `<confidence>`, `<reason>` (parsed at :111-139; a parsing sketch follows this list).
- A `DEDUPE_SYSTEM_PROMPT` (:14) tells the classifier model to be conservative — prefer splits over merges.
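A sketch of reading the verdict; the tag names are from the source, the parsing approach is an assumption:

```python
import re

def parse_dedupe_verdict(xml: str) -> dict:
    def tag(name: str) -> str | None:
        m = re.search(rf"<{name}>(.*?)</{name}>", xml, re.DOTALL)
        return m.group(1).strip() if m else None

    return {
        "is_duplicate": tag("is_duplicate") == "true",
        "duplicate_id": tag("duplicate_id"),
        "confidence": tag("confidence"),
        "reason": tag("reason"),
    }
```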
Why LLM-based rather than string hashing? Reports vary by wording, exploit payload, line number. A literal hash can't tell that two reports both describe the same missing auth check in different words.
6. Tool-Call Format
Canonical format emitted by the LLM and parsed by utils.py:80-107:
```xml
<function=terminal_execute>
<parameter=command>sqlmap -u "https://target/item?id=1" -p id --batch</parameter>
<parameter=timeout>60</parameter>
</function>
```

Design notes:
- Human-readable; every provider can emit it as plain text.
- Streams well — the TUI's streaming_parser.py renders parameters as they arrive.
- HTML-entity decoded after parsing so `&lt;`-escaped payloads pass through intact (utils.py:102).
- Strict one-call-per-message: the parser drops everything after the first `</function>` (utils.py:64-77), and the system prompt repeats the rule four times (system_prompt.jinja:376-402). This keeps the loop linearizable.
The inverse direction — tool → LLM — uses:
```xml
<tool_result>
<tool_name>terminal_execute</tool_name>
<result>… stdout …</result>
</tool_result>
```

If a tool result dict contains a `screenshot` key, the base64 image is
hoisted into a vision message attached to the tool_result
(tools/executor.py:227-256). Long outputs are truncated to first 4KB +
last 4KB (:246-249).
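A sketch of the head+tail truncation under the assumption of a simple character budget (the real cut points live at executor.py:246-249):

```python
def truncate_middle(output: str, keep: int = 4096) -> str:
    # Keep the first and last 4KB; long tool output rarely hides its
    # signal in the middle.
    if len(output) <= 2 * keep:
        return output
    omitted = len(output) - 2 * keep
    return f"{output[:keep]}\n… [{omitted} bytes truncated] …\n{output[-keep:]}"
```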
7. Multi-Agent Coordination
The agent graph lives in module-level dicts on BaseAgent:
- `_agent_graph` — parent→children adjacency
- `_agent_instances` — live `BaseAgent` objects
- `_agent_states` — serialized `AgentState` dumps
- `_agent_messages[agent_id] = list[Message]` — inter-agent mailbox

(base_agent.py:119-150, 456)
Spawning (via create_agent tool)
agents_graph_actions.create_agent (tools/agents_graph/agents_graph_actions.py:384-492) spawns a new StrixAgent in a
background thread, passing parent_id + optionally inherited conversation
history + a focused skill set (1–5 skills per the rules in root_agent.md).
Subagents inherit the parent's sandbox_info — same container, same tool
server, same bearer token.
Messaging
send_message_to_agent and wait_for_message form the IPC. Messages are
wrapped in an <inter_agent_message> XML block
(base_agent.py:491-514) so the agent can syntactically tell them
apart from tool results. Fields: from, content, message_type,
priority, timestamp.
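An illustrative rendering of the envelope; the real template lives at base_agent.py:491-514 and may differ in detail:

```python
from datetime import datetime, timezone

def wrap_inter_agent_message(sender: str, content: str,
                             message_type: str = "info",
                             priority: str = "normal") -> str:
    # Builds the XML wrapper agents use to distinguish peer messages
    # from tool results.
    ts = datetime.now(timezone.utc).isoformat()
    return (
        "<inter_agent_message>\n"
        f"  <from>{sender}</from>\n"
        f"  <message_type>{message_type}</message_type>\n"
        f"  <priority>{priority}</priority>\n"
        f"  <timestamp>{ts}</timestamp>\n"
        f"  <content>{content}</content>\n"
        "</inter_agent_message>"
    )
```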
Completion
- Subagents call `agent_finish` — records `result_summary` and findings, bubbles messages up to the parent (agents_graph_actions.py:567-685).
- The root agent calls `finish_scan` — synchronously verifies all subagents are completed before flushing the final report (tools/finish/finish_actions.py).
Why share a container?
The system prompt makes this explicit
(system_prompt.jinja:233-238):
> All agents run in the same shared Docker container for efficiency. Each agent has its own browser/terminal sessions. All agents share /workspace and proxy history.
Trade-off: cheap, discoverable coordination (see your sibling's traffic in Caido); no per-agent network isolation.
8. Guardrails (code-level)
| Guardrail | Where |
|---|---|
| Iteration cap (default 300) with warnings at 85% and last 3 iters | base_agent.py:186-211, state.py:22 |
| Waiting timeout in interactive mode | base_agent.py:261-285, state.py:119-135 |
| One tool call per message (parser truncates, prompt repeats the rule) | utils.py:64-77, system_prompt.jinja:376-402 |
| Empty-response corrective | base_agent.py:379-393 |
| LLM retry with exponential backoff | llm.py:156-172 |
| Max 5 skills per agent | skills/__init__.py:63-78 |
| 120s per-tool timeout in sandbox | runtime/tool_server.py:100-110, STRIX_SANDBOX_EXECUTION_TIMEOUT |
| Memory compression at 90k tokens | memory_compressor.py:12-13, 208 |
| Response-size truncation for tool results (>10KB) | tools/executor.py:246-249 |
| PII scrubbing on telemetry payloads | telemetry/utils.py (scrubadub + regex) |
| STRIX_LLM required / API key optional (supports IAM-based providers) | interface/main.py:52-255 |
| Screenshot key redaction (key+value) before telemetry export | telemetry/utils.py |
9. Things To Learn From / Pitfalls
Good ideas:
- Streaming the tool call with an early stop on `</function>` is a neat cost optimization that few other frameworks bother with.
- Prompt-caching the entire rendered jinja is a big win given how long Strix's system prompts get (especially with 5 skills loaded).
- LLM-driven compression with explicit preservation instructions is more reliable than sliding-window truncation for security contexts.
- The identity block stamped into every message (llm.py:215-226) lets the agent answer "who am I and who's my parent" without needing extra tools.
- XML tool calls are provider-agnostic — no vendor lock-in.
Potential pitfalls:
- Shared container across agents = shared browser cookies, shared proxy history. In a multi-target scan, one agent can inadvertently pollute another's state. Strix's answer is per-agent browser sessions, but tmux and Caido are global.
- LLM dedupe is stochastic — a run where dedupe decides "not duplicate" on a close call bloats the report, and vice versa.
- The default cap of 300 iterations is generous, but an agent can still burn its whole budget without progress if the LLM stalls. A shallow-termination detector (consecutive no-progress iterations) would help.
- Global module-level dicts for the agent graph make it hard to run two scans in one process. Fine for CLI, awkward for a library user.