Agent Loop & LLM Layer

The "brain" of Strix: how one iteration of the ReAct loop actually runs, how state is carried between iterations, how the LLM is wrapped, and the tricks played on top (memory compression, LLM-based dedupe, streaming tool-call parsing, prompt caching).


1. The Agent Loop

Implemented in BaseAgent.agent_loop() at strix/agents/base_agent.py:152-260. It's an async def driven by streaming LLM output:

while True:
    if stop_requested: break                  # base_agent.py:163
    process_incoming_messages()               # :168 — other agents can poke us
    if waiting_for_input: await message()     # :170-172 — idle in interactive
    if should_stop(): break                   # :174-178 — completed / max_iter
    state.iteration += 1
    warn_if_nearing_budget()                  # :186-211 — "3 iters left!"
    should_finish = await _process_iteration(tracer)  # :214-217 wrapped in Task
    if should_finish: break

One _process_iteration does roughly:

  1. Call llm.generate(state.messages) as an async stream.
  2. Feed each chunk into the tracer so the TUI can render live.
  3. Stop streaming early once </function> is seen (≤5 chunks of slack, llm/llm.py:184-197) — saves tokens because only the first tool call is honored anyway.
  4. Build the completed response (stream_chunk_builder, llm/llm.py:201), extract token stats and thinking_blocks.
  5. Parse XML tool calls via parse_tool_invocations (llm/utils.py:80-107).
  6. process_tool_invocations(actions, history, state) — dispatch each action through the executor, append the <tool_result> to state.messages.
  7. Return should_finish (set by finish_scan / agent_finish).
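
A minimal sketch of that flow, assuming simplified collaborator interfaces. Only parse_tool_invocations and process_tool_invocations are names from the source; the tracer hook, build_response, and inject_tool_call_reminder (defined under the empty-response corrective below) are stand-ins:

from typing import Any

async def process_iteration(llm: Any, state: Any, history: Any, tracer: Any) -> bool:
    """Hypothetical skeleton of BaseAgent._process_iteration (base_agent.py)."""
    chunks = []
    async for chunk in llm.generate(state.messages):    # steps 1-3: stream w/ early stop
        chunks.append(chunk)
        tracer.on_chunk(chunk)                          # step 2: live TUI feed (assumed hook)
    response = llm.build_response(chunks)               # step 4: stream_chunk_builder et al.
    actions = parse_tool_invocations(response.content)  # step 5: XML tool-call parse
    if not actions:
        inject_tool_call_reminder(state.messages)       # empty-response corrective (below)
        return False
    # step 6: dispatch via the executor, appending <tool_result> turns to state.messages
    should_finish = process_tool_invocations(actions, history, state)
    return should_finish                                # step 7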

Stopping conditions

The loop exits when stop_requested is set, when should_stop() fires (state.completed or iteration >= max_iterations), or when _process_iteration returns should_finish (i.e. finish_scan / agent_finish ran).

Error handling

Each iteration runs wrapped in its own asyncio Task (base_agent.py:214-217); LLM-side errors go through the retry strategy in §3, and a hard failure is recorded in the state.llm_failed control flag (§2).

Empty-response corrective

If the LLM returns whitespace or no tool call (base_agent.py:379-393), the loop injects a synthetic user turn reminding the agent it must issue a tool call. This is a cheap guardrail against the model saying "Sure, I'll do that now" and stopping.
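
A sketch of what that injection amounts to (helper name and exact wording assumed; the real logic lives inline at base_agent.py:379-393):

def inject_tool_call_reminder(messages: list[dict]) -> None:
    """Append a synthetic user turn when the model produced no tool call."""
    messages.append({
        "role": "user",
        "content": (
            "Your previous response contained no tool call. Every message "
            "must invoke exactly one tool using the <function=...> format."
        ),
    })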


2. Agent State (strix/agents/state.py)

AgentState is a pydantic.BaseModel carrying the entire live state — not just the message log.

Field — Purpose
agent_id, agent_name, parent_id — Identity + position in the agent graph (state.py:13-15)
sandbox_id, sandbox_token, sandbox_info — Handle the executor uses to reach the FastAPI tool server (state.py:16-18)
task, iteration, max_iterations — Plan + budget (state.py:20-22)
completed, stop_requested, waiting_for_input, llm_failed — Control flags (state.py:23-26)
waiting_start_time, waiting_timeout — Idle-timeout tracking (state.py:27-28)
final_result: dict — Payload agent_finish writes back up (state.py:29)
messages: list[dict] — Full conversation history: role, content, and optional thinking_blocks (state.py:32)
context: dict — Free-form scratchpad (loaded skills, shared keys) (state.py:33)
actions_taken, observations, errors — Structured audit trail per iteration (state.py:38-76)

Serialized with model_dump() and stashed in the graph registry so parents can introspect children. Subagent state is handed in at construction via the agent_state kwarg (base_agent.py:67-74).
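
A condensed sketch of the model, using the field names from the table above (types and defaults are partly assumed; the full class carries more fields and validation):

from pydantic import BaseModel, Field

class AgentState(BaseModel):
    # identity + position in the agent graph
    agent_id: str
    agent_name: str
    parent_id: str | None = None
    # sandbox handle for reaching the FastAPI tool server
    sandbox_id: str | None = None
    sandbox_token: str | None = None
    sandbox_info: dict = Field(default_factory=dict)
    # plan + budget
    task: str = ""
    iteration: int = 0
    max_iterations: int = 300   # default per the guardrails table in §8
    # control flags
    completed: bool = False
    stop_requested: bool = False
    waiting_for_input: bool = False
    llm_failed: bool = False
    # outputs + history
    final_result: dict = Field(default_factory=dict)
    messages: list[dict] = Field(default_factory=list)
    context: dict = Field(default_factory=dict)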


3. LLM Wrapper (strix/llm/llm.py)

Non-trivial glue over litellm.acompletion. Responsibilities:

Provider resolution

resolve_strix_model() in llm/utils.py:47-61 checks STRIX_MODEL_MAP (keys like strix/claude, strix/gpt-5.4) and rewrites the litellm model string + api_base. Unknown names pass through to litellm directly, so the universe of usable providers is "whatever litellm supports" (OpenAI, Anthropic, Vertex, Bedrock, Azure, Ollama, OpenRouter, custom OpenAI-compatible endpoints).
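
The shape of that resolution, assuming a simple alias table (the map contents here are illustrative, not the real entries):

# Illustrative alias table; the real STRIX_MODEL_MAP lives in llm/utils.py.
STRIX_MODEL_MAP: dict[str, tuple[str, str | None]] = {
    "strix/claude": ("anthropic/claude-3-5-sonnet-20241022", None),
}

def resolve_strix_model(model: str, api_base: str | None = None) -> tuple[str, str | None]:
    """Rewrite a strix/* alias into a litellm model string + api_base."""
    if model in STRIX_MODEL_MAP:
        litellm_model, mapped_base = STRIX_MODEL_MAP[model]
        return litellm_model, mapped_base or api_base
    return model, api_base   # unknown names pass straight through to litellm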

Streaming + early stop

LLM.generate() (llm/llm.py:173-209) yields LLMResponse objects asynchronously. After each chunk, it concatenates the accumulated text and looks for </function>. Once seen, it stops streaming a few chunks later (buffer for closing tags). The rationale: the system prompt forbids multiple tool calls per message, so anything after the first </function> is throwaway.
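
A minimal sketch of the cutoff, assuming litellm-style streaming chunks (the slack handling is simplified relative to llm/llm.py:184-197):

STOP_TAG = "</function>"
SLACK_CHUNKS = 5   # buffer so trailing closing tags still arrive

async def stream_until_tool_call(stream):
    """Yield chunks, stopping a few chunks after the first </function>."""
    buffer, seen_at, count = "", None, 0
    async for chunk in stream:
        count += 1
        buffer += chunk.choices[0].delta.content or ""
        yield chunk
        if seen_at is None and STOP_TAG in buffer:
            seen_at = count
        if seen_at is not None and count - seen_at >= SLACK_CHUNKS:
            break   # anything past the first tool call is throwaway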

Final response construction

Once streaming stops, litellm's stream_chunk_builder reassembles the buffered chunks into a normal completion object (llm/llm.py:201); token usage and any thinking_blocks are then pulled off the result (cf. §1 step 4).
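
Assumed usage of that helper (stream_chunk_builder is litellm's public utility for exactly this):

import litellm

def build_final_response(chunks: list, messages: list[dict]):
    """Reassemble buffered stream chunks into a normal completion object."""
    response = litellm.stream_chunk_builder(chunks, messages=messages)
    return response.choices[0].message.content, response.usage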

System prompt assembly

_load_system_prompt (llm/llm.py:84-142) uses Jinja2:

result = env.get_template("system_prompt.jinja").render(
    get_tools_prompt=get_tools_prompt,            # function callback
    loaded_skill_names=list(skill_content.keys()),
    interactive=self.config.interactive,
    system_prompt_context=self._system_prompt_context,
    **skill_content,                               # skill md as template vars
)

The skill set is computed by _get_skills_to_load() (llm.py:111-125):

ordered_skills = [*self._active_skills]
ordered_skills.append(f"scan_modes/{self.config.scan_mode}")
if self.config.is_whitebox:
    ordered_skills.append("coordination/source_aware_whitebox")
    ordered_skills.append("custom/source_aware_sast")

i.e. user-requested skills → scan mode skill → whitebox coordination skills, deduplicated, with a max of 5 per agent (enforced at load_skill time, skills/__init__.py:63-78).

Message construction

Before sending (llm.py:211-239):

  1. System message with rendered prompt.
  2. Identity block (agent metadata — id, name, parent, sandbox_id) as a hidden marker the model can introspect if needed.
  3. Run MemoryCompressor on the message list (see §4).
  4. If provider supports it (Anthropic), attach cache_control: {"type": "ephemeral"} to the system message. That makes the giant jinja-rendered system prompt cacheable between turns — a big cost win since it can run several MB.
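
A sketch of item 4, using litellm's content-block form for Anthropic cache markers (variable and function names assumed):

def cached_system_message(rendered_prompt: str) -> dict:
    """Wrap the rendered system prompt in an ephemeral Anthropic cache block."""
    return {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": rendered_prompt,                  # the jinja output from above
                "cache_control": {"type": "ephemeral"},   # cacheable between turns
            }
        ],
    }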

Token & cost accounting

RequestStats dataclass (llm.py:44-58) accumulates input_tokens, output_tokens, cached_tokens, and dollar cost. Extracted from response.usage (regular + prompt_tokens_details.cached_tokens) and litellm.completion_cost() at llm.py:278-315.
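
A sketch of the accumulator (the token field names are from the source; the cost field name and record() helper are assumptions):

from dataclasses import dataclass

import litellm

@dataclass
class RequestStats:
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    cost_usd: float = 0.0

    def record(self, response) -> None:
        usage = response.usage
        self.input_tokens += usage.prompt_tokens
        self.output_tokens += usage.completion_tokens
        details = getattr(usage, "prompt_tokens_details", None)
        if details is not None:
            self.cached_tokens += getattr(details, "cached_tokens", 0) or 0
        self.cost_usd += litellm.completion_cost(completion_response=response)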

Retry strategy

At llm.py:156-172: exponential backoff min(90, 2 * (2**attempt)), default max 5 retries (STRIX_LLM_MAX_RETRIES env), only on statuses that litellm._should_retry() considers retryable.
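
The same policy as a standalone sketch (helper name assumed; litellm._should_retry is the status check the source cites):

import asyncio
import os

import litellm

MAX_RETRIES = int(os.getenv("STRIX_LLM_MAX_RETRIES", "5"))

async def call_with_retries(make_request):
    for attempt in range(MAX_RETRIES):
        try:
            return await make_request()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status is None or not litellm._should_retry(status):
                raise
            await asyncio.sleep(min(90, 2 * (2 ** attempt)))   # 2s, 4s, 8s, ... capped at 90s
    raise RuntimeError("LLM call failed after retries")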

Reasoning effort

Three-tier resolution (llm.py:74-82):

  1. Config.get("strix_reasoning_effort") — explicit env var.
  2. LLMConfig.reasoning_effort — programmatic override.
  3. Default by scan mode: quick → medium, else high.

Only applied if the provider advertises reasoning via supports_reasoning() (llm capability probe, llm.py:340-344).
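
The three tiers as a sketch (the env var spelling and helper name are assumptions; gate the result on supports_reasoning() as described above):

import os

def resolve_reasoning_effort(config_override: str | None, scan_mode: str) -> str:
    env_value = os.getenv("STRIX_REASONING_EFFORT")          # tier 1: explicit env var
    if env_value:
        return env_value
    if config_override:                                      # tier 2: LLMConfig.reasoning_effort
        return config_override
    return "medium" if scan_mode == "quick" else "high"      # tier 3: scan-mode default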


4. Memory Compression (strix/llm/memory_compressor.py)

The compressor runs every turn before the request is sent, and short-circuits when the total token count of the history is under 90k.
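
The short-circuit, sketched with litellm's token counter (the compression pass itself is elided and its entry point assumed):

import litellm

TOKEN_BUDGET = 90_000   # threshold per memory_compressor.py:12-13

def maybe_compress(messages: list[dict], model: str) -> list[dict]:
    if litellm.token_counter(model=model, messages=messages) < TOKEN_BUDGET:
        return messages   # under budget: send history untouched
    return compress_with_llm(messages)

def compress_with_llm(messages: list[dict]) -> list[dict]:
    """Hypothetical LLM summarization pass; the real logic is the rest of the module."""
    raise NotImplementedError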

This is one of the few places Strix spends tokens on "meta" LLM work — accepted trade-off vs. rule-based truncation that would inevitably drop something important on a multi-hour scan.


5. LLM-based Deduplication (strix/llm/dedupe.py)

When a subagent emits a create_vulnerability_report, dedupe decides whether it matches an existing finding.

Why LLM-based rather than string hashing? Reports vary by wording, exploit payload, line number. A literal hash can't tell that two reports both describe the same missing auth check in different words.
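
The plausible shape of that check (prompt wording, model choice, and function name are all assumptions; the real prompt lives in strix/llm/dedupe.py):

import litellm

DEDUPE_PROMPT = """Do these two vulnerability reports describe the same underlying
finding? Answer with exactly YES or NO.

Existing report:
{existing}

New report:
{new}
"""

async def is_duplicate(existing: str, new: str, model: str) -> bool:
    resp = await litellm.acompletion(
        model=model,
        messages=[{"role": "user",
                   "content": DEDUPE_PROMPT.format(existing=existing, new=new)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")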


6. Tool-Call Format

Canonical format emitted by the LLM and parsed by utils.py:80-107:

<function=terminal_execute>
<parameter=command>sqlmap -u "https://target/item?id=1" -p id --batch</parameter>
<parameter=timeout>60</parameter>
</function>
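
A regex-based sketch of the parse (the real parser at utils.py:80-107 may differ in detail, but the truncate-to-one-call behavior is documented):

import re

FUNC_RE = re.compile(r"<function=([\w-]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([\w-]+)>(.*?)</parameter>", re.DOTALL)

def parse_tool_invocations(text: str) -> list[dict]:
    calls = []
    for name, body in FUNC_RE.findall(text):
        params = dict(PARAM_RE.findall(body))
        calls.append({"tool": name, "params": params})
    return calls[:1]   # one call per message; extras are dropped (utils.py:64-77)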

Design notes: the XML-ish syntax makes the end of a call trivially detectable mid-stream (hence the </function> early stop in §3), and since only one call per message is honored, the parser can simply drop anything after the first match (utils.py:64-77).

The inverse direction — tool → LLM — uses:

<tool_result>
  <tool_name>terminal_execute</tool_name>
  <result>… stdout …</result>
</tool_result>

If a tool result dict contains a screenshot key, the base64 image is hoisted into a vision message attached to the tool_result (tools/executor.py:227-256). Long outputs are truncated to the first 4KB + last 4KB (:246-249).
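
The truncation, sketched (function name assumed; the ellipsis marker is illustrative):

HEAD = TAIL = 4 * 1024   # keep first 4KB + last 4KB per tools/executor.py:246-249

def truncate_output(output: str) -> str:
    if len(output) <= HEAD + TAIL:
        return output
    omitted = len(output) - HEAD - TAIL
    return output[:HEAD] + f"\n... [{omitted} chars truncated] ...\n" + output[-TAIL:]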


7. Multi-Agent Coordination

The agent graph lives in module-level dicts on BaseAgent (base_agent.py:119-150, 456).

Spawning (via create_agent tool)

agents_graph_actions.create_agent (tools/agents_graph/agents_graph_actions.py:384-492) spawns a new StrixAgent in a background thread, passing parent_id + optionally inherited conversation history + a focused skill set (1–5 skills per the rules in root_agent.md). Subagents inherit the parent's sandbox_info — same container, same tool server, same bearer token.
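
In the tool-call format from §6, a spawn plausibly looks like this (parameter names are assumptions; the real schema is in agents_graph_actions.py):

<function=create_agent>
<parameter=task>Test /api/v2 object references for IDOR</parameter>
<parameter=skills>coordination/source_aware_whitebox</parameter>
</function>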

Messaging

send_message_to_agent and wait_for_message form the IPC. Messages are wrapped in an <inter_agent_message> XML block (base_agent.py:491-514) so the agent can syntactically tell them apart from tool results. Fields: from, content, message_type, priority, timestamp.
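
The field names are from the source; the exact nesting and values here are illustrative:

<inter_agent_message>
  <from>recon-subagent-1</from>
  <message_type>finding</message_type>
  <priority>high</priority>
  <timestamp>2025-01-01T12:00:00Z</timestamp>
  <content>Admin panel at /manage accepts default credentials.</content>
</inter_agent_message>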

Completion

agent_finish writes its payload into state.final_result (§2); the parent reads it back out of the serialized child state in the graph registry.

Why share a container?

The system prompt makes this explicit (system_prompt.jinja:233-238):

All agents run in the same shared Docker container for efficiency. Each agent has its own browser/terminal sessions. All agents share /workspace and proxy history.

Trade-off: cheap, discoverable coordination (see your sibling's traffic in Caido); no per-agent network isolation.


8. Guardrails (code-level)

Guardrail — Where
Iteration cap (default 300) with warnings at 85% and last 3 iters — base_agent.py:186-211, state.py:22
Waiting timeout in interactive mode — base_agent.py:261-285, state.py:119-135
One tool call per message (parser truncates, prompt repeats the rule) — utils.py:64-77, system_prompt.jinja:376-402
Empty-response corrective — base_agent.py:379-393
LLM retry with exponential backoff — llm.py:156-172
Max 5 skills per agent — skills/__init__.py:63-78
120s per-tool timeout in sandbox — runtime/tool_server.py:100-110, STRIX_SANDBOX_EXECUTION_TIMEOUT
Memory compression at 90k tokens — memory_compressor.py:12-13, 208
Response-size truncation for tool results (>10KB) — tools/executor.py:246-249
PII scrubbing on telemetry payloads — telemetry/utils.py (scrubadub + regex)
STRIX_LLM required / API key optional (supports IAM-based providers) — interface/main.py:52-255
Screenshot key redaction (key+value) before telemetry export — telemetry/utils.py

9. Things To Learn From / Pitfalls

Good ideas:

Early stream cutoff at the first </function>: saves tokens, since only the first tool call is honored anyway (§1, §3).
cache_control on the jinja-rendered system prompt: a big cost win given its size (§3).
LLM-based dedupe rather than string hashing: two reports describing the same missing auth check in different words still collapse to one finding (§5).
LLM-driven memory compression instead of rule-based truncation: nothing important gets silently dropped on a multi-hour scan (§4).

Potential pitfalls:

Shared container for all agents: cheap, discoverable coordination, but no per-agent network isolation (§7).
Compression and dedupe spend real tokens on "meta" LLM work every turn once the history is large (§4, §5).
Only the first tool call per message survives parsing; anything the model batches after it is thrown away (§6).