Strix’s tool format is XML — <function=...><parameter=...>...</parameter></function>. The system prompt is explicit: one function call per assistant turn, no exceptions.
This makes a free cost optimization possible. Once the streaming parser sees </function>, the model has finished saying everything you care about. Anything after is hallucinated rambling — and you’re being billed for it.
buf = ""
async for chunk in stream:
    buf += chunk
    # Check the whole buffer, since the closing tag can be split across chunks.
    if '</function>' in buf:
        call = parse_xml_function(buf)
        await stream.aclose()  # ← bail
        return call
That’s it. One line of cancellation drops 10-20% of per-step output tokens (Strix’s reported figure).
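For a version that runs end to end, here is a minimal sketch of the same early-stop loop. It makes several assumptions: the stream is an async generator of text chunks, fake_stream stands in for a real provider stream, and parse_xml_function is a hypothetical regex-based parser (the source names the function but does not show it). It is not Strix's actual implementation.

```python
import asyncio
import re
from typing import AsyncIterator, List

# Hypothetical parser for the <function=...><parameter=...> format.
FUNCTION_RE = re.compile(r"<function=([\w.\-]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([\w.\-]+)>(.*?)</parameter>", re.DOTALL)


def parse_xml_function(text: str) -> dict:
    """Return the first complete function block as a name/params dict."""
    match = FUNCTION_RE.search(text)
    if match is None:
        raise ValueError("no complete <function=...> block found")
    name, body = match.groups()
    params = {key: value.strip() for key, value in PARAM_RE.findall(body)}
    return {"name": name, "params": params}


async def fake_stream(chunks: List[str], consumed: List[str]) -> AsyncIterator[str]:
    """Stand-in for a provider stream; records which chunks were pulled."""
    for chunk in chunks:
        consumed.append(chunk)
        yield chunk


async def read_one_call(stream) -> dict:
    """Accumulate chunks and stop as soon as the closing tag has arrived."""
    buf = ""
    try:
        async for chunk in stream:
            buf += chunk
            if "</function>" in buf:
                return parse_xml_function(buf)
    finally:
        # Close the stream on every exit path so nothing further is pulled.
        await stream.aclose()
    raise ValueError("stream ended before </function>")
```

Feeding it chunks such as "<function=run>", "<parameter=cmd>ls</para", "meter></func", "tion>", " trailing" returns the parsed call after the fourth chunk and never pulls the fifth. Whether closing the client side of a stream also stops billing depends on the provider, so treat the savings figure as something to measure.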
Why this isn’t possible with native tool_use
Anthropic’s tool_use blocks have an explicit content_block_stop event for each block, but the stream continues past it: a message_delta event delivers stop_reason and the final usage statistics, and message_stop closes the message. You can theoretically abort the stream after the tool_use block stops, but the stream architecture is built around the model deciding when it’s done. Strix’s win comes from forcing the contract via prompt design: the agent guarantees one call, so the tool boundary is also the conversation boundary.
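To make the trade-off concrete, here is a rough sketch of what aborting a native stream after the first tool_use block could look like. It assumes the events have already been decoded into plain dicts shaped like the streaming event types (content_block_start, content_block_delta with input_json_delta, content_block_stop). first_tool_call is a hypothetical helper, not part of any SDK.

```python
import json
from typing import Iterable, Optional


def first_tool_call(events: Iterable[dict]) -> Optional[dict]:
    """Return the first tool_use call and stop reading the event stream."""
    tool = None       # the tool_use block currently being assembled
    json_parts = []   # partial JSON fragments of the tool input
    for event in events:
        kind = event["type"]
        if kind == "content_block_start" and event["content_block"]["type"] == "tool_use":
            tool = event["content_block"]
            json_parts = []
        elif kind == "content_block_delta" and tool is not None:
            delta = event["delta"]
            if delta["type"] == "input_json_delta":
                json_parts.append(delta["partial_json"])
        elif kind == "content_block_stop" and tool is not None:
            raw = "".join(json_parts)
            # Returning here means later events are never read.
            return {"name": tool["name"], "input": json.loads(raw) if raw else {}}
    return None
```

Because the function returns at content_block_stop, the caller never sees the later message_delta event, so stop_reason and the final usage count go unobserved. Any further tool_use blocks the model intended to emit are lost as well.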
Why it works in practice
- Models are highly compliant when the system prompt is unambiguous about output format.
- The XML parser is simple to write defensively.
- Errors (model emits two function blocks anyway) are loud — parse fails — and recoverable.
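The "loud and recoverable" point can be sketched as a strict check plus a corrective message fed back to the model. This assumes the full model output is available for inspection; with early cancellation, a second block would never be generated. ToolFormatError, extract_single_call and corrective_message are hypothetical names chosen for illustration, not Strix's own.

```python
import re


class ToolFormatError(ValueError):
    """Raised when the model output violates the one-call contract."""


# Matches each complete <function=...>...</function> block, non-greedily.
BLOCK_RE = re.compile(r"<function=[^>]+>.*?</function>", re.DOTALL)


def extract_single_call(text: str) -> str:
    """Return the only function block in text, or raise a descriptive error."""
    blocks = BLOCK_RE.findall(text)
    if not blocks:
        raise ToolFormatError("no complete <function=...> block in output")
    if len(blocks) > 1:
        raise ToolFormatError(f"expected one function call, got {len(blocks)}")
    return blocks[0]


def corrective_message(error: ToolFormatError) -> str:
    """Build the text sent back to the model so the next turn can recover."""
    return (
        f"Your last reply was rejected: {error}. "
        "Reply with exactly one <function=...> block."
    )
```

One possible recovery loop catches ToolFormatError, appends corrective_message(error) as the next user turn, and retries a bounded number of times.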
Counterintuitive takeaway
Constraining the model’s output format more strictly (one call, XML, no parallel) reduces total cost vs the more flexible native tool_use that allows multi-call. Less flexibility, lower cost.
Sources
- strix/02_agent_and_llm.md:32? unverified