Streaming early stop on </function>

Strix’s tool format is XML — <function=...><parameter=...>...</parameter></function>. The system prompt is explicit: one function call per assistant turn, no exceptions.

This makes a free cost optimization possible. Once the streaming parser sees </function>, the model has finished saying everything you care about. Anything after is hallucinated rambling — and you’re being billed for it.

async for chunk in stream:
    buf += chunk
    if '</function>' in buf:
        call = parse_xml_function(buf)
        await stream.aclose()  # ← bail
        return call

That’s it. One line of cancellation drops 10-20% of per-step output tokens (Strix’s reported figure).

Why this isn’t possible with native tool_use

Anthropic’s tool_use blocks have an explicit content_block_stop event for each block, but the stream continues until message_stop to deliver final usage statistics and stop_reason. You can theoretically abort the stream after the tool_use block stops, but the stream architecture is built around the model deciding when it’s done. Strix’s win comes from forcing the contract via prompt design — the agent guarantees one call, so the tool boundary is also the conversation boundary.

Why it works in practice

Models are highly compliant when the system prompt is unambiguous about output format.
The XML parser is simple to write defensively.
Errors (model emits two function blocks anyway) are loud — parse fails — and recoverable.

Counterintuitive takeaway

Constraining the model’s output format more strictly (one call, XML, no parallel) reduces total cost vs the more flexible native tool_use that allows multi-call. Less flexibility, lower cost.

Sources

strix/02_agent_and_llm.md:32 ? unverified