← Insights

Streaming early stop on </function>

A free 10-20% cost reduction per agent step. Compounds across hundreds of steps in a session.

Strix difficulty 1/3 coststreamingnovel streaming-toolstool-calling-formatsagent-loop

Strix’s tool format is XML — <function=...><parameter=...>...</parameter></function>. The system prompt is explicit: one function call per assistant turn, no exceptions.

This makes a free cost optimization possible. Once the streaming parser sees </function>, the model has finished saying everything you care about. Anything after is hallucinated rambling — and you’re being billed for it.

async for chunk in stream:
    buf += chunk
    if '</function>' in buf:
        call = parse_xml_function(buf)
        await stream.aclose()  # ← bail
        return call

That’s it. One line of cancellation drops 10-20% of per-step output tokens (Strix’s reported figure).

Why this isn’t possible with native tool_use

Anthropic’s tool_use blocks have an explicit content_block_stop event for each block, but the stream continues until message_stop to deliver final usage statistics and stop_reason. You can theoretically abort the stream after the tool_use block stops, but the stream architecture is built around the model deciding when it’s done. Strix’s win comes from forcing the contract via prompt design — the agent guarantees one call, so the tool boundary is also the conversation boundary.

Why it works in practice

  • Models are highly compliant when the system prompt is unambiguous about output format.
  • The XML parser is simple to write defensively.
  • Errors (model emits two function blocks anyway) are loud — parse fails — and recoverable.

Counterintuitive takeaway

Constraining the model’s output format more strictly (one call, XML, no parallel) reduces total cost vs the more flexible native tool_use that allows multi-call. Less flexibility, lower cost.

Sources

  • strix/02_agent_and_llm.md:32 ? unverified