Streaming earns three things at once: token-level UI, the ability to dispatch tools while args are still arriving, and (with the right format constraint) the option to stop reading once you have what you need. The lazy version (buffer the whole response, then parse) costs you all three. Strix’s XML format earns a free 10–20% per-step cost win simply by aborting the stream after `</function>`. Worth doing.
Streaming tool calls
The model is mid-response. You can either wait for it to finish before reacting, or you can react as it goes. The second path is more code but earns three distinct wins, each significant.
```mermaid
sequenceDiagram
    participant L as LLM
    participant P as Parser
    participant U as UI
    participant T as Tool dispatch
    L->>P: chunk · "I'll read the file."
    P->>U: render text
    L->>P: chunk · tool_use start
    P->>U: show "running tool..."
    L->>P: chunk · args partial
    L->>P: chunk · args complete
    P->>T: dispatch (don't wait for end-of-stream)
    L->>P: chunk · stop_reason
    Note over P: (with XML format: abort here)
```
Anatomy by format
Anthropic native tool_use
Anthropic streams tool calls as a sequence of events: `content_block_start` → `input_json_delta` chunks → `content_block_stop`. Args arrive as a stream of JSON deltas:
```ts
let buf = '';
for await (const ev of stream) {
  if (ev.type === 'input_json_delta') {
    buf += ev.partial_json;
    const parsed = tryPartialJSON(buf);
    if (parsed?.complete) {
      dispatch(toolName, parsed.value); // start the tool while the stream finishes
    }
  }
}
```
The trick: by the time complete args have been emitted, the model is done deciding. The remainder of the stream is `stop_reason` and usage stats. You can start the tool while the connection is closing — typically 100–300ms of perceived latency saved.
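`tryPartialJSON` is left undefined above. A minimal sketch, under the assumption that "complete" simply means the whole buffer parses as JSON (the helper name and return shape are taken from the loop above, not from any real library):

```typescript
// Hypothetical helper for the loop above: reports whether the buffered
// deltas form complete JSON yet, and if so, the parsed value.
function tryPartialJSON(buf: string): { complete: boolean; value?: unknown } {
  try {
    // Complete iff the whole buffer parses; truncated JSON throws.
    return { complete: true, value: JSON.parse(buf) };
  } catch {
    return { complete: false };
  }
}
```

Libraries like partial-json go further and return best-effort values for still-truncated buffers; this sketch only detects completion.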
OpenAI fn-calling
The `arguments` field is a JSON string, streamed as concatenated deltas. You need a forgiving parser. A simple bracket-counter works for shallow structures:
```ts
function maybeParse(buf: string) {
  let depth = 0;
  for (const c of buf) {
    if (c === '{') depth++;
    else if (c === '}') depth--;
  }
  if (depth === 0 && buf.length) {
    try { return JSON.parse(buf); } catch { return null; }
  }
  return null;
}
```
For nested or string-aware parsing, reach for `partial-json` or `json5`.
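The bracket counter above miscounts braces that appear inside JSON strings: a complete buffer like `{"code": "if (x) {"}` ends at depth 1 and is wrongly rejected. A string-aware variant, still a sketch rather than a full incremental parser:

```typescript
// Bracket counter that skips braces inside JSON strings (and handles
// backslash escapes), so string content can't skew the depth count.
function maybeParseStrict(buf: string): unknown {
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (const c of buf) {
    if (inString) {
      if (escaped) escaped = false;        // char after backslash: consumed
      else if (c === '\\') escaped = true; // start of an escape sequence
      else if (c === '"') inString = false;
    } else if (c === '"') inString = true;
    else if (c === '{') depth++;
    else if (c === '}') depth--;
  }
  if (depth === 0 && buf.length) {
    try { return JSON.parse(buf); } catch { return null; }
  }
  return null;
}
```

`JSON.parse` still has the final say; the counter only decides when it's worth attempting.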
XML — the cleverest pattern
Strix’s system prompt forbids more than one `<function>` block per turn. So the moment the parser sees `</function>`, the call is complete and the rest of the stream is garbage:
```python
buf = ''
async with stream:
    async for chunk in stream:
        buf += chunk
        if '</function>' in buf:
            call = parse_xml_function(buf)
            await stream.aclose()  # bail; the rest of the stream is garbage
            return call
```
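`parse_xml_function` is left undefined above. One possible TypeScript sketch, assuming a hypothetical tag shape of a single `<function name="...">` block with `<param name="...">` children (Strix's real tag format may differ; only the single-call parse is the point):

```typescript
// Parse one <function name="..."> block out of a buffer that may contain
// surrounding prose. Tag shape here is an assumption for illustration.
function parseXmlFunction(
  buf: string,
): { name: string; args: Record<string, string> } | null {
  const fn = buf.match(/<function\s+name="([^"]+)">([\s\S]*?)<\/function>/);
  if (!fn) return null;
  const args: Record<string, string> = {};
  const paramRe = /<param\s+name="([^"]+)">([\s\S]*?)<\/param>/g;
  let m: RegExpExecArray | null;
  while ((m = paramRe.exec(fn[2])) !== null) {
    args[m[1]] = m[2];
  }
  return { name: fn[1], args };
}
```

Because the prompt guarantees at most one block, a non-greedy match to the first `</function>` is safe; anything after it is never read.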
The counter-intuitive lesson: constraining the model’s output format more strictly (one call, XML, no parallel) reduces cost compared to flexible formats. Less freedom, lower bill.
Why some teams don’t stream
There are real reasons:
- Server-side agents with no UI. No human watches; streaming earns less.
- Linear pipelines. “Extract → summarize → file” doesn’t benefit.
- Pre-dispatch safety scan. Some teams want to inspect the full response for guardrail violations before running anything. Streaming + dispatch makes that harder.
- Provider quirk. Some adapters silently fall back to non-streaming, e.g. for endpoints or models that don't support streamed tool calls.
If none of those apply, stream.
Common gotchas
| Issue | What happens | Fix |
|---|---|---|
| Dispatch races with stream-completion error | tool started, then stream errors out | cancellation token; gate dispatch on stream-success |
| Partial-JSON parser is too permissive | accepts `{"path": "/tm` as `{"path": "/tm"}` | re-validate args after stream end before trusting them |
| UI renders intermediate state | shows “running tool…” before tool actually runs | be deliberate about when to render vs when to dispatch |
| Network interruption mid-args | dispatch never fires | treat partial parses as tentative until stream success |
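The first and last rows share a fix: dispatch eagerly, but tie the tool run to the stream's fate so a failed stream cancels it. A sketch using `AbortController` (function and parameter names are hypothetical):

```typescript
// Dispatch the tool as soon as args parse, but hand it an AbortSignal
// tied to the stream. If the stream errors after dispatch, the in-flight
// tool run is told to abort.
async function streamWithEagerDispatch(
  chunks: AsyncIterable<string>,
  runTool: (args: unknown, signal: AbortSignal) => Promise<void>,
): Promise<void> {
  const ctrl = new AbortController();
  let buf = '';
  let dispatched: Promise<void> | null = null;
  try {
    for await (const chunk of chunks) {
      buf += chunk;
      if (!dispatched) {
        try {
          dispatched = runTool(JSON.parse(buf), ctrl.signal); // eager start
        } catch {
          // args not complete yet; keep buffering
        }
      }
    }
    await dispatched; // stream ended cleanly: the dispatch stands
  } catch (err) {
    ctrl.abort(); // stream failed: cancel the tentative tool run
    throw err;
  }
}
```

The tool implementation decides how to honor the signal (kill a subprocess, abort a fetch); the loop only guarantees it fires on stream failure.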
Pick a streaming strategy
- **Anthropic native `tool_use`**: stream + partial-JSON dispatch on `input_json_delta`. *(default)*
- **OpenAI fn-calling**: stream + bracket counter or a partial-json library.
- **XML format**: stream + abort on `</function>`. Free cost win. *(use if you can)*
- **Server-side, no UI, no rush**: buffer-then-parse. Simpler; skip streaming.
Recommended default: if a human watches, stream. If not, buffer-then-parse is fine and you save the wiring.
Cost / latency improvements observed
| Project | Pattern | Reported / observed savings |
|---|---|---|
| Strix | XML early-stop | 10–20% per-step output tokens |
| Claude Code | partial-JSON dispatch | ~150ms perceived latency |
| Mistral Vibe | partial-JSON dispatch | minimal (small reference impl) |
| OpenHands | none (buffer-then-parse) | n/a |
Projects that implement this
- Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
- Strix — Open-source 'AI hacker' for autonomous pentesting. XML tool format, markdown-as-skills, LLM-based dedupe, module-level agent graph.
- Mistral Vibe — Mistral-flavored coding agent reference. Middleware-based dispatch, minimal tool set, instructive for understanding agent loop fundamentals.
- NanoClaw — Tiny Claude-Code-shaped clone. Excellent for studying the irreducible structure of an agent loop without production overhead.
- OpenClaw — Open-source Claude-Code-style agent reproduction. Bigger than NanoClaw, reveals which patterns scale and which stay minimal.
- Kimi Code — Moonshot's Kimi-flavored coding agent. Compact reference for an agent loop with OpenAI-compatible tool calling.
Related insights
A free 10–20% cost reduction per agent step. Compounds across hundreds of steps in a session.