Streaming earns three things at once: token-level UI, the ability to dispatch tools while args are still arriving, and (with the right format constraint) the option to stop reading once you have what you need. The lazy version (buffer the whole response, then parse) costs you all three. Strix’s XML format earns a free 10–20% per-step cost win simply by aborting the stream after `</function>`. Worth doing.
Streaming tool calls
The model is mid-response. You can either wait for it to finish before reacting, or you can react as it goes. The second path is more code but earns three distinct wins, each significant.
```mermaid
sequenceDiagram
    participant L as LLM
    participant P as Parser
    participant U as UI
    participant T as Tool dispatch
    L->>P: chunk · "I'll read the file."
    P->>U: render text
    L->>P: chunk · tool_use start
    P->>U: show "running tool..."
    L->>P: chunk · args partial
    L->>P: chunk · args complete
    P->>T: dispatch (don't wait for end-of-stream)
    L->>P: chunk · stop_reason
    Note over P: (with XML format: abort here)
```
Anatomy by format
Anthropic native tool_use
Anthropic streams tool calls as a sequence of events: `content_block_start` → `input_json_delta` chunks → `content_block_stop`. Args arrive as a stream of JSON deltas:
```ts
let buf = '';
for await (const ev of stream) {
  if (ev.type === 'input_json_delta') {
    buf += ev.partial_json;
    const parsed = tryPartialJSON(buf);
    if (parsed?.complete) {
      dispatch(toolName, parsed.value); // start the tool while the stream finishes
    }
  }
}
```
The trick: by the time complete args have been emitted, the model is done deciding. The remainder of the stream is `stop_reason` and usage stats. You can start the tool while the connection is closing — typically 100–300ms of perceived latency saved.
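`tryPartialJSON` is left undefined above. A minimal sketch, under the assumption that "complete" simply means the whole buffer parses as JSON (the helper name and return shape are taken from the loop above, not from any real library):

```typescript
// Hypothetical helper for the loop above: reports whether the buffered
// deltas form complete JSON yet, and if so, the parsed value.
function tryPartialJSON(buf: string): { complete: boolean; value?: unknown } {
  try {
    // Complete iff the whole buffer parses; truncated JSON throws.
    return { complete: true, value: JSON.parse(buf) };
  } catch {
    return { complete: false };
  }
}
```

Libraries like partial-json go further and return best-effort values for still-truncated buffers; this sketch only detects completion.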
OpenAI fn-calling
The `arguments` field is a JSON string, streamed as concatenated deltas. You need a forgiving parser. A simple bracket-counter works for shallow structures:
```ts
function maybeParse(buf: string) {
  let depth = 0;
  for (const c of buf) {
    if (c === '{') depth++;
    else if (c === '}') depth--;
  }
  if (depth === 0 && buf.length) {
    try { return JSON.parse(buf); } catch { return null; }
  }
  return null;
}
```
For nested or string-aware parsing, reach for `partial-json` or `json5`.
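The bracket counter above miscounts braces that appear inside JSON strings: a complete buffer like `{"code": "if (x) {"}` ends at depth 1 and is wrongly rejected. A string-aware variant, still a sketch rather than a full incremental parser:

```typescript
// Bracket counter that skips braces inside JSON strings (and handles
// backslash escapes), so string content can't skew the depth count.
function maybeParseStrict(buf: string): unknown {
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (const c of buf) {
    if (inString) {
      if (escaped) escaped = false;        // char after backslash: consumed
      else if (c === '\\') escaped = true; // start of an escape sequence
      else if (c === '"') inString = false;
    } else if (c === '"') inString = true;
    else if (c === '{') depth++;
    else if (c === '}') depth--;
  }
  if (depth === 0 && buf.length) {
    try { return JSON.parse(buf); } catch { return null; }
  }
  return null;
}
```

`JSON.parse` still has the final say; the counter only decides when it's worth attempting.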
XML — the cleverest pattern
Strix’s system prompt forbids more than one `<function>` block per turn. So the moment the parser sees `</function>`, the call is complete and the rest of the stream is garbage:
```python
buf = ''
async with stream:
    async for chunk in stream:
        buf += chunk
        if '</function>' in buf:
            call = parse_xml_function(buf)
            await stream.aclose()  # bail; the rest of the stream is garbage
            return call
```
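`parse_xml_function` is left undefined above. One possible TypeScript sketch, assuming a hypothetical tag shape of a single `<function name="...">` block with `<param name="...">` children (Strix's real tag format may differ; only the single-call parse is the point):

```typescript
// Parse one <function name="..."> block out of a buffer that may contain
// surrounding prose. Tag shape here is an assumption for illustration.
function parseXmlFunction(
  buf: string,
): { name: string; args: Record<string, string> } | null {
  const fn = buf.match(/<function\s+name="([^"]+)">([\s\S]*?)<\/function>/);
  if (!fn) return null;
  const args: Record<string, string> = {};
  const paramRe = /<param\s+name="([^"]+)">([\s\S]*?)<\/param>/g;
  let m: RegExpExecArray | null;
  while ((m = paramRe.exec(fn[2])) !== null) {
    args[m[1]] = m[2];
  }
  return { name: fn[1], args };
}
```

Because the prompt guarantees at most one block, a non-greedy match to the first `</function>` is safe; anything after it is never read.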
The counter-intuitive lesson: constraining the model’s output format more strictly (one call, XML, no parallel) reduces cost compared to flexible formats. Less freedom, lower bill.
Why some teams don’t stream
There are real reasons:
- Server-side agents with no UI. No human watches; streaming earns less.
- Linear pipelines. “Extract → summarize → file” doesn’t benefit.
- Pre-dispatch safety scan. Some teams want to inspect the full response for guardrail violations before running anything. Streaming + dispatch makes that harder.
- Provider quirk. Some adapters silently fall back to non-streaming, e.g. for endpoints or models that don't support streamed tool calls.
If none of those apply, stream.
Common gotchas
| Issue | What happens | Fix |
|---|---|---|
| Dispatch races with stream-completion error | tool started, then stream errors out | cancellation token; gate dispatch on stream-success |
| Partial-JSON parser is too permissive | accepts `{"path": "/tm` as `{"path": "/tm"}` | re-validate args after stream end before trusting them |
| UI renders intermediate state | shows “running tool…” before tool actually runs | be deliberate about when to render vs when to dispatch |
| Network interruption mid-args | dispatch never fires | treat partial parses as tentative until stream success |
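The first and last rows share a fix: dispatch eagerly, but tie the tool run to the stream's fate so a failed stream cancels it. A sketch using `AbortController` (function and parameter names are hypothetical):

```typescript
// Dispatch the tool as soon as args parse, but hand it an AbortSignal
// tied to the stream. If the stream errors after dispatch, the in-flight
// tool run is told to abort.
async function streamWithEagerDispatch(
  chunks: AsyncIterable<string>,
  runTool: (args: unknown, signal: AbortSignal) => Promise<void>,
): Promise<void> {
  const ctrl = new AbortController();
  let buf = '';
  let dispatched: Promise<void> | null = null;
  try {
    for await (const chunk of chunks) {
      buf += chunk;
      if (!dispatched) {
        try {
          dispatched = runTool(JSON.parse(buf), ctrl.signal); // eager start
        } catch {
          // args not complete yet; keep buffering
        }
      }
    }
    await dispatched; // stream ended cleanly: the dispatch stands
  } catch (err) {
    ctrl.abort(); // stream failed: cancel the tentative tool run
    throw err;
  }
}
```

The tool implementation decides how to honor the signal (kill a subprocess, abort a fetch); the loop only guarantees it fires on stream failure.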
Pick a streaming strategy
- **Anthropic native `tool_use`**: stream + partial-JSON dispatch on `input_json_delta`. *(default)*
- **OpenAI fn-calling**: stream + bracket counter or a partial-json library.
- **XML format**: stream + abort on `</function>`. Free cost win. *(use if you can)*
- **Server-side, no UI, no rush**: buffer-then-parse. Simpler; skip streaming.
Recommended default: if a human watches, stream. If not, buffer-then-parse is fine and you save the wiring.
Cost / latency improvements observed
| Project | Pattern | Reported / observed savings |
|---|---|---|
| Strix | XML early-stop | 10–20% per-step output tokens |
| Claude Code | partial-JSON dispatch | ~150ms perceived latency |
| Mistral Vibe | partial-JSON dispatch | minimal (small reference impl) |
| OpenHands | none (buffer-then-parse) | n/a |
Projects that implement this
- Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
- Strix — Open-source 'AI hacker' for autonomous pentesting. XML tool format, markdown-as-skills, LLM-based dedupe, module-level agent graph.
- Mistral Vibe — Mistral-flavored coding agent reference. Middleware-based dispatch, minimal tool set, instructive for understanding agent loop fundamentals.
- NanoClaw — Tiny Claude-Code-shaped clone. Excellent for studying the irreducible structure of an agent loop without production overhead.
- OpenClaw — Open-source Claude-Code-style agent reproduction. Bigger than NanoClaw, reveals which patterns scale and which stay minimal.
- Kimi Code — Moonshot's Kimi-flavored coding agent. Compact reference for an agent loop with OpenAI-compatible tool calling.
Related insights
A free 10–20% cost reduction per agent step. Compounds across hundreds of steps in a session.