
Streaming tool calls

Don't wait for the full response. Parse tool calls as they stream and dispatch the moment you have enough — sometimes earlier.

TL;DR

Streaming earns three things at once: token-level UI, the ability to dispatch tools while args are still arriving, and (with the right format constraint) the option to stop reading once you have what you need. The lazy version (buffer the whole response, then parse) costs you all three. Strix’s XML format earns a free 10–20% per-step cost win simply by aborting the stream after </function>. Worth doing.


The model is mid-response. You can either wait for it to finish before reacting, or you can react as it goes. The second path is more code but earns three distinct wins, each significant.

sequenceDiagram
participant L as LLM
participant P as Parser
participant U as UI
participant T as Tool dispatch
L->>P: chunk · "I'll read the file."
P->>U: render text
L->>P: chunk · tool_use start
P->>U: show "running tool..."
L->>P: chunk · args partial
L->>P: chunk · args complete
P->>T: dispatch (don't wait for end-of-stream)
L->>P: chunk · stop_reason
Note over P: (with XML format: abort here)
Streaming earns UI feedback, early dispatch, and (XML only) early stop.

Anatomy by format

Anthropic native tool_use

Anthropic streams tool calls as a sequence of events: content_block_start → input_json_delta chunks → content_block_stop. Args arrive as a stream of JSON deltas:

let buf = '';
let dispatched = false;
for await (const ev of stream) {
  if (ev.type === 'input_json_delta') {
    buf += ev.partial_json;
    const parsed = tryPartialJSON(buf); // lenient parser; reports completeness
    if (parsed?.complete && !dispatched) {
      dispatched = true; // guard against re-dispatch on later deltas
      dispatch(toolName, parsed.value); // start the tool while the stream finishes
    }
  }
}

The trick: once complete args have been emitted, the model is done deciding. The remainder of the stream is just stop_reason and usage stats, so you can start the tool while the connection winds down, typically saving 100–300ms of perceived latency.
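The full event handling can be sketched as below. This is a simplified model, not the SDK's real surface: the actual Anthropic SDK nests JSON deltas under content_block_delta events, and the event dicts here are flattened for illustration.

```python
import json


def handle_stream(events, dispatch):
    """Consume simplified Anthropic-style stream events and dispatch a
    tool call as soon as its accumulated args parse as complete JSON.

    Event shapes are flattened stand-ins for the SDK's real events.
    """
    tool_name = None
    buf = ""
    dispatched = False
    for ev in events:
        if ev["type"] == "content_block_start" and ev["block"]["type"] == "tool_use":
            tool_name = ev["block"]["name"]  # the tool name arrives up front
            buf = ""
        elif ev["type"] == "input_json_delta":
            buf += ev["partial_json"]
            if not dispatched:
                try:
                    args = json.loads(buf)  # parses => args are complete
                except json.JSONDecodeError:
                    continue  # still partial; keep accumulating
                dispatch(tool_name, args)  # don't wait for end-of-stream
                dispatched = True
        # stop_reason / usage events arrive after dispatch has already fired
    return dispatched
```

The point the sketch makes concrete: dispatch happens inside the loop, on the delta that completes the JSON, not after the loop ends.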

OpenAI fn-calling

The arguments field is a JSON string, streamed as concatenated deltas. You need a forgiving parser. A simple bracket-counter works for shallow structures:

function maybeParse(buf: string) {
  let depth = 0;
  for (const c of buf) {
    if (c === '{') depth++;
    else if (c === '}') depth--;
  }
  // NB: not string-aware; a '}' inside a string value will miscount.
  if (depth === 0 && buf.length) {
    try { return JSON.parse(buf); } catch { return null; }
  }
  return null;
}

For nested or string-aware parsing, reach for partial-json or json5.
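If you'd rather not take a dependency, string-awareness is only slightly more code. A minimal sketch in Python (the function name is illustrative): track whether the cursor is inside a string literal, and ignore braces while it is.

```python
import json


def maybe_parse(buf: str):
    """Return parsed args once every brace is balanced, else None.

    Tracks string/escape state so braces inside JSON string values
    don't confuse the depth counter.
    """
    depth = 0
    in_string = False
    escaped = False
    for c in buf:
        if escaped:
            escaped = False          # char after a backslash is literal
        elif c == "\\" and in_string:
            escaped = True
        elif c == '"':
            in_string = not in_string
        elif not in_string:
            if c == "{":
                depth += 1
            elif c == "}":
                depth -= 1
    if depth == 0 and buf and not in_string:
        try:
            return json.loads(buf)
        except json.JSONDecodeError:
            return None
    return None
```

The final json.loads is still the source of truth; the counter only decides when it's worth attempting.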

XML — the cleverest pattern

Strix’s system prompt forbids more than one <function> block per turn. So the moment the parser sees </function>, the call is complete and the rest of the stream is garbage:

buf = ''
async with stream:
    async for chunk in stream:
        buf += chunk
        if '</function>' in buf:
            call = parse_xml_function(buf)
            await stream.aclose()  # bail; the rest of the stream is unused tokens
            return call
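What parse_xml_function does is Strix-specific, but the shape is easy to sketch. The block below assumes a hypothetical format like <function name="read_file"><parameter name="path">/tmp/x</parameter></function>, which is not Strix's actual schema:

```python
import re
import xml.etree.ElementTree as ET


def parse_xml_function(buf: str):
    """Extract the single <function> block from the buffered stream text
    and return (tool_name, args). The format is a hypothetical stand-in;
    real parsers must also handle unescaped XML-unsafe characters."""
    match = re.search(r"<function\b.*?</function>", buf, re.DOTALL)
    if match is None:
        return None  # prose before the call, or no call at all
    root = ET.fromstring(match.group(0))
    args = {p.get("name"): (p.text or "") for p in root.findall("parameter")}
    return root.get("name"), args
```

Because the prompt guarantees exactly one block, a non-greedy search for the first closing tag is sufficient; anything after it is ignored by construction.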

The counter-intuitive lesson: constraining the model’s output format more strictly (one call, XML, no parallel) reduces cost compared to flexible formats. Less freedom, lower bill.

Why some teams don’t stream

There are real reasons:

  • Server-side agents with no UI. No human watches; streaming earns less.
  • Linear pipelines. “Extract → summarize → file” doesn’t benefit.
  • Pre-dispatch safety scan. Some teams want to inspect the full response for guardrail violations before running anything. Streaming + dispatch makes that harder.
  • Provider quirk. Some adapters fall back to non-streaming under certain conditions, so you can't assume every request actually streams.

If none of those apply, stream.

Common gotchas

| Issue | What happens | Fix |
| --- | --- | --- |
| Dispatch races with stream-completion error | Tool started, then the stream errors out | Cancellation token; gate dispatch on stream success |
| Partial-JSON parser is too permissive | Accepts {"path": "/tm as {"path": "/tm"} | Re-validate args after stream end before trusting them |
| UI renders intermediate state | Shows "running tool…" before the tool actually runs | Be deliberate about when to render vs. when to dispatch |
| Network interruption mid-args | Dispatch never fires | Treat partial parses as tentative until stream success |
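The first and last fixes compose naturally: dispatch eagerly, but treat the tool run as tentative until the stream closes cleanly. A sketch with asyncio (names are illustrative; stream_done stands in for whatever signals end-of-stream in your client):

```python
import asyncio


async def dispatch_gated(stream_done, run_tool, name, args):
    """Start the tool eagerly, but only trust it if the stream ends cleanly.

    stream_done: awaitable that resolves on clean end-of-stream and
    raises if the stream errors out mid-flight.
    """
    task = asyncio.ensure_future(run_tool(name, args))  # eager dispatch
    try:
        await stream_done
    except Exception:
        # Stream failed after we dispatched: cancel the premature tool run.
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
        raise
    return await task  # stream succeeded, so the early dispatch was legitimate
```

This is the cancellation-token pattern from the table in its simplest asyncio form: the tool task races the stream, and stream failure revokes the dispatch.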

Pick a streaming strategy

Which format are you wiring up?

  • Anthropic native tool_use: stream + partial-JSON dispatch on input_json_delta (the default).
  • OpenAI fn-calling: stream + bracket counter, or a partial-json library.
  • XML format: stream + abort on </function>; a free cost win if you can adopt the format.
  • Server-side, no UI, no rush: buffer-then-parse. Simpler; skip streaming.

Recommended default: if a human is watching, stream. If not, buffer-then-parse is fine and you save the wiring.

Cost / latency improvements observed

| Project | Pattern | Reported / observed savings |
| --- | --- | --- |
| Strix | XML early-stop | 10–20% per-step output tokens |
| Claude Code | partial-JSON dispatch | ~150ms perceived latency |
| Mistral Vibe | partial-JSON dispatch | minimal (small reference impl) |
| OpenHands | none (buffer-then-parse) | n/a |

Projects that implement this

  • Claude Code — Anthropic's official agentic CLI. Streaming tool calls, prompt caching, thinking signatures, multi-agent subagents, slash commands.
  • Strix — Open-source 'AI hacker' for autonomous pentesting. XML tool format, markdown-as-skills, LLM-based dedupe, module-level agent graph.
  • Mistral Vibe — Mistral-flavored coding agent reference. Middleware-based dispatch, minimal tool set, instructive for understanding agent loop fundamentals.
  • NanoClaw — Tiny Claude-Code-shaped clone. Excellent for studying the irreducible structure of an agent loop without production overhead.
  • OpenClaw — Open-source Claude-Code-style agent reproduction. Bigger than NanoClaw, reveals which patterns scale and which stay minimal.
  • Kimi Code — Moonshot's Kimi-flavored coding agent. Compact reference for an agent loop with OpenAI-compatible tool calling.