When multiple agents are alive at once and need to coordinate, how they communicate is more constrained than it looks. Three patterns exist: tool-result return (children just return to parents, with no mailbox at all), module-level dicts (Strix’s elegant in-process solution), and external queues (Redis/SQS/Kafka). Most teams reach for the third when the first or second is correct. Build the simplest mailbox that works, which is often no mailbox at all.
Mailbox & message-passing
Multi-agent systems make people reach for distributed-systems infrastructure on day one. That instinct is usually wrong.
The default: tool-result is the only message
Most multi-agent systems in the corpus don’t have peer-to-peer messaging. Children don’t message siblings. They return to the parent. The parent decides whether to invoke another child with the result.
This works because the parent’s loop is already doing dispatch — a child is just an unusually long-running tool. Nothing new to wire up. No queues, no race conditions, no order-of-delivery questions.
Most teams should use this pattern until they’re forced into something fancier.
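To make the "child is just a long-running tool" idea concrete, here is a minimal sketch. All names (Parent, call_child, recon, exploit) are hypothetical, and each child is modeled as a plain function rather than a real agent loop.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical modeling choice: a child agent is a function from a task
# string to a result string. A real child would run its own LLM loop.
ChildAgent = Callable[[str], str]


@dataclass
class Parent:
    children: dict[str, ChildAgent]
    transcript: list[str] = field(default_factory=list)

    def call_child(self, name: str, task: str) -> str:
        # Invoking a child looks exactly like invoking a tool: call and wait.
        result = self.children[name](task)
        # The return value is the only message the child ever sends.
        self.transcript.append(f"{name}: {result}")
        return result


def recon(task: str) -> str:
    return f"open ports for {task}: 22, 443"


def exploit(task: str) -> str:
    return f"plan based on [{task}]"


parent = Parent(children={"recon": recon, "exploit": exploit})
found = parent.call_child("recon", "example.test")
# Siblings never talk directly; the parent forwards what it chooses to.
plan = parent.call_child("exploit", found)
print(plan)
```

The parent is the only place where routing decisions live, which is why there is nothing to order, retry, or deduplicate.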
When you actually need a mailbox
You need a mailbox only when two agents run concurrently and have to share state mid-flight. Typical cases:
- A long-running monitor agent reporting to a coordinator.
- Worker agents claiming work from a shared queue.
- Producer/consumer between specialists.
Strix uses this for inter-agent reporting during long pentests: a sub-agent that finds a credential while testing one endpoint posts it to other agents that might benefit.
Strix’s module-level approach
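    class BaseAgent:
        # Class attribute: one dict shared by every agent instance,
        # mapping an agent id to its list of pending messages.
        # (Msg and self.id are defined elsewhere in the agent code.)
        _agent_messages: dict[str, list[Msg]] = {}

        def send_to(self, target_id: str, msg: Msg):
            BaseAgent._agent_messages.setdefault(target_id, []).append(msg)

        def receive(self) -> list[Msg]:
            # Drain and return everything addressed to this agent.
            return BaseAgent._agent_messages.pop(self.id, [])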
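Plain dict, no broker. In CPython the GIL makes each individual dict or list operation atomic, so setdefault(), append(), and pop(key, []) never corrupt the structure. The chained setdefault().append() is still two operations, though: with preemptive threads, a receive() can run between them, and the message is appended to a list the receiver has already taken, where it may be missed. That cannot happen when all agents share one thread (for example, a single asyncio event loop), because nothing else runs between the two steps. If agents run in separate threads, guard both methods with a lock. Single-process only. Cleaned up when the process exits.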
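If agents do run in separate threads, the lookup-then-append in send_to is two steps rather than one, so a conservative variant holds a lock around both methods. The sketch below is illustrative and is not Strix's code; the Mailboxes class and the "coordinator" id are assumptions, and messages are plain strings to keep it short.

```python
import threading
from collections import defaultdict


class Mailboxes:
    """In-process mailboxes guarded by a single lock (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._boxes: dict[str, list[str]] = defaultdict(list)

    def send_to(self, target_id: str, msg: str) -> None:
        # Lookup-or-create and append both happen under the lock, so a
        # receiver cannot pop the list between the two steps.
        with self._lock:
            self._boxes[target_id].append(msg)

    def receive(self, agent_id: str) -> list[str]:
        # pop() with a default does not trigger the defaultdict factory.
        with self._lock:
            return self._boxes.pop(agent_id, [])


boxes = Mailboxes()
received: list[str] = []
stop = threading.Event()


def consumer() -> None:
    while not stop.is_set():
        received.extend(boxes.receive("coordinator"))
    received.extend(boxes.receive("coordinator"))  # final drain after stop


def producer(count: int) -> None:
    for i in range(count):
        boxes.send_to("coordinator", f"msg-{i}")


consumer_thread = threading.Thread(target=consumer)
producers = [threading.Thread(target=producer, args=(1000,)) for _ in range(4)]
consumer_thread.start()
for p in producers:
    p.start()
for p in producers:
    p.join()
stop.set()
consumer_thread.join()
print(len(received))  # 4 producers x 1000 messages, none lost
```

The lock costs almost nothing at this scale, and it removes the need to reason about which interleavings the interpreter happens to allow.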
Don’t reach for Redis early
External queues have real costs:
- Latency (every message is a round-trip to a separate service).
- Operational dependency (now you have a Redis to keep running).
- Debug complexity (state is in two places).
Reach for them when:
- Agents run on different machines.
- You need durable messages that survive a process crash.
- You want pub/sub across many subscribers.
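For comparison, here is roughly what the same send/receive pair looks like once the mailbox moves into a Redis list. This is a sketch under assumptions: it relies only on the `rpush` and `blpop` commands as exposed by the redis-py client, the `mailbox:<agent_id>` key scheme is hypothetical, and the `InMemoryClient` class is a stand-in added purely so the example runs without a server.

```python
import json
from typing import Any, Optional


def mailbox_key(agent_id: str) -> str:
    # Hypothetical key scheme; any stable per-agent key would do.
    return f"mailbox:{agent_id}"


def send_to(client: Any, target_id: str, payload: dict) -> None:
    # RPUSH appends to the tail of the target's list (one network round-trip).
    client.rpush(mailbox_key(target_id), json.dumps(payload))


def receive(client: Any, agent_id: str, timeout: int = 1) -> Optional[dict]:
    # BLPOP waits up to `timeout` seconds, then returns (key, value) or None.
    item = client.blpop(mailbox_key(agent_id), timeout=timeout)
    if item is None:
        return None
    _key, raw = item
    return json.loads(raw)


class InMemoryClient:
    """Stand-in with the same two methods, so the sketch runs without a server.

    Unlike a real BLPOP, it never waits: an empty list returns None at once.
    """

    def __init__(self) -> None:
        self._lists: dict[str, list[str]] = {}

    def rpush(self, key: str, value: str) -> None:
        self._lists.setdefault(key, []).append(value)

    def blpop(self, key: str, timeout: int = 0):
        items = self._lists.get(key)
        return (key, items.pop(0)) if items else None


client = InMemoryClient()  # swap for redis.Redis(...) against a real server
send_to(client, "coordinator", {"kind": "credential", "source": "agent-7"})
first = receive(client, "coordinator")
second = receive(client, "coordinator")  # mailbox is now empty
print(first, second)
```

Even in this small form the costs listed above show up: every call is a network hop, payloads must be serialized, and messages survive a restart only if Redis persistence is configured. A message removed by BLPOP is also gone if the consumer crashes before handling it.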
Pick a mailbox
- Children just return to parents: no mailbox. Use tool-result return. (default)
- Concurrent agents share state in one process: module-level dicts à la Strix.
- Agents on different machines: external queue (Redis / SQS).
- Need durable messages across crashes: external queue with persistence.
Recommended default: tool-result return. Move up the ladder only when forced.
Pitfalls
- Mailbox + retry + dedupe is a hard distributed-systems problem. Skip it if you can.
- Cycles in the agent graph that pass messages → infinite chatter → bankruptcy. Cap message count or hops.
- No back-pressure: a slow consumer falls behind and the queue grows without bound. Cap queue size, then drop or block.
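The last two pitfalls can be handled with a few lines each. The sketch below combines a hop counter (to break cycles) with a bounded queue.Queue (to get back-pressure). The names BoundedMailbox, Message, and forward, as well as the limits MAX_HOPS and MAX_QUEUE_SIZE, are assumptions chosen for illustration.

```python
import queue
from dataclasses import dataclass
from typing import Optional

MAX_HOPS = 3        # assumed limit; tune per system
MAX_QUEUE_SIZE = 2  # deliberately tiny so the demo overflows


@dataclass(frozen=True)
class Message:
    body: str
    hops: int = 0  # how many agents have relayed this message so far


def forward(msg: Message) -> Message:
    # Every relay increments the hop count, so a cycle eventually stops.
    return Message(body=msg.body, hops=msg.hops + 1)


class BoundedMailbox:
    def __init__(self, maxsize: int = MAX_QUEUE_SIZE) -> None:
        self._q: "queue.Queue[Message]" = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def post(self, msg: Message) -> bool:
        """Enqueue msg; return False if it was dropped."""
        if msg.hops >= MAX_HOPS:
            self.dropped += 1  # travelled too far: refuse, breaking the cycle
            return False
        try:
            # "Drop" policy. For a "block" policy, use put(msg, timeout=...).
            self._q.put_nowait(msg)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def take(self) -> Optional[Message]:
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


box = BoundedMailbox()
results = [box.post(Message(f"m{i}")) for i in range(3)]  # third one overflows

looping = Message("loop")
for _ in range(MAX_HOPS):
    looping = forward(looping)
looped = box.post(looping)  # rejected because hops has reached MAX_HOPS

print(results, looped, box.dropped)
```

Dropping keeps producers fast at the cost of lost messages; blocking preserves messages but lets a slow consumer stall its producers. Which trade-off is right depends on whether a missed message is cheaper than a stalled agent.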
Related insights
For one-process multi-agent coordination, plain Python dicts are usually the right answer: no Redis, no broker, and, as long as the agents share one thread or event loop, no locks.