The Problem: Naive Agent Communication Fails Silently
When we started building a multi-agent system, the first communication mechanism we reached for was a plain text file. The blog-writer agent would finish a draft and append a note to a shared outbox.md:
## 2026-03-02 14:30 UTC
Draft ready: 2026-03-02-agent-diaries-028.astro (1,200 words)
Publish when ready.
The main orchestrator agent would read this file at session start, find the note, publish the draft, and clear the file. Simple. It worked until we added the second writer.
The code-auditor agent also needed to send reports. Same pattern — append to outbox.md, main agent reads it. Except two agents writing to the same file creates a race condition. On a few occasions, one agent's write would truncate or partially overwrite the other's entry. The message appeared to be there, but was corrupted. The main agent read a malformed entry and silently skipped it.
The failure was invisible. No errors. No alerts. The draft just didn't get published. We didn't notice for two sessions.
This is the canonical problem with naive file-based agent communication: it fails silently. The fix requires three things: a structured message format, append-only writes, and processed-state tracking.
The Handoff Pattern: Structured Messages with Acknowledgment
Here is the message format we settled on after iterating through the failure:
{
  "id": "blog-writer-1709468234-x7k2",
  "sender": "blog-writer",
  "recipient": "main",
  "timestamp": "2026-03-03T08:15:00Z",
  "type": "draft_ready",
  "subject": "Agent Diaries #39 ready for review",
  "body": "Draft complete. 1,100 words. Topic: 89 posts, zero organic clicks.",
  "artifact": "agents/blog-writer/drafts/2026-03-03-agent-diaries-039.astro",
  "processed": false
}
Each field earns its place:
- id: unique per message. Used to mark as processed without ambiguity.
- sender/recipient: explicit routing. The recipient only reads messages addressed to it.
- type: determines processing logic. A draft_ready message triggers the publish pipeline. An alert triggers Telegram escalation.
- artifact: the file path this message refers to. The recipient can verify the artifact exists before processing.
- processed: acknowledgment flag. Once set to true, the message is never processed again.
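A fixed field list is only useful if something checks it. Here is a minimal sketch of such a check, assuming jq is installed and treating every field in the example above as required (the article does not say which fields are optional). The function name validate_message is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: reject a message that is missing any expected field.
# Assumptions: jq is installed; all nine fields are treated as required,
# which the article does not state explicitly; the function name is hypothetical.

validate_message() {
  # Reads one JSON object on stdin. jq -e exits non-zero when the final
  # expression evaluates to false, so the exit status doubles as the verdict.
  jq -e '
    (["id","sender","recipient","timestamp","type",
      "subject","body","artifact","processed"] - keys | length == 0)
    and (.processed | type == "boolean")
  ' > /dev/null
}
```

A sender could pipe the envelope through this check before appending it, and a receiver could use it to flag malformed entries instead of skipping them silently.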
Messages are appended to a JSON array file: agents/<recipient>/memory/outbox.json. Appending to a JSON array is a read-modify-write, not an atomic operation, but it is safe enough for single-machine use as long as writers never overlap. We avoid overlap by running agents on a time-staggered schedule.
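Because the append has to read the array, modify it, and write it back, a lock plus an atomic rename is one way to guard it if two writers ever do overlap. The following is a minimal sketch, assuming jq and flock(1) are available. The function name append_message and the .lock file naming are illustrative and not taken from the actual scripts.

```shell
#!/usr/bin/env bash
# Sketch: append one JSON message to a JSON-array file under an exclusive lock.
# Assumptions (not from the original scripts): jq and flock(1) are installed;
# the function name and lock-file naming are hypothetical.

append_message() {
  local outbox="$1" message="$2" tmp
  mkdir -p "$(dirname "$outbox")"
  (
    # Block until this process holds an exclusive lock on file descriptor 9.
    flock -x 9 || exit 1
    # Start from an empty array if the file is missing or empty.
    [ -s "$outbox" ] || echo '[]' > "$outbox"
    tmp="$(mktemp "${outbox}.XXXXXX")" || exit 1
    # Build the new array in a temp file, then rename it into place.
    # A rename within one filesystem is atomic, so readers never see a partial file.
    if jq --argjson msg "$message" '. + [$msg]' "$outbox" > "$tmp"; then
      mv "$tmp" "$outbox"
    else
      rm -f "$tmp"
      exit 1
    fi
  ) 9>"${outbox}.lock"
}
```

With this shape, an invalid message makes jq fail before the rename, so the existing file is left untouched.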
The Sending Side: Handoff-Send
We built a small script (handoff-send.sh) that constructs the JSON envelope and appends it safely:
bash scripts/skills/handoff-send.sh main draft_ready \
"Agent Diaries #39 ready" \
"Draft complete. 1,100 words. Topic: SEO strategy shift." \
"agents/blog-writer/drafts/2026-03-03-agent-diaries-039.astro"
The script validates that the recipient directory exists, generates a unique message ID, and appends the JSON object to the outbox. It also writes a human-readable entry to outbox.md for backward compatibility — older sessions could read the markdown version before the JSON format was fully adopted.
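The envelope-building step could look roughly like the sketch below. It assumes jq is installed, and it infers the ID layout (sender, epoch seconds, short random suffix) from the example ID shown earlier. The function names new_message_id and build_message are hypothetical, and this is not the actual handoff-send.sh.

```shell
#!/usr/bin/env bash
# Sketch of the envelope-building step. Assumptions: jq is installed; the ID
# layout (sender, epoch seconds, random suffix) is inferred from the example ID
# in the article; build_message and new_message_id are hypothetical names.

new_message_id() {
  local sender="$1" suffix
  # Four random lowercase alphanumerics read from /dev/urandom.
  suffix="$(LC_ALL=C tr -dc 'a-z0-9' < /dev/urandom | head -c 4)"
  printf '%s-%s-%s\n' "$sender" "$(date +%s)" "$suffix"
}

build_message() {
  local sender="$1" recipient="$2" type="$3" subject="$4" body="$5" artifact="$6"
  # jq --arg escapes quotes and newlines, so free-text fields cannot break the JSON.
  jq -n \
    --arg id "$(new_message_id "$sender")" \
    --arg sender "$sender" \
    --arg recipient "$recipient" \
    --arg timestamp "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --arg type "$type" \
    --arg subject "$subject" \
    --arg body "$body" \
    --arg artifact "$artifact" \
    '{id: $id, sender: $sender, recipient: $recipient, timestamp: $timestamp,
      type: $type, subject: $subject, body: $body, artifact: $artifact,
      processed: false}'
}
```

Building the object with jq rather than string concatenation is a deliberate choice here: a subject line containing a double quote would otherwise produce exactly the kind of malformed entry that gets skipped silently.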
The critical design choice: the sender does not wait for acknowledgment. It fires and forgets. The recipient processes the message asynchronously at its next session start. This decoupling is intentional — requiring synchronous acknowledgment would mean both agents need to be running simultaneously, which breaks the session-based model where each agent runs independently on a cron schedule.
The Receiving Side: Read, Process, Acknowledge
Every agent reads its inbox at session start using handoff-read.sh:
bash scripts/skills/handoff-read.sh main --unprocessed-only
This returns all messages where processed=false. The agent processes each one — publishes the draft, acts on the audit report, or escalates the alert — then calls:
bash scripts/skills/handoff-read.sh main --mark-processed MSG_ID
This sets processed=true on the specific message without touching other messages in the file. The next session's inbox read will skip it.
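The two read-side operations map naturally onto jq filters. This is a minimal sketch, assuming jq is installed and the file is a JSON array of message objects. The function names list_unprocessed and mark_processed are hypothetical and do not reflect the internals of handoff-read.sh.

```shell
#!/usr/bin/env bash
# Sketch of the two read-side operations. Assumptions: jq is installed and the
# inbox is a JSON array of message objects; function names are hypothetical.

list_unprocessed() {
  local inbox="$1" recipient="$2"
  # Emit one compact JSON object per line for unprocessed messages to this recipient.
  jq -c --arg r "$recipient" \
    '.[] | select(.recipient == $r and .processed == false)' "$inbox"
}

mark_processed() {
  local inbox="$1" msg_id="$2" tmp
  tmp="$(mktemp "${inbox}.XXXXXX")" || return 1
  # Flip the flag on the matching message only; every other entry passes through unchanged.
  if jq --arg id "$msg_id" \
       'map(if .id == $id then .processed = true else . end)' "$inbox" > "$tmp"; then
    mv "$tmp" "$inbox"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Writing to a temp file and renaming it means a crash in the middle of mark_processed leaves the previous version of the file intact.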
Two rules we enforce on the receiving side:
- Verify before accepting. Before processing a draft_ready message, check that the artifact file actually exists. A message pointing to a missing file should be flagged, not silently skipped. We built a verify-agent-output.sh script for this: it checks that the artifact path exists, has non-zero size, and (for blog drafts) passes our quality gate validator.
- Process before clearing inbox. We clear the inbox file only after processing every message. If processing fails mid-way, the unprocessed messages survive to the next session. This is more important than it sounds: if you clear the inbox first, a crash during processing loses the messages entirely.
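The existence and size checks from the first rule are short enough to sketch directly. The example below covers only those two checks and does not reproduce the quality gate validator. The function name verify_artifact and the distinct return codes are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch of an artifact check. Assumption: this mirrors only the existence and
# size checks described above; the quality gate validator is not reproduced.
# verify_artifact and its return codes are hypothetical.

verify_artifact() {
  local path="$1"
  if [ ! -f "$path" ]; then
    # Flag loudly on stderr instead of skipping silently.
    echo "FLAG: artifact missing: $path" >&2
    return 1
  fi
  if [ ! -s "$path" ]; then
    echo "FLAG: artifact is empty (0 bytes): $path" >&2
    return 2
  fi
  return 0
}
```

Separate return codes let the caller tell a message that points nowhere from a draft that was announced before it was fully written.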
Message Types and When to Use Each
After 134 sessions and 7 active agents, we converged on six message types that cover all inter-agent communication needs:
| Type | Sent by | Triggers |
|---|---|---|
| draft_ready | blog-writer | Quality review → publish pipeline |
| audit_report | code-auditor | Main agent runs /code-audit skill on flagged file |
| experiment_result | experimenter | Main agent updates experiment-log.md with outcome data |
| alert | Any agent | Telegram escalation to owner |
| directive | main | Subagent modifies its behavior or configuration |
| question | Any agent | Main agent escalates to owner or answers from memory |
The type vocabulary matters because it determines processing priority. alert messages get processed immediately (event-watcher fires synchronously). draft_ready messages are processed at the next main-agent session start. audit_report messages may wait for a lower-priority session when the agent has capacity. The type encodes the urgency, so routing logic can be written once and applied to all messages.
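Routing logic that is "written once" can be as small as a single case statement. In this sketch the tiers for alert, draft_ready, and audit_report follow the paragraph above. Placing the remaining three types in the next-session tier is an assumption, as are the tier names and the function name route_priority.

```shell
#!/usr/bin/env bash
# Sketch of type-based routing. The tiers for alert, draft_ready, and
# audit_report follow the article; putting the other three types in the
# "next-session" tier is an assumption, as are the tier names.

route_priority() {
  case "$1" in
    alert)
      echo "immediate" ;;
    audit_report)
      echo "deferred" ;;
    draft_ready|experiment_result|directive|question)
      echo "next-session" ;;
    *)
      # Unknown types fail loudly so a typo cannot be skipped silently.
      echo "unknown"
      return 1 ;;
  esac
}
```

Returning a non-zero status for an unrecognized type keeps a misspelled type from disappearing into a default branch.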
Event-Driven vs. Polling: When to Use Each
Our system uses both delivery mechanisms:
Polling — the main agent reads its full inbox at every session start. This guarantees that no message is missed, even if the event-watcher was down. Polling handles the steady-state case: blog-writer finishes a draft → message sits in outbox.json → main agent reads it at the next session (up to 30 minutes later). The latency is acceptable because drafts aren't time-sensitive.
Event-driven — we run an event-watcher.sh daemon that monitors outbox.json using inotify. When a new message arrives, the watcher inspects the type and fires a handler immediately. This is used for alert type messages only: a service down event shouldn't wait 30 minutes to be noticed. The handler sends a Telegram message and logs the alert.
The general principle: use polling for work that can wait. Use events for work that can't. Build both, because the event-driven path will go down occasionally and polling is your fallback.
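The event-driven path might be sketched as follows, assuming inotifywait (from inotify-tools) and jq are installed. notify_owner is a stand-in for the Telegram call, which is not shown here, and all function names are hypothetical rather than taken from event-watcher.sh.

```shell
#!/usr/bin/env bash
# Sketch of the event-driven path. Assumptions: inotifywait (inotify-tools) and
# jq are installed; notify_owner stands in for the real Telegram call, which is
# not shown in the article; all function names are hypothetical.

notify_owner() {
  # Placeholder: a real handler would send a Telegram message here.
  echo "ALERT: $1"
}

handle_alerts() {
  local outbox="$1"
  # Only unprocessed alert messages take the fast path; everything else waits for polling.
  jq -r '.[] | select(.type == "alert" and .processed == false) | .subject' "$outbox" |
    while IFS= read -r subject; do
      notify_owner "$subject"
    done
}

watch_outbox() {
  local outbox="$1" dir file
  dir="$(dirname "$outbox")"
  file="$(basename "$outbox")"
  # Watch the directory rather than the file: a writer that replaces the file
  # via rename would otherwise leave the watch attached to the old inode.
  inotifywait -m -q -e close_write -e moved_to --format '%f' "$dir" |
    while IFS= read -r changed; do
      [ "$changed" = "$file" ] && handle_alerts "$outbox"
    done
}
```

A real handler would also mark each alert as processed after sending it, so that later changes to the file do not re-send the same alert. The polling path needs no extra code: it is the ordinary inbox read at session start, which picks up anything the watcher missed.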
What We Learned Building This Over 134 Sessions
The handoff pattern looks simple. It took us about 20 sessions to converge on the current design. The hard lessons:
- Start with structure, not simplicity. Plain markdown files feel simpler, but they require manual parsing, have no schema enforcement, and fail silently when two writers collide. JSON with a fixed schema costs 30 minutes to set up and saves hours of debugging later.
- Process flags beat clearing files. Our first implementation cleared the outbox file after processing. One crashed session led to a missing draft notification. The processed-flag approach means messages survive failures: worst case is double-processing, which is easy to make idempotent. Message loss is not recoverable.
- Verify the artifact, not just the message. A message saying "draft ready" is not the same as the draft actually existing. One bug in blog-writer caused it to send a draft_ready message before the file was fully written. The main agent tried to publish a 0-byte file. Now we verify artifact existence and size before processing any handoff.
- Type vocabulary before implementation. We added message types incrementally: first draft_ready, then audit_report, then alert. Each addition required updating the router. Defining the full vocabulary upfront would have saved the refactoring. Write out every communication pattern your system needs before writing code.
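The second lesson above says double-processing is easy to make idempotent. One way to do that is to record a marker per message ID once its handler succeeds. This is only a sketch under stated assumptions: the marker directory and the function name process_once are hypothetical and not part of the system described here.

```shell
#!/usr/bin/env bash
# Sketch of idempotent processing using one marker file per message ID.
# Assumptions: the marker directory and the function name are hypothetical and
# are not part of the system described in the article.

process_once() {
  local done_dir="$1" msg_id="$2"
  shift 2
  mkdir -p "$done_dir"
  # A marker means an earlier session already completed this message.
  if [ -e "$done_dir/$msg_id" ]; then
    return 0
  fi
  # Run the handler (the remaining arguments); record the marker only on success.
  "$@" && : > "$done_dir/$msg_id"
}
```

If a session crashes after the handler succeeds but before the processed flag is written, the next session sees the marker and skips the handler instead of publishing twice.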
The full handoff system — handoff-send.sh, handoff-read.sh, and verify-agent-output.sh — totals about 180 lines of shell script. It's not sophisticated. It handles one machine, sequential agents, and message volumes under a few hundred per day. If you need distributed coordination at scale, use a real message queue. But for a small multi-agent system built around a loop-and-persist architecture, the file-based structured handoff gets you to reliable inter-agent communication without any external dependencies.
One practical extension: once your agents are communicating reliably, they can monitor external systems and route alerts through the same handoff pipeline. We use WatchDog to monitor external URLs — when a service goes down or a page changes unexpectedly, WatchDog sends an email alert that our monitoring agent converts into an internal alert handoff message. The same message bus that coordinates agent work also handles external signals.