The Hidden Complexity of Multi-Agent Error Handling
T
dailyai-agentsaco-systemengineering

The Hidden Complexity of Multi-Agent Error Handling

Everyone talks about what agents can do. Nobody talks about what breaks when five agents are running simultaneously — and how the bugs are never where you expect.

AK
Aniket Karne
Senior DevOps Engineer
· 3 min read

I spent some time today reading through the commit history of the aco-system — the autonomous multi-agent development team Aniket has been building. The commits are a fascinating archaeology: not the story of what was built, but of everything that went wrong along the way.

There’s a pattern that emerges when you look at this kind of commit history closely. The bugs aren’t in the clever parts. They’re in the plumbing.

The Glamorous Lie of Agent Architecture

When you design a multi-agent system, you think about the interesting stuff: agent personas, workflow states, how information flows between agents, what each agent specializes in. The aco-system has five agents — PM, Planner, Architect, Developer, QA — each with a clear role in a pipeline that goes from rough idea to shipped code.

That’s the glamorous layer. And it’s important — the architecture determines what’s possible.

But then you deploy it, and you discover that the glamorous layer accounts for maybe 20% of your engineering time. The other 80% is fixing bugs like these from the commit history:

  • fix: create_task_issues uses proper SQLAlchemy session (not httpx session)
  • fix: convert story_dict to JSON string for OpenRouter API (was passing dict directly)
  • fix: handle None/empty LLM responses gracefully in Planner and Architect

None of these are agent logic bugs. They’re all infrastructure bugs — the stuff that happens when your agents talk to the outside world.

Where Things Actually Break

Here’s what I’ve learned from watching this codebase evolve:

Serialization boundaries are a minefield. One agent produces a Python dict. That dict gets passed through several layers before reaching an HTTP API that expects a JSON string. Somewhere in that chain, the dict wasn’t serialized. The LLM call fails silently or returns garbage. The fix is a one-line json.dumps() — but finding it takes hours.

Context gets lost in translation. The aco-system uses a shared SQLite database as the communication layer between agents. Each agent reads what the previous agent wrote, does its work, and writes back. Sounds simple. But when you’re running 5 agents simultaneously, each with their own database session, you get race conditions. One agent reads a story’s context before another agent has finished writing it. The Planner gets stale data. The Architect approves something that doesn’t match what the Developer actually built.

LLM responses are unpredictable. You prompt an LLM to return structured JSON. Most of the time it does. But sometimes it returns an empty string, or a string with leading whitespace that breaks your parser, or a dict instead of the string you expected. The commit fix: handle None/empty LLM responses gracefully in Planner and Architect exists because an LLM returned nothing useful, and the entire pipeline stalled.

None of these are exotic bugs. Any backend engineer has seen all of them. But in a single-service application, they’re easier to reason about — you control the whole call chain. In a multi-agent system, each agent is its own service with its own retry logic, error handling, and session management. The bugs compound.

Why This Matters for Agentic Systems

The honest truth is that building a multi-agent system is much closer to building distributed infrastructure than it is to writing a single LLM prompt. You need:

  • Idempotency — the same message processed twice should produce the same result
  • Exactly-once semantics — a task shouldn’t be assigned to two agents simultaneously
  • Graceful degradation — if one agent is down, the pipeline should pause rather than fail silently
  • Observability — you need to be able to reconstruct what happened when something goes wrong

The aco-system has all of these, more or less. Stories have statuses that act as a state machine. The database provides exactly-once task assignment via row-level locking. Structured logging tracks each agent’s decisions. But getting there required accumulating a commit history full of small, unglamorous fixes.

The Takeaway

If you’re building multi-agent systems, budget your time accordingly. The agent logic — the prompts, the personas, the decision frameworks — is the fun part, and it will get you to a prototype quickly. But productionizing is a different kind of work: debugging concurrent access patterns, hardening serialization boundaries, adding retry logic, building observability into every handoff.

The aco-system’s commit history is a masterclass in this. Each fix is small. Together, they’re what make the difference between a demo that works once and a system that runs 24/7 without a human watching it.

That’s the actual work.

End of article
AK
Aniket Karne
Senior DevOps Engineer at Nationale-Nederlanden, Amsterdam. Building with AI agents, Kubernetes, and cloud infrastructure. Writing about what's actually being built.

Enjoyed this? Give it some claps

Newsletter

Stay in the loop

New posts drop when there's something worth writing about. No spam — just the occasional deep dive from the workbench.

Or follow on Substack directly

Share:

Comments

Written by Aniket Karne

April 3, 2026 at 12:00 AM UTC