Why do multi-agent AI systems break in production?

Almost always for engineering reasons, not model reasons. The six failure modes we hit most: implicit hand-off contracts, context bloat, invisible state, implicit loops, reviewer agents that don't really review, and brittleness when the underlying model is upgraded. A multi-agent system is a distributed system and breaks like one.

How do you stop AI agents from passing bad data to each other?

Define every inter-agent hand-off as an explicit, validated schema — the agent emits a structured object, not free-form prose, and the receiver reads the object. Free-form hand-offs cause cascading interpretation errors that compound across the chain; a multi-agent system without enforced schemas is a distributed system with no wire protocol.

How do you control cost and runaway loops in agentic systems?

Enforce budgets at every level — per-call timeouts, per-workflow token budgets, per-agent step limits, and circuit breakers that trip when an agent is called too often. Treat every agent invocation as a resource-bounded operation that fails loudly before it does damage; rate limits and capacity ceilings are now the primary operational constraint on agentic systems.

Should you let production AI agents auto-upgrade to new models?

No. Pin model versions explicitly the way you pin library versions, and run any new model against your eval and regression suites before it goes live — a new model generation shifts every agent's behaviour at once. xlabs never auto-upgrades the underlying model on a production agentic workflow.

Multi-agent orchestration in production — what actually breaks

A single agent doing a single thing is a manageable engineering problem. Three agents passing work between each other is a different category of problem entirely. Twelve agents coordinated across a real production workflow is, in our experience, where most teams discover that the demo and the system are not the same thing.

Multi-agent architectures are the pattern showing the strongest correlation with measurable agentic ROI in the 2026 enterprise data. They are also the pattern with the highest distance between "works on a slide" and "works at 3am under load". This piece is about that distance — what actually breaks when you run a multi-agent system in production, and how we engineer around it at xlabs.

This is a technical piece. It assumes you've already built or run an agent or two and are now staring down the architectural question of how to wire several together without producing a mess.

Failure mode one: implicit hand-off contracts

The most common early failure isn't a bug. It's a missing schema.

Every hand-off between agents is, in effect, an API call. In the early days of a build, those calls tend to be loosely shaped — the planning agent emits a free-form text plan, the executor agent reads it, the reviewer agent comments on it. It works in development because the agents are interpreting each other's output charitably and the test cases are narrow.

In production, the slack disappears. A free-form hand-off contract turns into a source of cascading interpretation errors: each agent makes a slightly wrong assumption about what the previous agent meant, and the error compounds across the chain. By the time you're three hops in, the system is doing something coherent but unrelated to the user's intent.

The fix is structural. We define every inter-agent hand-off as an explicit, validated schema — the same way we'd define an API between two microservices. The agent emits a structured object, not a paragraph. The receiving agent reads the object, not the prose. Where we need natural-language reasoning to flow through the system, we attach it as a field on the object alongside the structured payload, not in place of it.
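As a minimal sketch of what such a contract can look like, here is a planner-to-executor hand-off validated with the Python standard library only. The names `PlanHandoff`, `PlanStep`, `parse_plan_handoff` and the tool list are hypothetical, chosen for illustration; a real build might use a schema library instead.

```python
import json
from dataclasses import dataclass


class HandoffError(ValueError):
    """Raised when a hand-off payload does not match the agreed schema."""


# Assumed tool names, for illustration only.
ALLOWED_TOOLS = {"search", "read_file", "write_file"}


@dataclass(frozen=True)
class PlanStep:
    tool: str
    arguments: dict


@dataclass(frozen=True)
class PlanHandoff:
    """Hypothetical planner -> executor contract."""
    task_id: str
    steps: tuple
    # Free-form reasoning travels as a field next to the payload, not instead of it.
    rationale: str = ""


def parse_plan_handoff(raw: str) -> PlanHandoff:
    """Validate the planner's JSON output; reject anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise HandoffError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise HandoffError("payload must be a JSON object")

    task_id = data.get("task_id")
    if not isinstance(task_id, str) or not task_id:
        raise HandoffError("task_id must be a non-empty string")

    raw_steps = data.get("steps")
    if not isinstance(raw_steps, list) or not raw_steps:
        raise HandoffError("steps must be a non-empty list")

    steps = []
    for i, step in enumerate(raw_steps):
        if not isinstance(step, dict):
            raise HandoffError(f"steps[{i}] must be an object")
        tool = step.get("tool")
        if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
            raise HandoffError(f"steps[{i}].tool {tool!r} is not an allowed tool")
        arguments = step.get("arguments", {})
        if not isinstance(arguments, dict):
            raise HandoffError(f"steps[{i}].arguments must be an object")
        steps.append(PlanStep(tool=tool, arguments=arguments))

    rationale = data.get("rationale", "")
    if not isinstance(rationale, str):
        raise HandoffError("rationale must be a string")
    return PlanHandoff(task_id=task_id, steps=tuple(steps), rationale=rationale)
```

The executor only ever receives a `PlanHandoff`. A planner that drifts back into prose fails at the boundary with a `HandoffError` rather than three hops later.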

A multi-agent system without enforced hand-off schemas is a distributed system without a wire protocol. It will work in development. It will not survive production.

Failure mode two: context bloat

The naive instinct in multi-agent design is to share everything with everyone. The planning agent has the full conversation history. The executor agent has the full plan plus the full conversation history. The reviewer agent has the executor's output plus the full plan plus the full conversation history. By the fourth hop, each agent is reasoning over a context window that is mostly noise.

This produces three problems. Latency rises because every call is bigger. Cost rises because every token is paid for. Accuracy drops because the model is doing more work to find the relevant signal inside an increasingly diluted context.

The engineering principle is the same as in any well-designed system: each component gets the minimum information it needs to do its job. The planning agent gets the user's intent and a system summary. The executor gets the structured plan and the specific files or tools it needs. The reviewer gets the executor's output and the acceptance criteria, not the full plan.
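One way to make that principle hard to violate is an explicit allow-list of fields per agent role. The sketch below assumes a hypothetical `WorkflowState` record and `CONTEXT_POLICY` mapping; the field names are illustrative, not a prescribed layout.

```python
from dataclasses import dataclass


@dataclass
class WorkflowState:
    """Everything the workflow knows. No single agent sees all of it."""
    user_intent: str
    system_summary: str
    conversation_history: list
    plan: dict
    resources: list          # the specific files or tools the executor needs
    executor_output: str
    acceptance_criteria: list


# Assumed policy: the only fields each role is allowed to read.
CONTEXT_POLICY = {
    "planner": ("user_intent", "system_summary"),
    "executor": ("plan", "resources"),
    "reviewer": ("executor_output", "acceptance_criteria"),
}


def build_context(role: str, state: WorkflowState) -> dict:
    """Return only the fields the given role is allowed to see."""
    if role not in CONTEXT_POLICY:
        raise KeyError(f"no context policy defined for role {role!r}")
    return {name: getattr(state, name) for name in CONTEXT_POLICY[role]}
```

Because the policy is data, widening an agent's context becomes a reviewable change rather than a side effect of passing the whole state object around.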

Context is a resource. Treat it like one.

Failure mode three: invisible state

A single-agent system has obvious state. The conversation is the state. A multi-agent system has distributed state — what the planner believes, what the executor did, what the reviewer rejected, what the retry agent is currently trying — and most early systems make none of that state observable.

The debugging consequence is severe. When something goes wrong, the operator has the user's input and the final (broken) output, and nothing in between. The system has eaten the evidence.

We instrument every agent in a production deployment to emit structured events for every meaningful decision: input received, plan generated, tool called, result returned, error caught, retry attempted. These events flow into the same observability stack as the rest of our application telemetry. We use the same tools — traces, logs, metrics, alerts — that we'd use for any distributed system.
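A rough sketch of that kind of instrumentation, using only the standard library. `AgentEvent` and `EventLog` are hypothetical names, and the in-memory list stands in for whatever observability backend is actually in use; the event types mirror the list above.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

EVENT_TYPES = {
    "input_received", "plan_generated", "tool_called",
    "result_returned", "error_caught", "retry_attempted",
}


@dataclass
class AgentEvent:
    trace_id: str
    agent: str
    event_type: str
    payload: dict
    timestamp: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class EventLog:
    """In-memory stand-in for a real telemetry pipeline (an assumption)."""

    def __init__(self, sink=None):
        self._events = []
        self._sink = sink  # optional callable that receives one JSON line per event

    def emit(self, trace_id, agent, event_type, **payload):
        if event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type {event_type!r}")
        event = AgentEvent(trace_id, agent, event_type, payload)
        self._events.append(event)
        if self._sink is not None:
            # default=str keeps logging from failing on non-JSON payload values
            self._sink(json.dumps(asdict(event), sort_keys=True, default=str))
        return event

    def replay(self, trace_id):
        """Return every event for one workflow run, in emission order."""
        return [e for e in self._events if e.trace_id == trace_id]
```

A typical call site would be `log.emit(trace_id, "executor", "tool_called", tool="search")`. Grouping by `trace_id` is what lets an operator reconstruct a single run after the fact.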

A multi-agent system is a distributed system. Observability is not optional. If you can't replay the trace that produced a wrong output, you cannot fix the system. You can only re-prompt it and hope.

Failure mode four: implicit loops

Agents can call other agents. Agents can call themselves. Agents can call tools that call agents. In a tightly coupled system, it is shockingly easy to introduce a cycle that nobody noticed at design time.

The symptoms in production: latency spikes, cost spikes, occasional infinite loops, and rare but catastrophic runaway runs that absorb thousands of dollars of model spend before someone notices.

The engineering response is the same as in any system that allows recursion: budget enforcement at every level. Per-call timeouts, per-workflow token budgets, per-agent step limits, circuit breakers that trip when an agent is being called too often. We treat any agent invocation as a resource-bounded operation. If it exceeds its budget, it fails — loudly, traceably, and before it does damage.
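The sketch below shows three of those four controls in one place: a per-workflow token budget, a per-agent step limit, and a simple sliding-window circuit breaker. `WorkflowBudget` and `BudgetExceeded` are hypothetical names. Per-call timeouts are left out because they depend on whichever model client is in use.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised loudly when an agent invocation would exceed its budget."""


class WorkflowBudget:
    def __init__(self, max_tokens, max_steps_per_agent, max_calls_per_window,
                 window_seconds=60.0, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.max_steps_per_agent = max_steps_per_agent
        self.max_calls_per_window = max_calls_per_window
        self.window_seconds = window_seconds
        self._clock = clock          # injectable so the breaker can be tested
        self.tokens_used = 0
        self._steps = {}             # agent name -> steps taken in this workflow
        self._recent_calls = {}      # agent name -> timestamps inside the window

    def charge(self, agent, tokens):
        """Call before every agent invocation; raises instead of overspending."""
        now = self._clock()
        recent = [t for t in self._recent_calls.get(agent, [])
                  if now - t < self.window_seconds]
        if len(recent) >= self.max_calls_per_window:
            raise BudgetExceeded(
                f"circuit breaker open: {agent!r} called {len(recent)} times "
                f"in {self.window_seconds}s")
        steps = self._steps.get(agent, 0) + 1
        if steps > self.max_steps_per_agent:
            raise BudgetExceeded(f"{agent!r} exceeded {self.max_steps_per_agent} steps")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"workflow token budget of {self.max_tokens} would be exceeded")
        # Only record the call once every check has passed.
        recent.append(now)
        self._recent_calls[agent] = recent
        self._steps[agent] = steps
        self.tokens_used += tokens
```

Checking before the call rather than after is the design choice that matters here: a rejected invocation costs nothing, and the exception carries enough detail to show up usefully in a trace.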

The 2026 industry guidance on this is consistent: rate limits and capacity ceilings are now the primary operational constraint on agentic systems. Treat them as a first-class engineering concern, not an afterthought.

Failure mode five: the reviewer that doesn't review

A common multi-agent pattern uses a "reviewer" or "critic" agent to evaluate another agent's output. In principle, this provides quality control without needing a human in the loop on every call.

In practice, the reviewer is often the weakest link.

Reviewer agents fail in three ways. They become sycophantic: they tend to confirm whatever the executor produced, especially when the prompting is leading. They become bureaucratic: they reject everything for trivial reasons and create infinite revision loops. Or they become inconsistent: they accept an output in one run and reject the identical output in another, because the model's judgement varies more than the underlying logic justifies.

We engineer around this in three ways. First, reviewer agents are given explicit, structured acceptance criteria — not "is this good?" but "does this satisfy clauses 1 through 7?" Second, we use eval suites — actual test cases with known-good and known-bad inputs — to calibrate the reviewer regularly. A reviewer is a model with a job description; like any model with a job description, it has to be evaluated. Third, where the cost of a wrong review is high, we put a human in the loop — not on every call, but on a sampled basis sufficient to detect drift.
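To make the first two ideas concrete, here is a small sketch of clause-by-clause review plus a calibration harness that scores any reviewer against labelled cases. The `Criterion`, `review` and `calibrate` names are hypothetical, and the deterministic checks stand in for what would usually be a model-backed reviewer; the harness accepts any callable, so one could be swapped in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    clause: str                    # human-readable acceptance clause
    check: Callable[[str], bool]   # returns True when the output satisfies it


def review(output, criteria):
    """Return (accepted, failed_clauses) instead of a bare 'looks good'."""
    failed = [c.clause for c in criteria if not c.check(output)]
    return (not failed, failed)


def calibrate(reviewer, labelled_cases):
    """Score a reviewer against (output, should_accept) pairs.

    A high false-accept rate suggests a sycophantic reviewer;
    a high false-reject rate suggests a bureaucratic one.
    """
    false_accepts = false_rejects = good = bad = 0
    for output, should_accept in labelled_cases:
        verdict = bool(reviewer(output))  # call once: a model may not be deterministic
        if should_accept:
            good += 1
            false_rejects += not verdict
        else:
            bad += 1
            false_accepts += verdict
    return {
        "false_accept_rate": false_accepts / bad if bad else 0.0,
        "false_reject_rate": false_rejects / good if good else 0.0,
    }


# Illustrative clauses only; real acceptance criteria are workflow-specific.
CRITERIA = [
    Criterion("1. output is non-empty", lambda out: bool(out.strip())),
    Criterion("2. output contains no TODO markers", lambda out: "TODO" not in out),
]
```

Running `calibrate` repeatedly over the same labelled cases is also a cheap way to surface the third failure, inconsistency: a reviewer whose rates move between runs on identical input is varying more than its criteria justify.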

A reviewer agent you haven't evaluated is a reviewer agent you're trusting blindly. We don't.

Failure mode six: brittleness under model upgrades

A multi-agent system is built against the current generation of models. When a new generation lands — and it lands fast in this market — the behaviour of every agent in the system shifts at once.

This isn't a bug; it's a property of the environment. But it means that a system tuned against last quarter's model can break in ways that are confusing to diagnose when the underlying model changes. Prompts that produced reliable structured output now produce slightly different structured output. Reviewer thresholds that were tight are now loose. Hand-off contracts that were respected are now occasionally violated.

The engineering discipline is to treat model versions the way you treat library versions: pin them explicitly in production, upgrade deliberately with regression suites, and run new model versions against your eval suite before they go live in the system. We never auto-upgrade the underlying model on a production agentic workflow. Ever.
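A minimal sketch of that discipline, assuming pins are held in a plain mapping and that `run_case(model, case)` is a caller-supplied function returning whether one eval case passes. The model identifiers, the function names, and the rule that any identifier ending in "latest" is a floating alias are all assumptions for illustration.

```python
def resolve_model(pins, agent):
    """Look up the pinned model for an agent; refuse floating aliases."""
    model = pins[agent]
    if model.endswith("latest"):  # assumed naming convention for unpinned aliases
        raise ValueError(f"{agent!r} is not pinned to an explicit version: {model!r}")
    return model


def evaluate(model, cases, run_case):
    """Fraction of eval cases the given model passes."""
    if not cases:
        raise ValueError("refusing to evaluate against an empty eval suite")
    passed = sum(1 for case in cases if run_case(model, case))
    return passed / len(cases)


def promote_if_safe(pins, agent, candidate, cases, run_case):
    """Move the pin to `candidate` only if it does not regress on the suite."""
    baseline = evaluate(resolve_model(pins, agent), cases, run_case)
    score = evaluate(candidate, cases, run_case)
    if score < baseline:
        return False              # keep the current pin; investigate offline
    pins[agent] = candidate       # deliberate, recorded upgrade
    return True
```

In practice the comparison would likely be per-case rather than a single pass rate, since a candidate can match the baseline overall while breaking specific hand-offs. The shape of the gate stays the same.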

What we hold ourselves to

The principles that fall out of these failure modes are the same ones we'd apply to any production-grade distributed system, with a few twists particular to agentic work.

Hand-offs are typed and validated. Context is minimised per agent. State is observable end to end. Budgets are enforced at every level. Reviewers are evaluated. Model versions are pinned and regression-tested. Every meaningful decision emits an event. Every workflow has a kill switch.

Multi-agent orchestration in production is not, in the end, a model problem. It is an engineering problem. The teams that succeed at it are the teams who recognise that and apply the same discipline they would apply to any high-leverage piece of distributed infrastructure.

The model is one component. The system is everything around it. Build the system right and the agents earn their keep. Build the agents without the system and you have a demo that works on Tuesday and breaks on Wednesday.

We've built both. The discipline is the difference.
