WSB-10 · 2026-05-20

The Multi-Agent Anti-Pattern: A Production Field Study of Context Dilution Under Inter-Agent Communication

Cognition steel-man validated · 14-deployment audit · 12 of 14 multi-agent deployments rolled back or abandoned · context dilution dominant in 11 of 14.

Madani Lab · steel-man Cognition Labs · field study 14 MA deployments

multi-agent · anti-pattern · context-dilution · Cognition · production · DPI · field-study

Abstract

We report a field study of 14 multi-agent system (MAS) production deployments across 11 EU companies, conducted between January 2025 and April 2026 with structured interviews and audit-based root-cause analysis. The study tests the predictions of Cognition Labs' contrarian steel-man "Don't Build Multi-Agents" (cognition.ai blog, 2025) against empirical production data. The agent-swarm architectural pattern, in which multiple specialized agents coordinate through structured communication, is the default of several popular frameworks (CrewAI, AutoGen, LangGraph, OpenAI Assistants multi-agent extensions) and has become the dominant mental model among AI engineers since 2023. Cognition's argument, grounded in internal Devin engineering experience, claims that inter-agent context-sharing is fundamentally lossy and that single-thread agents with deeper context outperform multi-agent systems in production. The argument was theoretical and informal; Tran & Kiela (arXiv:2604.02460, accepted April 2026) subsequently provided the academic confirmation via Data Processing Inequality (DPI) bounds (WSB-05). This paper provides the production empirical confirmation via a 14-deployment field study and produces an operational anti-pattern catalog that maps each common multi-agent design to its observed failure mode. The finding is operationally consequential: of 14 deployments, 2 remain production-stable, 7 have been rolled back to single-agent architectures, and 5 have been abandoned, an 86% failure-or-rollback rate. We report SEVEN counterintuitive findings, including:

  (a) MULTI-AGENT FAILS IN PRODUCTION 86% OF THE TIME: the 12-of-14 failure rate is the most direct refutation of the multi-agent-as-default heuristic available in published production data
  (c) The Cognition steel-man predicted this empirical finding 10 months before Stanford's academic paper: a practitioner blog out-predicted an academic paper, inverting the usual epistemological hierarchy and reinforcing the WSB-05 §32-33 argument that the field should re-weight practitioner reports
  (e) Engineers respond to context dilution by adding more structure to handoffs, which compounds the problem: JSON schemas, type validations, and structured task-state objects consume the very token budget that decomposition was supposed to save, increasing total token spend to 1.4-2.8× the single-thread baseline

INTRODUCTION · §1

The persistent attraction of multi-agent

The multi-agent architecture pattern — decomposing a task across multiple specialized agents that communicate through a shared protocol — became the default mental model of agentic engineering between 2023 and early 2025. The convergence is traceable: AutoGen (Microsoft Research, NeurIPS 2024) introduced the "agents conversing with agents" paradigm; CrewAI (Moura, Q1 2024) productized the pattern as a "crew" abstraction with explicit role specifications; MetaGPT (Hong et al., ICLR 2024) formalized the multi-agent software-team metaphor; LangGraph (LangChain, 2024) shipped graph-based orchestration; OpenAI Assistants API multi-agent extensions (late 2024-2025) integrated handoffs at the platform level. The convergence across academic labs, commercial frameworks, and platform vendors all shipping multi-agent as the natural next step beyond single-agent is striking — none challenged the assumption that decomposition was the right default.

The intuitive appeal is culturally entrenched: divide-and-conquer is a foundational engineering pattern; modular decomposition is taught as best practice across software-engineering curricula; human organizations work this way. Software engineers trained on object-oriented design, microservices, and Unix-philosophy modularity find the multi-agent paradigm cognitively familiar — comfortable, even.

INTRODUCTION · §2

Why the intuition is wrong

The intuition is wrong for LLM agents under matched compute. Tran & Kiela (arXiv:2604.02460, Stanford NLP, April 2026) provide the theoretical refutation grounded in information theory: by the Data Processing Inequality (DPI; Shannon 1948), no processing step applied to a message can increase the information it carries about its source. Inter-agent communication via natural-language summaries is such a processing step; each hop is lossy, and the loss compounds. Their empirical confirmation across three model families on multi-hop reasoning tasks shows that under matched token budgets, single agents consistently match or beat multi-agent decompositions, with the apparent MA advantages in earlier papers attributable to unaccounted compute and context-utilization artifacts. WSB-05 replicated the Tran/Kiela result in production on 8 head-to-head single-agent (SA) vs multi-agent (MA) comparisons and confirmed that single-thread wins in 7 of 8 cases. The convergence of independent academic and practitioner conclusions across multiple methodologies is significant.

"Don't build multi-agents. Context sharing across agents is the bottleneck · every hop introduces information loss. A single agent with deep context outperforms a swarm with fragmented context."
Cognition Labs · Engineering Blog 2025
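The compounding loss across hops can be stated compactly. The notation here is introduced only for illustration and is not taken from Tran & Kiela: let $X$ be the original task specification and $M_k$ the handoff message after the $k$-th hop, and assume each message is produced from the previous message alone, so that the sequence forms a Markov chain.

```latex
X \to M_1 \to M_2 \to \cdots \to M_k
\quad\Longrightarrow\quad
I(X; M_k) \le I(X; M_{k-1}) \le \cdots \le I(X; M_1) \le H(X)
```

Under that assumption, no downstream agent can recover information about $X$ that an earlier handoff dropped. Equality at a hop requires that the hop preserve everything the previous message carried about $X$.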

INTRODUCTION · §3

The Cognition steel-man

Cognition Labs published "Don't Build Multi-Agents" (cognition.ai blog, 2025) as a non-peer-reviewed steel-man argument informed by their internal engineering experience building Devin. Their core claims:

  (a) context-sharing across agents is the dominant failure mode in multi-agent systems
  (b) single-thread agents with deeper context outperform multi-agent decompositions

INTRODUCTION · §4

What this paper adds

WSB-05 reported a controlled production experiment (8 head-to-head SA-vs-MA comparisons at matched token budget). This paper reports the complementary observational study: 14 multi-agent deployments observed in their natural production environments without controlled comparison. The observational design captures phenomena that controlled experiments miss: how multi-agent systems actually fail in production, what the operational consequences are, what remediations teams attempt, and whether those remediations work. The contribution is four-fold: (1) empirical confirmation of the Cognition steel-man at production scale, (2) the anti-pattern catalog mapping 7 multi-agent designs to their observed failure modes, (3) the 3-condition DPI test for evaluating proposed multi-agent decompositions, (4) the meta-finding that practitioner blog evidence preceded academic confirmation by ~10 months.

METHOD · §5

Deployment identification

We identified 14 enterprise deployments that explicitly used a multi-agent architecture (>= 3 cooperating agents) in production for at least 90 days. Recruitment was via: (a) the WSB-08 47-pilot field study sample (4 of the WSB-08 pilots used MA architectures), (b) referrals from CTO peer groups specifically asking about multi-agent deployments, (c) cold outreach to enterprises whose public materials described MA architectures. We restricted to >= 3 agents to focus on genuinely multi-agent designs (not 2-agent specialist+critic patterns which are closer to single-thread with a reviewer).

Geographic distribution: 7 Italy, 4 France, 3 Germany. Vertical distribution: 6 financial services, 3 e-commerce, 2 healthcare, 2 industrial, 1 media.

METHOD · §6

Interview protocol

We conducted structured interviews (90-120 min, recorded with consent) with the lead engineer of each deployment. The interview guide covered: (a) the architectural rationale for choosing multi-agent — what drove the team to MA over single-thread? (b) the specific architectural pattern — how many agents, what topology, what handoff format? (c) the failure modes observed in production — what broke, when, how often? (d) the remediations attempted — what did the team try, did it work? (e) the current state of the system — still production, rolled back, or abandoned? (f) for failed deployments: the proximate reason for failure and the retrospective root-cause analysis. Interviews were transcribed and coded by two independent coders for failure-pattern identification.

METHOD · §7

Communication-log audit

Where access was granted (9 of 14 deployments), we additionally audited the agent communication logs over a 30-day window. We classified inter-agent messages by quality (signal vs noise) and identified handoff failure events. Signal classification: a message conveyed information that materially influenced the downstream agent's behavior. Noise classification: a message conveyed information that the downstream agent did not use. Handoff failure event: a downstream agent acted on incorrect or incomplete information that originated in a handoff. The audit complemented the interview data by providing objective evidence of failure-mode prevalence.

METHOD · §8

Outcome classification

We classified each deployment's current state: PRODUCTION-STABLE (still in production with sustained measurable usage at audit time), ROLLED-BACK (refactored to single-agent or hybrid architecture), or ABANDONED (decommissioned without replacement). Of the 14 deployments, 2 are production-stable, 7 are rolled-back, 5 are abandoned. The 12 non-stable deployments are the focus of failure-mode analysis.

```ascii
MULTI-AGENT FAILURE MODES · MAST taxonomy
────────────────────────────────────────
┌──────────────────────────────────────────┐
│ 14 FAILURE MODES across 3 categories     │
├──────────────────────────────────────────┤
│ SPEC (5)     COORD (6)     TASK-EXEC (3) │
├──────────────────────────────────────────┤
│ • repeat     • info loss   • premature OK│
│ • drift      • stale msg   • silent fail │
│ • role-X     • dup work    • incomplete  │
│ • spec-X     • deadlock                  │
│ • verif-X    • collusion                 │
│              • hand-off                  │
└──────────────────────────────────────────┘
                     │
                     ▼
            ┌──────────────────┐
            │ 78.7% of runs    │
            │ exhibit ≥ 1 mode │
            └──────────────────┘
```

RESULTS · §9 · HEADLINE · 86% FAILURE OR ROLLBACK. Of the 14 deployments: 2 (14%) remain production-stable, 7 (50%) have been rolled back to single-agent architectures, 5 (36%) have been abandoned.

[!PRODUCTION Multi-agent audit · 6 frameworks] 78.7% of multi-agent production runs exhibit at least one of the 14 MAST failure modes (Cemri et al. 2025 · 11 systems audited including CrewAI, AutoGen, ChatDev). Most frequent failure mode: Information loss between agents (34% of runs). Recovery rate post-failure without human supervision: 11%. Madani DPI policy: single-thread default · multi-agent only with 3 concurrent conditions (degradation > 50KB · 2× budget · Nour approval).

The 86% failure-or-rollback rate is the most direct empirical refutation of the multi-agent-as-default heuristic available in published production data.
The 2 stable deployments are notable: both naturally decompose into independent sub-tasks (content generation with parallel section drafts; image processing pipeline with independent transformations). Both have inter-agent mutual information (MI) measured below 0.1 nats per task, the threshold WSB-05 §22 identified as the DPI-safe partition boundary. The deployments that work are the deployments that satisfy DPI; the deployments that don't, fail.

RESULTS · §10 · COUNTERINTUITIVE FINDING 1 · 86% MAS FAILURE. The 86% MAS failure rate is materially higher than the 75% failure rate of single-thread-equivalent pilots in the WSB-08 field study (which reports 93% across all architectures and ~95% for explicitly multi-agent pilots). The differential is not subtle: a team that chooses MA architecture in production is materially more likely to fail than a team that chooses SA. The mechanism: MA architectures introduce structural failure modes (context dilution, handoff friction, orchestration overhead) that SA architectures do not have, and these structural failure modes compound with the baseline AI-pilot failure modes that affect everyone.

RESULTS · §11 · COUNTERINTUITIVE FINDING 2 · THE 2 STABLE DEPLOYMENTS HAVE INDEPENDENT SUB-TASKS. The two production-stable MAS deployments are: (A) a content-generation pipeline for a media company that decomposes article generation into 4-6 independent sections, each drafted by a separate agent, with a final integration pass. The sections are deliberately independent (the article structure is "4 industry reports, no cross-cutting argument") and inter-partition MI measures 0.04 nats. (B) an image-processing pipeline for a retail company that decomposes product-image enhancement into 5 transformations (background removal, color correction, sizing, watermarking, format conversion).
The transformations are sequential but each step's output is the next step's input with low semantic content: the MI between transformation 1 and transformation 5 is 0.08 nats. Both deployments are structurally edge cases: tasks with near-zero inter-partition mutual information. The general case (rich cross-cutting context) does not fit this profile. DPI predicts MA wins only in the edge case; both stable deployments are edge cases; the prediction holds.

RESULTS · §12 · COUNTERINTUITIVE FINDING 3 · COGNITION PREDICTED IT 10 MONTHS BEFORE STANFORD. Cognition published "Don't Build Multi-Agents" in mid-2025. Tran & Kiela's Stanford paper was accepted in April 2026, approximately 10 months later. The empirical pattern we report here was predictable from the Cognition steel-man at the time of its publication; the Stanford paper provided the formal information-theoretic justification subsequently. Practitioner blog out-predicted academic paper. This inverts the usual epistemological hierarchy (peer-reviewed paper > arXiv preprint > industry-lab blog > engineering team blog) and reinforces the argument from WSB-05 §32-33 that the field should weight practitioner reports as comparable evidence rather than as anecdote, particularly when the practitioner has skin in the game and operates at production scale. We do not propose that practitioners replace academics; we propose that the convergence (when it happens) be treated as strong evidence.

RESULTS · §13 · COUNTERINTUITIVE FINDING 4 · CONTEXT DILUTION IS THE DOMINANT FAILURE PATTERN. 11 of 14 deployments exhibited context dilution as the dominant failure pattern. Mechanism: each agent receives a structured handoff from the previous agent that summarizes the task. After 2-3 hops, the original task specification has been paraphrased so many times that the final agent is solving a related-but-distinct problem. We observed concrete examples across all 11 cases.
The pattern persists even when frameworks ship structured-handoff primitives (CrewAI's task-context object, AutoGen's structured-message format, LangGraph's typed-state schema). The reason: the structure constrains the format of the handoff but does not constrain the semantic compression. A 600-token JSON schema can omit the same critical nuance as a 200-token natural-language summary.

RESULTS · §14 · COUNTERINTUITIVE FINDING 5 · HANDOFF STRUCTURE COMPOUNDS THE PROBLEM. Engineers respond to context dilution by adding more structure to handoffs: richer JSON schemas, more thorough type validations, explicit task-state objects with nested fields. The intuition is that more structure preserves more information. The empirical reality is the opposite: the structure consumes the token budget that the decomposition was supposed to save. In 8 of 14 deployments, total token spend was 1.4-2.8× what a single-thread agent would have used. The MA architecture was supposed to reduce per-agent context size and thus reduce per-call cost; instead the handoff overhead exceeded the savings, and the net cost was higher. Teams often did not realize this because few instrument a matched-budget single-thread baseline against their MA system.

RESULTS · §15 · COUNTERINTUITIVE FINDING 6 · ORCHESTRATION META-AGENTS BECOME BOTTLENECKS. Multi-agent systems require a coordinator. In 6 of 14 deployments, the coordinator became the single point of failure and the most-complained-about component during incident reviews. Failure modes of the coordinator:

  (a) mis-routing: sending sub-tasks to the wrong agent
  (b) under-routing: failing to invoke needed agents
  (c) over-routing: repeatedly re-evaluating which agent should handle the next sub-task, consuming token budget on routing logic that contributes zero to task completion
  (d) routing entropy: the coordinator's routing decisions becoming unpredictable as the system evolves

The coordinator concentrates architectural risk: when it fails, the whole system fails; when it under-performs, all downstream agents under-perform. Single-thread architectures do not have this concentration of risk because there is no separate coordinator.

RESULTS · §16 · COUNTERINTUITIVE FINDING 7 · DEBUGGING-COST PENALTY. Mean time to root-cause analysis for failed multi-agent runs in our sample: 4.2 hours. Mean time to root-cause analysis for comparable single-agent failures from the WSB-05 controlled experiment: 35 minutes. That is roughly a 7× debugging-cost penalty (252 minutes vs 35 minutes). The mechanism is straightforward: with 5+ agents and 30+ inter-agent messages per task, root-cause analysis requires tracing the message graph manually. None of the 12 unstable deployments had adequate observability tooling for this. The MA architecture was supposed to make the system more modular and thus more debuggable; instead it shifted debugging from "step through the agent's reasoning" to "reconstruct the inter-agent communication graph", which is harder. The cost penalty is rarely accounted for in architectural decisions because it surfaces only after deployment, when the team is already committed to the architecture.
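The graph reconstruction described above can be sketched in a few lines. This is a minimal illustration, assuming every logged message records the id of the message it was derived from; the `Message` schema and the helpers `trace_lineage` and `first_hop_missing` are hypothetical names, not part of any framework named in this paper. The sample log reuses the healthcare example from §27.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    """One logged inter-agent message (hypothetical log schema)."""
    msg_id: str
    parent_id: Optional[str]  # message this one was derived from; None for the task input
    sender: str
    receiver: str
    text: str


def trace_lineage(log: list[Message], failing_id: str) -> list[Message]:
    """Walk parent links from a failing message back to the original task input."""
    by_id = {m.msg_id: m for m in log}
    chain: list[Message] = []
    seen: set[str] = set()
    current: Optional[str] = failing_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in message log at {current}")
        if current not in by_id:
            raise KeyError(f"message {current} missing from log")
        seen.add(current)
        msg = by_id[current]
        chain.append(msg)
        current = msg.parent_id
    chain.reverse()  # oldest message first
    return chain


def first_hop_missing(chain: list[Message], phrase: str) -> Optional[Message]:
    """Return the first message in the chain that no longer contains `phrase`."""
    for msg in chain:
        if phrase.lower() not in msg.text.lower():
            return msg
    return None


# Illustrative log: the "at rest" qualifier disappears at the second handoff.
log = [
    Message("m1", None, "intake", "symptom-classifier", "Patient reports chest pain at rest"),
    Message("m2", "m1", "symptom-classifier", "history-retriever", "patient reports chest pain at rest"),
    Message("m3", "m2", "history-retriever", "severity-scorer", "patient reports chest pain"),
]
chain = trace_lineage(log, "m3")
culprit = first_hop_missing(chain, "at rest")
print(culprit.sender if culprit else "phrase preserved")
```

A substring check is a deliberately crude stand-in for whatever semantic comparison a real audit would use; the point is that lineage tracing only becomes cheap when the parent link is logged at write time.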

RESULTS · §17

Anti-pattern catalog

We catalog 7 common multi-agent design patterns observed in our sample and their dominant failure modes.

  (i) ROUTER-WORKER (5 deployments): a coordinator routes sub-tasks to specialized workers. Failure: routing entropy (the coordinator's routing decisions become unpredictable as edge cases accumulate).
  (ii) CHAIN-OF-EXPERTS (3 deployments): sequential pipeline where each agent transforms the previous agent's output. Failure: cumulative dilution (each hop loses task-relevant nuance, and the cumulative loss exceeds tolerance by hop 3-4).
  (iii) PLAN-THEN-EXECUTE (2 deployments): a planner agent produces a plan, an executor agent executes it. Failure: planner-executor handoff drift (the executor's interpretation diverges from the planner's intent).
  (iv) VERIFIER-PAIR (1 deployment): a generator agent produces output, a verifier agent checks it. Failure: collusion or perpetual disagreement (the verifier either rubber-stamps or rejects in a feedback loop).
  (v) HIERARCHICAL (1 deployment): a high-level coordinator delegates to sub-coordinators delegating to workers. Failure: meta-meta-agent infinite regress (the hierarchy adds debugging complexity without adding value).
  (vi) SWARM WITH SHARED BLACKBOARD (1 deployment): agents read from and write to a shared state. Failure: blackboard schema entropy (the shared state schema accumulates fields over time and the schema becomes the new context-dilution bottleneck).
  (vii) COMPETITIVE-TOURNAMENT (1 deployment): multiple agents propose solutions, a judge selects the best. Failure: token-cost blow-up (running N agents per task to pick one wastes N-1 worth of compute).

RESULTS · §18

When multi-agent was the wrong choice

Of the 12 unstable deployments, we asked the teams (post-failure) whether they would make the same architectural choice again. 10 of 12 said no. The 2 who said yes both cited organizational reasons (the MA architecture was politically committed to before the team could reverse it). Zero teams cited technical reasons for staying with MA after the failure experience. The lesson: even teams that built MA initially preferred the SA alternative in retrospect when given the choice.

RESULTS · §19

Communication-log quantitative findings

For the 9 deployments where we audited communication logs, we measured the signal-to-noise ratio of inter-agent messages. Mean message-level SNR: 0.27 (i.e., 73% of inter-agent message content did not influence downstream agent behavior). For comparison, intra-agent context (the agent reading its own tool outputs) has SNR approximately 0.6-0.8 in matched workspace conditions per WSB-09.

The inter-agent SNR is materially lower than intra-agent SNR. This is the empirical signature of the lossy summarization at handoff that DPI predicts.
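The SNR figure above is reported as a share of signal rather than as a signal/noise quotient (0.27 corresponds to 73% noise). A minimal sketch of that calculation follows, assuming each audited message carries a binary signal/noise label; the `AuditedMessage` schema and the token-weighting option are assumptions made for illustration, since the text does not say whether the share is taken over messages or over tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditedMessage:
    """One inter-agent message with an auditor's label (hypothetical schema)."""
    tokens: int
    influenced_downstream: bool  # True = signal, False = noise


def signal_fraction(messages: list[AuditedMessage], weight_by_tokens: bool = False) -> float:
    """Share of messages (or of tokens) labeled as signal, in [0, 1]."""
    if not messages:
        raise ValueError("no messages to audit")
    weights = [m.tokens if weight_by_tokens else 1 for m in messages]
    total = sum(weights)
    if total == 0:
        raise ValueError("total weight is zero")
    signal = sum(w for w, m in zip(weights, messages) if m.influenced_downstream)
    return signal / total


# Illustrative sample, not audit data: one short useful message, three unused ones.
sample = [
    AuditedMessage(tokens=100, influenced_downstream=True),
    AuditedMessage(tokens=300, influenced_downstream=False),
    AuditedMessage(tokens=200, influenced_downstream=False),
    AuditedMessage(tokens=400, influenced_downstream=False),
]
print(signal_fraction(sample))                         # share of messages
print(signal_fraction(sample, weight_by_tokens=True))  # share of tokens
```

The two weightings can diverge noticeably when noise messages are long, so an audit should state which one it reports.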

DISCUSSION · §20

Integration into Madani policy

We integrated Cognition's steel-man and Tran/Kiela's result into the Madani operational policy as 'multi-agent-policy.md' (WAB Pillar 04, Multi-Agent DPI). The policy is operationalized as a 3-condition test that any proposed multi-agent decomposition must pass before deployment: (a) the task admits a clean partition with low inter-partition mutual information (target: <0.1 nats per WSB-05 §22); (b) the budget-stake for the orchestration overhead is justified by parallelism gains; (c) the failure modes are observable and recoverable. Of the 14 studied deployments, only 2 (the stable ones) would have passed this 3-condition test.

The other 12 would have been rejected at design review and re-architected as single-thread. The 3-condition test is now applied as a pre-deployment gate at Madani; in the 8 months since adoption, 5 proposed MA designs have been caught and re-architected.
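Condition (a) depends on an estimate of mutual information in nats. The playbook in §34 describes a classifier-based estimate; the sketch below swaps in a simpler plug-in estimator over discretized labels, purely to make the unit and the 0.1-nat threshold concrete. The function name and the idea of reducing each partition's messages to a discrete label are assumptions for illustration, not the procedure used in this study.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information_nats(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X; Y) in nats from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * ln(p(x,y) / (p(x) p(y))), with empirical counts as probabilities
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return max(mi, 0.0)  # clamp tiny negative values caused by rounding


# Two partitions whose labels are unrelated: estimate is 0 nats.
print(mutual_information_nats([0, 0, 1, 1], [0, 1, 0, 1]))
# Two partitions whose labels always agree: estimate is ln 2 nats.
print(mutual_information_nats([0, 0, 1, 1], [0, 0, 1, 1]))
```

The plug-in estimator is biased upward on small samples, so a value just above 0.1 nats from a few dozen tasks should be treated as inconclusive rather than as a failed condition.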

DISCUSSION · §21

Why the pattern persists

Despite the empirical evidence against MA as a default, the pattern persists in framework defaults and engineer mental models. We argue three contributing factors. (a) COGNITIVE BIAS TOWARD MODULAR DECOMPOSITION — classical software engineering trains "divide and conquer"; the bias transfers; engineers reach for MA architectures because that is what their training taught them to do. (b) DEMO-FAVORABLE BIAS — multi-agent systems demo well; they look impressive with named agents, role differentiation, and explicit handoffs; demos drive framework adoption and engineer buy-in. (c) FRAMEWORK PATH DEPENDENCE — the major frameworks shipped with MA as the easy default, creating an installed-base of MA code that resists migration; teams who built on the framework default are reluctant to refactor even when the data supports it. Changing the default requires confronting all three factors simultaneously, which is the goal of this paper and the multi-agent-policy.md operating policy.

DISCUSSION · §22

When multi-agent is genuinely right

Despite the dominant single-thread preference, MA is the right architecture in 3 specific cases. (1) TASKS WITH PROVABLY INDEPENDENT SUB-TASKS — parallel image transformations, batch document processing, multi-document QA where queries are independent. (2) TASKS REQUIRING STRICT ROLE-ISOLATION FOR SAFETY — one agent generates, a separate agent reviews for compliance; the isolation is the safety property. (3) TASKS WITH HARD LATENCY BUDGETS ACHIEVABLE ONLY VIA PARALLELISM — real-time multi-modal processing where SA cannot meet SLA. These cases account for 5-15% of agentic workloads in our classification, not the 50%+ implied by current framework defaults. Within these cases, the appropriate MA architecture is flat (depth 2: coordinator → workers), not deep (depth 5+) — see WSB-05 §14 for the non-linear hop penalty.

DISCUSSION · §23

Political difficulty of the finding

We acknowledge the political difficulty of this finding. The agent-swarm architecture is intuitive, demos well, and is the default of popular frameworks. Recommending against it goes against the cognitive grain of engineers who have absorbed the "modular = good" lesson from classical software engineering.

The lesson does not transfer cleanly: in classical software, modules communicate via well-defined typed APIs; in agentic systems, modules communicate via natural-language paraphrases, which are inherently lossy. The DPI bound is the formal expression of this loss. The empirical finding is that the loss compounds faster than the parallelism gains in nearly all production tasks.

We expect our finding to be politically uncomfortable for several years; the data nonetheless support it.

DISCUSSION · §24

Integration with WSB-05 and WSB-07

The three reliability-related Madani policies are complementary: WSB-05 (DPI single-thread) provides the controlled experimental evidence (8 head-to-head SA-vs-MA comparisons at matched token budget); WSB-07 (MAST taxonomy) provides the post-mortem diagnostic vocabulary (14 failure modes); this paper (WSB-10) provides the observational field-study evidence (14 production MA deployments and their fate). The three papers together establish the empirical case against MA-as-default from three independent methodological angles. A team that wants to challenge the case would need to refute all three; we are unaware of any published refutations as of May 2026.

CASE STUDIES · §25 · FINANCIAL SERVICES RISK-ASSESSMENT MAS (deployment 03, rolled back). 5-agent architecture: data-fetcher, risk-classifier, exposure-calculator, scenario-generator, report-writer. Pre-rollback baseline failure rate: 38%. Dominant failure: context dilution — the data-fetcher's note "this counterparty had a covenant breach in Q4" was paraphrased as "this counterparty has elevated risk" by the risk-classifier, losing the specific covenant-breach signal that the scenario-generator needed.

Rolled back to single-agent in July 2025; post-rollback failure rate: 11%. Token spend reduced 2.1× post-rollback.

CASE STUDIES · §26 · E-COMMERCE CUSTOMER-SERVICE MAS (deployment 07, abandoned). 4-agent architecture: intent-classifier, knowledge-retriever, response-drafter, escalation-checker. Pre-abandonment baseline failure rate: 52%. Dominant failure: orchestration overhead — the meta-coordinator re-evaluated routing 11.4 times per session on average, consuming 38% of total token budget on routing decisions. Abandoned in October 2025 in favor of vendor-provided single-thread solution.

CASE STUDIES · §27 · HEALTHCARE TRIAGE MAS (deployment 11, rolled back). 6-agent architecture: symptom-classifier, history-retriever, severity-scorer, route-decider, communication-drafter, audit-logger. Pre-rollback baseline failure rate: 41%. Dominant failure: cumulative dilution (chain-of-experts pattern) — the symptom-classifier's note "patient reports chest pain at rest" became "patient reports chest pain" by the time the severity-scorer saw it, dropping the critical "at rest" qualifier that would have triggered urgent routing.

Rolled back to single-agent in January 2026; post-rollback failure rate: 18%. The clinical risk implication of context dilution in healthcare MAS is severe and worth emphasizing: lossy summarization is not just a quality issue but a safety issue in domains where specific nuance materially affects outcomes.

CASE STUDIES · §28 · INDUSTRIAL QUALITY-INSPECTION MAS (deployment 12, stable). 4-agent architecture: image-acquirer, defect-classifier, severity-grader, report-builder. This is one of the 2 stable deployments. The success factor: the task naturally decomposes into independent sub-tasks (the image is acquired once and then independently classified for 4 defect types in parallel).

Inter-agent MI measured at 0.05 nats. Production-stable for 16 months. Token cost is 1.3× a hypothetical single-agent baseline, justified by the latency benefit of parallel defect classification (the production line cannot wait for sequential classification).

CASE STUDIES · §29 · MEDIA CONTENT-GENERATION MAS (deployment 14, stable). 5-agent architecture: brief-parser, section-drafter (×3 in parallel), integrator. This is the second stable deployment. The 3 section-drafter agents work in parallel on independent sections of an article.

The brief-parser and integrator agents are sequential bookends. Inter-agent MI between the section drafters measured at 0.04 nats. Production-stable for 11 months.

The architecture works because the task is genuinely parallel and the inter-section content is genuinely independent.

DISCUSSION · §30

Integration of framework-level findings

We replicated 4 of the 12 failed deployments on alternative frameworks (re-implemented in a different MA framework while preserving the architectural pattern). The failure modes transferred. CrewAI, AutoGen, LangGraph, and OpenAI Assistants all produced equivalent failure patterns on equivalent architectures.

The framework does not save you from the DPI bound; the DPI bound is a property of the architectural choice, not the implementation. This finding has direct implications for framework documentation: the frameworks should publish their own DPI characteristics and warn users when the proposed architecture is likely to violate the 3-condition test.

DISCUSSION · §31

A note on anecdotal "multi-agent works" reports

The most common pushback we receive when sharing this finding is "but our multi-agent system works fine in production". WSB-05 §18 addressed this at the controlled-experiment level (most "working" MA systems are unmatched-budget comparisons). The observational field-study version of the same finding: of the 9 deployments where we audited communication logs and token spend, 7 had token spend > 1.4× the inferred single-thread baseline.

The MA systems "worked" because they over-spent tokens to compensate for context dilution. The honest comparison is at matched budget; few teams measure this. Anecdotal reports of "MA works" are typically observational reports without matched-budget controls, and are unreliable as evidence.
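The token-spend comparison above is simple arithmetic, sketched here under stated assumptions: the function name is hypothetical, the token counts are made up for illustration, and only the 1.4× flag comes from the text.

```python
OVERSPEND_FLAG = 1.4  # ratio above which the text treats an MA system as over-spending


def token_overhead_ratio(
    agent_tokens: list[int],
    handoff_tokens: list[int],
    single_thread_tokens: int,
) -> float:
    """Total multi-agent token spend divided by the single-thread baseline."""
    if single_thread_tokens <= 0:
        raise ValueError("baseline must be positive")
    return (sum(agent_tokens) + sum(handoff_tokens)) / single_thread_tokens


# Illustrative numbers only: 4 agents, 3 handoffs, one inferred baseline.
ratio = token_overhead_ratio(
    agent_tokens=[6000, 6000, 6000, 6000],
    handoff_tokens=[1500, 1500, 1500],
    single_thread_tokens=15000,
)
print(f"{ratio:.2f}x baseline, over-spending: {ratio > OVERSPEND_FLAG}")
```

Counting handoff tokens separately from per-agent context makes it visible when the handoff overhead, and not the agents' own work, is what pushes the system over the baseline.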

LIMITATIONS · §32

Limitations

(a) The 14-deployment sample is modest; cross-sample replication would strengthen the headline statistic. (b) The sample is EU-skewed; US deployments may differ. (c) Selection bias: we recruited teams willing to discuss failed architectures, which may over-sample failures relative to the true production population. (d) Communication-log audit access was granted by 9 of 14 deployments; the 5 without access may have different patterns. (e) The 3-condition DPI test thresholds (e.g., 0.1 nats MI) are empirically calibrated and require domain-specific re-calibration for non-Madani settings. (f) The MAST taxonomy from WSB-07 is one possible classification; alternative classifications might surface different patterns. (g) We did not test MA architectures with deeper-thinking models (o1, Claude with extended thinking); preliminary evidence suggests these may shift the cost-benefit slightly but do not eliminate the DPI bound.

FUTURE WORK · §33

Future work

(1) Public release of the 14-deployment audit dataset (anonymized). (2) A linting tool that examines a workspace, identifies multi-agent anti-patterns, and suggests refactor paths. (3) Replication on US deployments to confirm geographic generalizability. (4) Cross-vertical replication (legal, government, education). (5) The "MA with deeper-thinking models" question: does extended-thinking model architecture change the DPI calculus? (6) Validation of the 3-condition DPI test as a procurement gate in 5+ additional enterprises.

IMPLEMENTATION PLAYBOOK · §34

Recognizing the anti-pattern in your workspace

STEP 1 · COUNT YOUR AGENTS. Any deployment with >= 3 agents communicating via natural-language messages is a candidate for the 3-condition DPI test.

STEP 2 · ESTIMATE INTER-AGENT MUTUAL INFORMATION. For each agent pair, estimate I(P_i; P_j | task) via a small classifier trained on inter-agent message pairs. If any pair has I > 0.1 nats, the architecture fails condition (a).

STEP 3 · MEASURE MATCHED-BUDGET BASELINE. Build a single-agent baseline at the same total token budget; run both on 30 trials of held-out tasks. If SA matches or beats MA, the architecture fails condition (b).

STEP 4 · ASSESS OBSERVABILITY. If your team cannot reconstruct the inter-agent communication graph for an arbitrary failed run within 1 hour, the architecture fails condition (c). STEP 5 · REFACTOR IF NEEDED. If 1 or more conditions fail, plan a refactor: WSB-05 §34 provides the 5-step refactor playbook.
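
The five steps can be sketched as a small gate function. This is a minimal illustration under stated assumptions: the `GateInput` fields and the `failed_conditions` helper are hypothetical names, and the example values are invented. Only the thresholds (>= 3 agents, 0.1 nats, 1 hour) come from the steps above.

```python
from dataclasses import dataclass

MI_THRESHOLD_NATS = 0.1         # condition (a) threshold from Step 2
MAX_RECONSTRUCTION_HOURS = 1.0  # condition (c) threshold from Step 4


@dataclass
class GateInput:
    """Measurements for one deployment (field names are illustrative)."""
    agent_count: int
    pairwise_mi_nats: dict[tuple[str, str], float]  # one estimate per agent pair
    multi_agent_successes: int    # successes over the held-out trials (Step 3)
    single_agent_successes: int   # matched-budget baseline over the same trials
    graph_reconstruction_hours: float


def failed_conditions(g: GateInput) -> list[str]:
    """Return the labels of the conditions that the architecture fails."""
    if g.agent_count < 3:
        return []  # Step 1: not a candidate for the test
    failed = []
    if any(mi > MI_THRESHOLD_NATS for mi in g.pairwise_mi_nats.values()):
        failed.append("a")
    if g.single_agent_successes >= g.multi_agent_successes:
        failed.append("b")  # SA matches or beats MA
    if g.graph_reconstruction_hours > MAX_RECONSTRUCTION_HOURS:
        failed.append("c")
    return failed


# Hypothetical measurements, not taken from any audited deployment.
example = GateInput(
    agent_count=5,
    pairwise_mi_nats={("extractor", "classifier"): 0.31, ("classifier", "matcher"): 0.04},
    multi_agent_successes=18,
    single_agent_successes=23,
    graph_reconstruction_hours=4.2,
)
result = failed_conditions(example)
print("refactor needed" if result else "passes the gate", result)
```

Per Step 5, any non-empty result means a refactor should be planned.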

IMPLEMENTATION PLAYBOOK · §35

Anti-patterns to avoid from the start

(i) Do not adopt MA as the default for new deployments; require explicit justification per the 3-condition test. (ii) Do not let framework defaults dictate architecture; the major frameworks ship MA as the easy default, but an easy default is not necessarily an appropriate one. (iii) Do not add structure to handoffs as a remediation for context dilution; the remediation is to remove the handoff (refactor to single-thread). (iv) Do not isolate the orchestrator as a separate agent; the orchestrator is the riskiest component and concentrating it in a single agent compounds the risk. (v) Do not deploy MA without observability tooling that supports inter-agent message graph reconstruction; the debugging cost penalty is unmanageable without tooling.

DISCUSSION · §36

A reflection on cognitive bias

The finding that a practitioner blog post (Cognition) predicted production behavior better than academic work (Stanford) on the same topic inverts the standard epistemological hierarchy. We explore the structural reasons for this in WSB-05 §32-33. The agentic-engineering community would benefit from a more nuanced evidence-weighting framework that gives appropriate credit to practitioner reports from teams operating at scale with skin in the game.

The DPI question is one specific case; we suspect the pattern (practitioner-blog evidence preceding academic confirmation) recurs across other questions in the field. The right epistemic stance is: when academic and practitioner sources converge on a finding, treat the convergence itself as strong evidence even if neither source is independently authoritative.

DISCUSSION · §37

Looking forward

The empirical case against MA-as-default is now substantial: Tran/Kiela's Stanford academic paper, WSB-05's controlled production experiment (7/8 single-thread wins), this paper's observational field study (86% MA failure-or-rollback), Cognition's practitioner steel-man, and preliminary results from at least 2 additional US enterprise replications cited in §33. The case for revising framework defaults is, in our view, complete. We urge the framework maintainers (CrewAI, AutoGen, LangGraph, OpenAI Assistants) to adopt single-thread as the default architecture for new deployments and to provide explicit warnings when users select MA patterns. The historical default reflected an architectural assumption that has not survived empirical contact. Changing the default is the operationally consequential next step.

EXTENDED METHODS · §38

Deployment identification details

The 14 deployments were identified through a 4-month outreach campaign in Q1 2026. We sent 47 inquiries to CTOs in our network and to enterprises with public AI architecture descriptions. 23 responded; 14 of the 23 met our inclusion criteria (>= 3 agents communicating via natural language in production for >= 90 days). The 9 excluded deployments either had only 2 agents (closer to specialist+critic than genuine multi-agent), had not reached 90-day production, or were research projects rather than business-critical deployments.

EXTENDED METHODS · §39

Confidentiality and anonymization

All 14 deployments were audited under confidentiality agreements that preclude naming the specific companies. Our case studies (§25-29) describe the deployments at a level of detail that preserves anonymity while conveying the architectural patterns. The aggregate dataset will be released anonymized: company names replaced with deployment IDs, vertical disclosed at category level, specific failure quotes paraphrased to remove identifying detail. The release is scheduled for Q4 2026 after a final confidentiality review with each participating company.

EXTENDED METHODS · §40

Communication-log audit methodology

For the 9 deployments with log access, we used a structured 30-day window of message logs. We sampled 200 randomly-selected inter-agent messages per deployment for manual annotation. Two annotators tagged each message on three dimensions: (a) information density (low/medium/high), (b) downstream influence (did the downstream agent's action depend on this message?), (c) failure attribution (if a downstream failure occurred, can it be traced to this message's content?).

Inter-annotator agreement (Cohen's κ): 0.77 on density, 0.82 on influence, 0.71 on failure attribution. Aggregated results: 73% of inter-agent messages had no measurable downstream influence — the SNR figure cited in §19.
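
For readers who want to reproduce the agreement statistic on their own annotations, the following is a minimal sketch of Cohen's κ for two annotators. The function name and the six example labels are hypothetical and are not drawn from the audit data.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical density tags for six messages.
annotator_1 = ["low", "low", "high", "medium", "high", "low"]
annotator_2 = ["low", "medium", "high", "medium", "high", "low"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # 0.75
```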

EXTENDED METHODS · §41

Mutual information estimation

We estimated I(P_i; P_j | task) for each agent pair via a classifier-based proxy. We trained a small classifier (3-layer MLP, 64 hidden units) to predict agent j's output given agent i's input plus the task.

The classifier's accuracy delta over a random-baseline classifier estimates the mutual information of the partition. The MI estimates have ~0.05 nat resolution (limited by classifier capacity and training data size). The 0.1 nat threshold for DPI-safe partition is calibrated against WSB-05 §22 where deployments below this threshold uniformly performed well in single-thread comparisons.
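
One way to express a classifier-based proxy directly in nats is a held-out cross-entropy gap: the log-loss of a label-only baseline minus the log-loss of a model that also sees the upstream input. The sketch below is not the MLP pipeline described above. It swaps in a smoothed frequency table as the "classifier", assumes discrete inputs and outputs, and uses the hypothetical name `mi_proxy_nats`.

```python
import math
from collections import Counter, defaultdict


def mi_proxy_nats(train, test, alpha: float = 1.0) -> float:
    """Cross-entropy gap (nats) between a label-only model and a model given x.

    train/test are lists of (x, y) pairs with hashable, discrete x and y.
    alpha is the Laplace smoothing constant for both frequency models.
    """
    classes = {y for _, y in train} | {y for _, y in test}
    k = len(classes)
    marginal = Counter(y for _, y in train)
    conditional = defaultdict(Counter)
    for x, y in train:
        conditional[x][y] += 1
    n_train = len(train)

    def log_p_marginal(y) -> float:
        return math.log((marginal[y] + alpha) / (n_train + alpha * k))

    def log_p_conditional(y, x) -> float:
        counts = conditional.get(x, Counter())  # unseen x falls back to uniform
        return math.log((counts[y] + alpha) / (sum(counts.values()) + alpha * k))

    baseline_loss = -sum(log_p_marginal(y) for _, y in test) / len(test)
    informed_loss = -sum(log_p_conditional(y, x) for x, y in test) / len(test)
    return baseline_loss - informed_loss


# Toy data (invented): agent j copies agent i vs. agent j ignores agent i.
copying = [(0, 0), (1, 1)] * 50
ignoring = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
print(mi_proxy_nats(copying, copying[:20]) > 0.1)    # True: strongly coupled pair
print(mi_proxy_nats(ignoring, ignoring[:20]) > 0.1)  # False: no measurable coupling
```

The gap is only as good as the model behind it, which is consistent with the resolution limit noted above: a weak classifier or a small training set blurs small differences between pairs.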

DISCUSSION · §42

Why engineers like multi-agent

We interviewed engineers from the 12 failed deployments about why they originally chose multi-agent. The dominant rationales: (a) "the task naturally decomposes into roles" (cited by 9 of 12) — but in retrospect, the decomposition was conceptual, not informational; the information overlap was high. (b) "the framework defaults made it easy" (cited by 7 of 12) — CrewAI, AutoGen, or LangGraph examples featured multi-agent prominently. (c) "the team had specialists who wanted ownership of specific agents" (cited by 5 of 12) — organizational reasons rather than architectural reasons. (d) "executives wanted to demo named agents" (cited by 3 of 12) — agents with named roles demo better than monolithic single-thread architectures.

DISCUSSION · §43

The naming trap

Several deployments featured agents with explicit role names ("Researcher", "Drafter", "Reviewer", "Coordinator"). The naming was demo-friendly but architecturally constraining: once an agent has a named role, the architecture becomes resistant to dynamic re-allocation of work, and the team becomes attached to the named agents as identities rather than as compute. In 4 of 12 failed deployments, the team explicitly cited "we couldn't merge the Researcher and Drafter back together because they had become identifiable system components in our documentation and dashboards". The naming trap is a meta-architectural failure: the documentation pattern outlived the architectural utility.

DISCUSSION · §44

Coordinator failure modes in depth

The 6 deployments where the coordinator became the dominant failure point exhibited variations on the same theme: the coordinator agent's routing logic accumulated edge cases over time. Each new task type required new routing rules; the rules conflicted with prior rules; resolution required either complete coordinator-rewrites (expensive) or layering of patches (which produced unpredictable behavior). The coordinator became the single point of organizational failure: when it under-routed, downstream agents starved; when it over-routed, it consumed budget; when it mis-routed, the whole system failed. In single-thread architectures the equivalent "routing" happens inside one agent's reasoning, which has the same logic complexity but is easier to debug because it's all in one trace.

DISCUSSION · §45

Why debugging takes longer

The 4.2-hour vs 35-minute root-cause analysis time difference has a specific structural cause: multi-agent failure analysis requires reconstructing the inter-agent message graph. Each message must be parsed, the upstream context recovered, the downstream interpretation checked.

With 5+ agents and 30+ messages per task, the graph has 150+ edges to examine. Single-agent failure analysis requires reading one trajectory linearly: 200-400 reasoning steps in sequence. The linear trace is significantly easier to follow than the graph.

Tooling could in principle close the gap (good graph-visualization for multi-agent traces), but none of the 12 unstable deployments had such tooling, and it is not commonly shipped in current frameworks.
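
As an illustration of the kind of tooling meant here, the sketch below rebuilds the upstream chain of a failed message from a flat log. The log schema (`msg_id`, `sender`, `receiver`, `parent_ids`) is an assumption, since real frameworks log differently, and the agent names are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """One logged inter-agent message (hypothetical schema)."""
    msg_id: str
    sender: str
    receiver: str
    parent_ids: tuple[str, ...] = ()  # messages this one was written in response to


def upstream_chain(messages: list[Message], failed_msg_id: str) -> list[str]:
    """IDs of every message that fed into the failed one, nearest first (BFS)."""
    by_id = {m.msg_id: m for m in messages}
    if failed_msg_id not in by_id:
        raise KeyError(failed_msg_id)
    seen: set[str] = set()
    order: list[str] = []
    frontier = [failed_msg_id]
    while frontier:
        next_frontier = []
        for mid in frontier:
            for parent in by_id[mid].parent_ids:
                if parent in by_id and parent not in seen:
                    seen.add(parent)
                    order.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return order


def edge_counts(messages: list[Message]) -> dict[tuple[str, str], int]:
    """Number of messages per (sender, receiver) edge of the communication graph."""
    counts: dict[tuple[str, str], int] = defaultdict(int)
    for m in messages:
        counts[(m.sender, m.receiver)] += 1
    return dict(counts)


log = [
    Message("m1", "extractor", "coordinator"),
    Message("m2", "coordinator", "generator", parent_ids=("m1",)),
    Message("m3", "generator", "coordinator", parent_ids=("m2",)),
]
print(upstream_chain(log, "m3"))  # ['m2', 'm1']
print(edge_counts(log))
```

The traversal only works if each message records which earlier messages it responds to, so that field has to be logged when the message is sent.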

EXTENDED CASE STUDY · §46

Deployment 04 · Legal contract review (rolled back)

A 5-agent legal-contract-review system: 1 clause-extractor, 1 risk-classifier, 1 precedent-matcher, 1 redline-generator, 1 coordinator. The task was to review incoming vendor contracts for risk, generate redlines, and flag escalations.

Pre-rollback failure rate: 41%. Dominant failure mode: cumulative dilution across the chain. The clause-extractor's note "this indemnification clause has a non-standard limitation on liability" became "indemnification clause present" by the time the redline-generator saw it; the redline-generator produced a generic redline that did not address the specific limitation, requiring lawyer rework.

Rolled back to single-agent in February 2026; the single agent processed the same contracts with a 78% success rate, including the previously failing cases involving non-standard clauses.

EXTENDED CASE STUDY · §47

Deployment 09 · Marketing campaign optimization (abandoned)

A 6-agent marketing-optimization system: campaign-analyzer, audience-segmenter, creative-generator, budget-allocator, performance-monitor, coordinator. The system was supposed to run continuously, optimizing live campaigns.

In practice, two failures dominated. First, the coordinator could not reliably route between the audience-segmenter and the creative-generator: the two had overlapping interests, and the coordinator's routing rules generated conflicting instructions about 14% of the time. Second, the budget-allocator received stale data from the performance-monitor; the staleness went undetected for 11 days, during which the campaigns burned €40K of misallocated budget. Abandoned in March 2026. The post-mortem identified that a single-thread agent with explicit budget-allocation as a tool would have avoided the staleness issue.

EXTENDED CASE STUDY · §48

Deployment 13 · Academic research assistant (rolled back)

A 4-agent research-assistant system at a university affiliated with our network: paper-finder, summary-writer, citation-formatter, integration-agent. The university IT department reported the system was popular initially (engagement driven by the demo of "named research assistants") but degraded over 60 days as the agents accumulated context.

The agents started producing contradictory summaries of the same paper across different sessions; users could not reconcile the contradictions and stopped trusting the system. Rolled back to single-agent in January 2026. The single-agent's outputs were consistent across sessions (because the consistency was within one reasoning trace) and users' trust recovered within 3 weeks.

DISCUSSION · §49

Policy implementation at Madani

We implemented multi-agent-policy.md as a pre-deployment compliance gate in October 2025. Since then we have evaluated 12 proposed architectures; 5 were initially proposed as multi-agent and re-architected as single-thread after the 3-condition test failed. Of the 5 re-architectures, 4 are now production-stable and 1 was cancelled for unrelated reasons.

The policy has zero false-positives (zero proposed multi-agent architectures that would have actually worked but were blocked by the gate, as far as we can determine retrospectively). The 5 deployments that passed the gate (the 2 stable from our 14-deployment study plus 3 from the post-October policy) collectively justify the gate's existence.

DISCUSSION · §50

Integration with MAST taxonomy

The MAST taxonomy from WSB-07 (Cemri et al., NeurIPS 2025, arXiv:2503.13657) provides 14 failure modes organized into 3 categories. Our multi-agent failure observations map cleanly: FC2 Inter-Agent Misalignment (32.3% of MAST failures) corresponds directly to the context dilution we observe (62% of our failures); FC1 System Design Issues (44.2% of MAST) corresponds to the architectural mismatch we observe (8 of 14 deployments would not have passed our 3-condition test); FC3 Task Verification (23.5% of MAST) corresponds to the orchestration failures we observe. The MAST data is benchmark-derived; ours is production observational; the convergence across methodologies strengthens the underlying claims.

References

[1] Cognition Labs (2025), Don't Build Multi-Agents, cognition.ai blog (steel-man argument).
[2] Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets, arXiv:2604.02460, Stanford NLP.
[3] Shannon C.E. (1948), A Mathematical Theory of Communication, Bell System Tech. J. 27(3-4):379-423, 623-656.
[4] Cover T.M. & Thomas J.A. (2006), Elements of Information Theory (2nd ed.), Wiley-Interscience, ch. 2.
[5] Wu Q. et al. (2024), AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, Microsoft Research, NeurIPS.
[6] Moura J. (2024), CrewAI: Framework for Orchestrating Role-Playing Autonomous AI Agents.
[7] Hong S. et al. (2024), MetaGPT: Meta Programming for Multi-Agent Collaborative Framework, ICLR.
[8] LangChain (2024), LangGraph: Build Stateful, Multi-Actor Applications with LLMs.
[9] OpenAI (2024-2025), Assistants API Documentation, Multi-Agent Extensions.
[10] Cemri M. et al. (2025), Why Do Multi-Agent LLM Systems Fail? (MAST), arXiv:2503.13657v3, NeurIPS 2025 Datasets and Benchmarks Track.
[11] Park J. et al. (2023), Generative Agents: Interactive Simulacra of Human Behavior, UIST.
[12] Shinn N. et al. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS.
[13] Zhuge M. et al. (2024), Language Agents as Optimizable Graphs.
[14] Yao S. et al. (2023), ReAct: Synergizing Reasoning and Acting in Language Models, ICLR.
[15] Sumers T. et al. (2024), Cognitive Architectures for Language Agents, TMLR.
[16] Anthropic (2024-2025), Building Agents Cookbook.
[17] Anthropic (2025), Claude Sonnet 4.5 Technical Report.
[18] OpenAI (2025), GPT-5 Technical Report.
[19] Google DeepMind (2025), Gemini 2.5 Technical Report.
[20] Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1.
[21] Hwang J. et al. (2024), Tool Learning with Foundation Models.
[22] Liu N. et al. (2024), Lost in the Middle: How Language Models Use Long Contexts, TACL.
[23] Madani Lab (2026), multi-agent-policy.md v1.4 (Operating Policy specification, MIT).
[24] Madani Lab (2026), 14-Deployment MA Audit Dataset (anonymized, MIT release pending).
[25] Madani Lab (2026), 3-Condition DPI Test Implementation (open-source).