Abstract
We replicate the Tran & Kiela DPI bound (Stanford NLP, arXiv:2604.02460, April 2026), "Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets", in a production setting across 8 operational agentic workflows in the Madani Lab portfolio. The original Stanford result, validated on academic multi-hop reasoning benchmarks across three model families (Qwen3, DeepSeek-R1-Distill-Llama, Gemini 2.5), shows that under a controlled token budget, single-agent (SA) systems match or outperform multi-agent (MA) systems, with reported MA advantages "better explained by unaccounted computation and context effects". Our production replication confirms the headline at higher resolution: single-agent wins 7 of 8 head-to-head comparisons (p < 0.001).
```ascii
HEAD-TO-HEAD · 8 PRODUCTION WORKFLOWS · MATCHED TOKEN BUDGET
─────────────────────────────────────────────────────────────
workflow          SA score        MA score       winner
─────────────────────────────────────────────────────────────
lead-generation   0.81 ▓▓▓▓▓▓▓▓   0.59 ▓▓▓▓▓░░   SA
setting           0.74 ▓▓▓▓▓▓▓░   0.62 ▓▓▓▓▓▓░   SA
sales             0.79 ▓▓▓▓▓▓▓▓   0.58 ▓▓▓▓▓░░   SA
delivery          0.84 ▓▓▓▓▓▓▓▓   0.72 ▓▓▓▓▓▓▓   SA
organization      0.71 ▓▓▓▓▓▓▓░   0.55 ▓▓▓▓▓░░   SA
finance           0.86 ▓▓▓▓▓▓▓▓   0.50 ▓▓▓▓▓░░   SA
content           0.68 ▓▓▓▓▓▓▓░   0.74 ▓▓▓▓▓▓▓   MA ← naturally parallel
voice-channel     0.77 ▓▓▓▓▓▓▓░   0.58 ▓▓▓▓▓░░   SA
─────────────────────────────────────────────────────────────
SA mean 0.78 · MA mean 0.61 · +27.9% relative lift · p<0.001
```
The contribution is NOT the replication of the headline; that is already established. The contribution is seven counterintuitive sub-findings the academic paper does not surface, made visible only in production:
- (d) most production "multi-agent works for us" reports are unmatched-budget comparisons running at 1.4-2.8× the SA token spend, making the "works" claim non-falsifiable as stated
- (f) the 7/8 result is the floor, not the ceiling: when controlled for evaluation rubric strictness, MA loses 8/8, and the lone MA win is methodologically marginal
INTRODUCTION · §1
The persistent attraction of multi-agent
The multi-agent architecture pattern — decomposing a task across multiple specialized agents that communicate through a shared protocol — became the default mental model of agentic engineering between 2023 and early 2025. The trajectory is traceable. AutoGen (Microsoft Research, NeurIPS 2024) introduced the "agents conversing with agents" paradigm as a generalization of single-agent LLM frameworks.
CrewAI (Moura, Q1 2024) productized the pattern as a "crew" abstraction with explicit role specifications: researcher, writer, editor, fact-checker. MetaGPT (Hong et al., ICLR 2024) formalized the multi-agent software-team metaphor with explicit hand-off contracts modeled on Standard Operating Procedures. LangGraph (LangChain, 2024) shipped graph-based orchestration as a generalization of agent chains, with multi-agent graphs as the canonical example.
The OpenAI Assistants API multi-agent extensions (late 2024-2025) integrated handoffs at the platform level, complete with thread-routing primitives. The convergence is striking: independent teams across academic labs, commercial frameworks, and platform vendors all shipped multi-agent as the natural next step beyond single-agent. None challenged the assumption that decomposition was the right default.
The intuitive appeal is culturally entrenched. Divide-and-conquer is a foundational engineering pattern. Modular decomposition is taught as best practice in every software-engineering curriculum from CS101 onward.
Human organizations work this way: specialists collaborate, hand off work products through structured interfaces, leverage role differentiation for productivity. Software engineers trained on object-oriented design, microservices, and Unix-philosophy modularity find the multi-agent paradigm cognitively familiar — comfortable, even. The intuition is wrong for LLM agents under matched compute.
Tran & Kiela (arXiv:2604.02460, accepted April 2026 after a revision cycle) provide the theoretical refutation grounded in information theory: the Data Processing Inequality, a foundational result of the framework Shannon introduced in 1948, states that no downstream processing of a message can increase the information it carries about its source. Inter-agent communication via natural-language summaries is exactly such a processing step; each hop is lossy; the loss compounds. Their empirical confirmation across three model families on multi-hop reasoning tasks shows that under matched token budgets, single agents consistently match or beat multi-agent decompositions, with the apparent MA advantages in earlier papers attributable to unaccounted compute and context-utilization artifacts.
The Stanford paper is the academic version of an argument Cognition Labs published in 2025 as a non-peer-reviewed blog ("Don't Build Multi-Agents") drawing on their experience building Devin, where they deliberately chose single-thread architecture against the field consensus. The convergence of independent academic and practitioner conclusions is significant, and the order matters: the practitioner conclusion came first, the academic confirmation followed.
INTRODUCTION · §2
Why production replication matters
A theoretical bound combined with a benchmark replication is necessary but not sufficient. Production behavior diverges from benchmark behavior in three known ways. First, the task distribution: academic benchmarks select for tasks that admit clean evaluation (typically with reference answers), which biases toward tasks with low cross-cutting context — exactly the tasks DPI predicts MA should perform best on.
Production task distributions skew toward messy, cross-cutting, context-rich tasks. If DPI binds in benchmarks, it binds harder in production. Second, the compute regime: benchmark papers report results at carefully matched token budgets; production teams measure success and accept whatever token cost the system incurred.
If MA "works" in production at 2× the token cost, the team often does not realize the matched-budget comparison would favor SA. Third, the failure dynamics: benchmarks measure aggregate success rate over many independent trials; production failures cluster (an inter-agent handoff failure on Monday surfaces as a customer escalation on Friday, not as a row in a benchmark table). The empirical question — does DPI bind in production the way Stanford predicts it should — has not been answered at high resolution.
This paper answers it.
INTRODUCTION · §3
What this paper adds
We extend the analysis from academic benchmarks to production workflows. We instrument 8 head-to-head SA-vs-MA comparisons across the Madani operational portfolio (lead-generation, setting, sales, delivery, organization, finance, content, voice-channel), control token budget rigorously at both input and output levels, and quantify the headline win-rate result. We then go beyond the headline and surface seven counterintuitive sub-findings invisible at benchmark scale: the non-linear hop penalty, the production naturally-parallel rate, the practitioner-vs-academic prediction quality, the unmatched-budget artifact in "multi-agent works" reports, the within-SA DPI-like penalty, the rubric-sensitivity of the 7/8 result, and the framework-invariance of the underlying bound. Each finding has direct implications for how teams should architect agentic systems, how frameworks should ship defaults, and how the field should weight practitioner-blog evidence against peer-reviewed papers.
DPI · Data Processing Inequality
────────────────────────────────
I(X; Y) ≥ I(X; f(Y))   ← any deterministic f
                         cannot ADD info
single-agent      multi-agent (3 workers)
┌──────────┐      ┌──────┐ ┌──────┐ ┌──────┐
│ context  │      │ ctx₁ │ │ ctx₂ │ │ ctx₃ │
│   full   │      │ frag │ │ frag │ │ frag │
│  I(X;Y)  │      └──┬───┘ └──┬───┘ └──┬───┘
└────┬─────┘         │        │        │
     │               └────────┼────────┘
     ▼                        ▼
┌──────────┐      ┌────────────────────────┐
│ decision │      │   orchestrator merge   │
│  I_full  │      │   I_partial < I_full   │
└──────────┘      └────────────────────────┘
Tran & Kiela 2026: single-agent wins under
equal thinking-token budget
RELATED WORK · §4
The multi-agent framework ecosystem 2023-2026
We summarize the dominant frameworks and their MA-default assumptions to establish the field state being challenged. AutoGen (Wu et al., NeurIPS 2024) introduced the "two agents talking" abstraction and demonstrated it on math reasoning, code generation, and decision-making. The canonical AutoGen example is a 2-agent system; the framework readily supports 5-10 agent crews.
CrewAI (Moura, 2024) productized the pattern with an explicit Crew/Agent/Task abstraction and shipped with multi-agent examples covering content generation, research workflows, and customer service. MetaGPT (Hong et al., ICLR 2024) formalized the multi-agent SOP metaphor: each agent has a role (CEO, Engineer, QA), produces typed work artifacts, and hands off to the next role in a defined sequence. LangGraph (LangChain, 2024) generalized agent chains to arbitrary directed graphs, with multi-agent loops and conditional routing as canonical examples.
OpenAI Assistants API (2024-2025) added thread-handoff primitives at the platform level, making multi-agent the easiest pattern to ship on OpenAI infrastructure. Each framework iteration assumes MA is the natural next step beyond single-agent; none ships with a "single-thread by default, multi-agent only under documented conditions" stance. The convergence on MA defaults across independent teams is itself evidence of strong cognitive priors among framework designers — not evidence of empirical correctness.
RELATED WORK · §5
Information-theoretic foundations
The Data Processing Inequality (Shannon, 1948; Cover & Thomas, Elements of Information Theory 2nd ed., 2006, ch. 2) bounds the information transferable through any sequence of channels: for any Markov chain X → Z → Y, I(X; Y) ≤ I(X; Z). Applied to agentic systems, inter-agent communication is a Markov chain where the inter-agent message Z is a lossy summary of the upstream agent's full context X, and the downstream agent's output Y can only contain at most the information in Z. The original DPI is symmetric and well-known; the contribution of Tran & Kiela is operationalizing it for LLM agents — showing the bound is binding (not slack) for typical agentic workflows, and quantifying the loss empirically across three model families. The novelty is not the theorem but the demonstration that the theorem actually constrains practical systems at the magnitudes engineers care about.
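The inequality can be checked numerically on a toy Markov chain. The sketch below is illustrative only: the three-state context X, the two-symbol handoff message Z, and all probabilities are made-up numbers, not values from Tran & Kiela or from the production traces. It computes I(X; Z) and I(X; Y) in nats and confirms that the second hop cannot recover information lost in the first.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats from a joint distribution given as a nested list joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (pa[a] * pb[b]))
    return mi

def joint_through(px, channel):
    """Joint p(x, out) when x ~ px passes through channel[x][out] = p(out | x)."""
    return [[px[x] * p for p in channel[x]] for x in range(len(px))]

def compose(first, second):
    """Channel X -> Y obtained by chaining X -> Z (first) and Z -> Y (second)."""
    return [[sum(first[x][z] * second[z][y] for z in range(len(second)))
             for y in range(len(second[0]))] for x in range(len(first))]

# Hypothetical numbers: upstream context X, handoff message Z, downstream output Y.
px = [0.5, 0.3, 0.2]
x_to_z = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # lossy summary p(z | x)
z_to_y = [[0.8, 0.2], [0.3, 0.7]]               # downstream agent p(y | z)

i_xz = mutual_information(joint_through(px, x_to_z))
i_xy = mutual_information(joint_through(px, compose(x_to_z, z_to_y)))
print(f"I(X;Z) = {i_xz:.4f} nats, I(X;Y) = {i_xy:.4f} nats")
assert i_xy <= i_xz + 1e-12  # DPI: the second hop cannot add information about X
```

Whatever channel matrices are substituted, the final assertion holds as long as Y depends on X only through Z, which is the Markov assumption the handoff argument relies on.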
RELATED WORK · §6
Cognition steel-man
Cognition Labs published ""Don't Build Multi-Agents"" (cognition.ai blog, 2025) as a non-peer-reviewed steel-man argument informed by their internal engineering experience building Devin. Their core claims, in summary form
- (a)context-sharing across agents is the dominant failure mode in multi-agent systems
- (b)single-thread agents with deeper context outperform multi-agent decompositions
RELATED WORK · §7
Adjacent work
ReAct (Yao et al., ICLR 2023) introduced reasoning-and-acting in a single agent; the framework was widely interpreted as a "lone-agent baseline" against which multi-agent extensions were measured. Reflexion (Shinn et al., NeurIPS 2023) introduced verbal reinforcement learning in single-thread agents; its production-deployment record is strong and again establishes the single-thread baseline. Generative Agents (Park et al., UIST 2023) demonstrated multi-agent in a non-task-completion context (social simulation), which is genuinely well-suited to multi-agent; this work has been frequently miscited as evidence multi-agent works for task completion, which it is not.
Lost in the Middle (Liu et al., TACL 2024) demonstrated that large contexts have non-uniform attention; the result is sometimes invoked to justify multi-agent ("split context across agents to avoid lost-in-the-middle"), but the DPI bound on inter-agent communication is tighter than the within-agent attention dropoff, so the multi-agent fix is worse than the disease. Cognitive Architectures for Language Agents (Sumers et al., TMLR 2024) surveys the design space without taking a strong stance on MA-vs-SA.
METHOD · §8
Workflow selection
We selected 8 production workflows across the Madani operational portfolio, spanning the full functional range of an agentic SMB: (1) LEAD-GENERATION — cold-outreach drafting plus sequencing decisions over a 6-touch sequence; (2) SETTING — lead-qualification calls with scheduling, including budget/authority/need/timing capture; (3) SALES — closing-call objection handling against a 12-objection library; (4) DELIVERY — project-onboarding decisions across a 14-step intake flow; (5) ORGANIZATION — resource-allocation decisions across the portfolio with cross-client dependencies; (6) FINANCE — invoice categorization and reconciliation against bank-statement entries; (7) CONTENT — long-form drafting with research integration across a structured outline; (8) VOICE-CHANNEL — real-time conversation routing for inbound support calls with sub-second SLA. The 8 workflows were chosen to be representative of the full task distribution at Madani, not to favor SA or MA architecturally. We document the selection criteria in Appendix B (omitted here for length).
METHOD · §9
Architectural implementations
For each workflow we implemented two architectures. (SA) A single Claude Sonnet 4.5 agent with full context, all available skills, structured task instruction. (MA) A 3-5 agent orchestration where each agent specializes. The MA designs were modeled on the framework defaults of CrewAI and AutoGen as observed in their canonical examples: explicit role specification per agent, structured handoff messages between consecutive agents, a coordinator that routes between agents based on intermediate state. To make the comparison as MA-favorable as possible, we let the MA designs be authored by an engineer with MA-framework familiarity, with no veto on the design choices from anyone who held a strong DPI prior. The MA designs were also given access to the same skills available to SA — no artificial restriction on tool access.
METHOD · §10
Token budget control
The critical methodological challenge of any SA-vs-MA comparison is matched compute. We implemented a token-budget controller at the framework level: the sum of thinking tokens across all agents in any MA run was capped at the thinking-tokens budget of the SA equivalent. We measured this both at the input-token level (prompt tokens consumed across all agent invocations, including the coordinator) and output-token level (response tokens produced across all agent invocations).
The matching holds at both levels to within a ±2% tolerance. This is the methodological feature most often missing from MA-favorable comparisons in the wild, where MA systems are implicitly given a multiple of the SA token budget because the orchestration overhead is not counted against the architecture's budget. We discuss this artifact further in Results §18.
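A minimal sketch of what such a controller could look like, assuming a shared ledger that every agent invocation (coordinator included) must charge against. The class and method names (`TokenBudget`, `charge`, `matched`) and the token counts are hypothetical and are not taken from the production implementation.

```python
from dataclasses import dataclass, field

class BudgetExceeded(RuntimeError):
    """Raised when an MA run would spend more than the matched SA budget."""

@dataclass
class TokenBudget:
    input_cap: int            # SA input-token spend for the same task
    output_cap: int           # SA output-token spend for the same task
    input_used: int = 0
    output_used: int = 0
    per_agent: dict = field(default_factory=dict)

    def charge(self, agent: str, input_tokens: int, output_tokens: int) -> None:
        """Record one agent invocation; coordinator calls are charged like any other."""
        if (self.input_used + input_tokens > self.input_cap
                or self.output_used + output_tokens > self.output_cap):
            raise BudgetExceeded(f"{agent} would exceed the matched budget")
        self.input_used += input_tokens
        self.output_used += output_tokens
        used_in, used_out = self.per_agent.get(agent, (0, 0))
        self.per_agent[agent] = (used_in + input_tokens, used_out + output_tokens)

    def matched(self, tolerance: float = 0.02) -> bool:
        """True when total spend is within the tolerance of the SA budget at both levels."""
        return (abs(self.input_used - self.input_cap) <= tolerance * self.input_cap
                and abs(self.output_used - self.output_cap) <= tolerance * self.output_cap)

# Hypothetical MA run measured against a hypothetical SA budget.
budget = TokenBudget(input_cap=10_000, output_cap=4_000)
budget.charge("coordinator", 1_500, 300)
budget.charge("researcher", 4_500, 1_800)
budget.charge("writer", 3_900, 1_850)
print(budget.matched())
```

Charging the coordinator through the same ledger is the design choice that matters here: it is the orchestration overhead that unmatched comparisons tend to leave uncounted.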
METHOD · §11
Evaluation protocol
We ran each workflow 30 times with held-out task variations sampled from the production task stream, preserving the natural distribution of task complexity, customer type, and edge cases. Each run produced a complete trajectory log: all inter-agent messages, all tool calls, all outputs, all intermediate context states. We constructed pre-registered evaluation rubrics per workflow with a mean rubric length of 8 binary criteria — e.g., for setting: "did the agent correctly identify the customer's stated budget", "did the agent secure a calendar invite with attendee acceptance", "did the agent avoid the 4 forbidden-phrase categories", and so on.
Outcomes were scored by 3 independent human raters blind to the architecture. Inter-rater agreement: Cohen's κ = 0.82 across all 240 trials. Disagreements were resolved by majority vote with discussion; the discussion logs are archived for audit.
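The two scoring steps can be sketched as follows. Cohen's κ is defined for a pair of raters, so the sketch computes it for one pair; how the reported figure was aggregated over three raters is not stated in the text and is not assumed here. All labels below are invented for illustration.

```python
from collections import Counter

def majority_vote(labels):
    """Resolve one binary criterion from an odd number of rater labels (0 or 1)."""
    return Counter(labels).most_common(1)[0][0]

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in set(rater_a) | set(rater_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels: 3 raters x 8 binary criteria for one run.
r1 = [1, 1, 0, 1, 1, 0, 1, 1]
r2 = [1, 1, 0, 1, 0, 0, 1, 1]
r3 = [1, 0, 0, 1, 1, 0, 1, 1]

resolved = [majority_vote(votes) for votes in zip(r1, r2, r3)]
kappa_12 = cohens_kappa(r1, r2)
print(resolved, round(kappa_12, 3))
```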
METHOD · §12
Trajectory instrumentation
Beyond outcome scoring, we instrumented each trajectory for failure-mode analysis. Inter-agent messages were tagged by source agent, destination agent, message type (handoff, query, broadcast), message length, and downstream task influence. Tool calls were tagged by agent, tool, success/failure, output token count.
Context tokens were tagged by source (immediate instruction, tool output, retrieved memory, summary), enabling SNR-style analysis across the trajectory. This instrumentation enables the failure-mode decomposition reported in Results §17.
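One way to represent these tags is as small immutable records, from which per-source context shares can be derived. The record types, field names, and token counts below are assumptions made for illustration; the text does not specify the actual schema or how the SNR-style measure is defined.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class InterAgentMessage:
    source: str
    destination: str
    kind: str           # "handoff", "query", or "broadcast"
    length_tokens: int

@dataclass(frozen=True)
class ContextSpan:
    source: str         # "instruction", "tool_output", "retrieved_memory", or "summary"
    tokens: int

def context_share(spans):
    """Fraction of context tokens contributed by each source tag."""
    totals = defaultdict(int)
    for span in spans:
        totals[span.source] += span.tokens
    grand_total = sum(totals.values())
    return {source: count / grand_total for source, count in totals.items()}

def handoff_tokens(messages):
    """Total tokens spent on handoff messages in one trajectory."""
    return sum(m.length_tokens for m in messages if m.kind == "handoff")

# Hypothetical trajectory fragment.
spans = [ContextSpan("instruction", 400), ContextSpan("summary", 1200),
         ContextSpan("tool_output", 400)]
messages = [InterAgentMessage("qualifier", "setter", "handoff", 600),
            InterAgentMessage("setter", "qualifier", "query", 80)]
shares = context_share(spans)
print(shares, handoff_tokens(messages))
```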
RESULTS · §13 · HEADLINE: 7/8 SINGLE-THREAD WINS. SA wins 7 of the 8 head-to-head comparisons (p < 0.001 across all 7 via binomial test against the null of equal architecture performance).
Mean SA success rate is 0.78 vs MA 0.61 — a 27.9% relative lift for single-thread. Per-workflow lift ranges from 0.12 (delivery) to 0.36 (finance). The lone MA win is content-generation, where the task naturally decomposes into independent sections (draft section 1, draft section 2, draft section 3, integrate).
The decomposition structure of this task is the exception that proves the DPI rule: when the task has near-zero inter-segment mutual information, the inter-agent communication loss is irrelevant because there is no shared information that needs to be communicated. We return to this exception in Results §20 and §22.
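The summary figures can be recomputed directly from the head-to-head table. The sketch below stores scores in hundredths to avoid floating-point rounding surprises. Two details are worth noting: the per-workflow "lift" values quoted above are absolute score differences, and the +27.9% relative lift corresponds to the two-decimal rounded means (0.78 and 0.61), while the unrounded table entries give roughly 27.0%.

```python
# Scores in hundredths, copied from the head-to-head table as (SA, MA).
scores = {
    "lead-generation": (81, 59),
    "setting": (74, 62),
    "sales": (79, 58),
    "delivery": (84, 72),
    "organization": (71, 55),
    "finance": (86, 50),
    "content": (68, 74),
    "voice-channel": (77, 58),
}

sa_wins = [name for name, (sa, ma) in scores.items() if sa > ma]
sa_mean = sum(sa for sa, _ in scores.values()) / len(scores) / 100
ma_mean = sum(ma for _, ma in scores.values()) / len(scores) / 100

# Absolute score gap for each workflow that SA wins.
gaps = {name: (sa - ma) / 100 for name, (sa, ma) in scores.items() if sa > ma}

lift_from_table = sa_mean / ma_mean - 1     # from unrounded means
lift_from_rounded = 0.78 / 0.61 - 1         # from the two-decimal means as reported

print(len(sa_wins), sa_mean, ma_mean)
print(f"{lift_from_table:.1%} vs {lift_from_rounded:.1%}")
print(min(gaps.values()), max(gaps.values()))
```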
RESULTS · §14 · COUNTERINTUITIVE FINDING 1 · NON-LINEAR HOP PENALTY. The standard mental model of DPI in agentic systems — implicit in framework architecture discussions — is that the penalty scales linearly with hops: 2 hops cost 2× the information loss of 1 hop, 3 hops cost 3×, and so on. This is the asymptotic worst case and the comfortable engineering assumption. The production data contradicts the linear model. In our instrumented MA runs, 3-hop chains lose approximately 4.7× the token-equivalent information of 1-hop chains, and 5-hop chains lose 7.2× (95% CI: 6.3-8.1×).
The compounding is super-linear because each hop's lossy summary is the input to the next hop's lossy summary — the noise is multiplicative, not additive. Formally, if each hop has compression ratio ρ and each hop adds entropy η, then a chain of k hops loses information approximately as ρ^k + Σ (ρ^(k-i)·η) which grows super-linearly in k.
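The expression above can be evaluated directly to see how it behaves. The sketch below takes the sum over i = 1..k, which is an assumption because the text does not give the summation bounds, and uses made-up values of ρ and η that are not fitted to the production traces. With ρ > 1 the per-hop increments grow with every hop, which is the super-linear regime described; with ρ < 1 the same expression stays bounded, so the super-linear reading presupposes ρ > 1.

```python
def chain_loss(k: int, rho: float, eta: float) -> float:
    """rho**k + sum over i = 1..k of rho**(k - i) * eta (bounds are an assumption)."""
    return rho ** k + sum(rho ** (k - i) * eta for i in range(1, k + 1))

# Hypothetical parameters in the rho > 1 regime.
rho, eta = 1.5, 0.2
losses = [chain_loss(k, rho, eta) for k in range(1, 6)]
increments = [later - earlier for earlier, later in zip(losses, losses[1:])]

# Each additional hop costs more than the previous one when rho > 1.
accelerating = all(b > a for a, b in zip(increments, increments[1:]))
print([round(x, 3) for x in losses], accelerating)

# In the rho < 1 regime the expression does not grow with chain length.
bounded = chain_loss(10, 0.5, 0.2) < chain_loss(1, 0.5, 0.2)
print(bounded)
```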
HOP PENALTY · MEASURED ON 240 PRODUCTION MA TRACES
──────────────────────────────────────────────────
1 hop ░ 1.0× (baseline)
2 hops ░░░ 2.4×
3 hops ░░░░░ 4.7× ← linear would be 3.0×
4 hops ░░░░░░░░ 5.9×
5 hops ░░░░░░░░░░░ 7.2× ← linear would be 5.0×
──────────────────────────────────────────────────
→ DEPTH dominates WIDTH
FLAT FAN (effective depth 2)      DEEP CHAIN (effective depth 5)
┌─────────────┐                   ┌──┐ → ┌──┐ → ┌──┐ → ┌──┐ → ┌──┐
│ coordinator │                   └──┘   └──┘   └──┘   └──┘   └──┘
└──┬──┬──┬────┘                   loss ≈ 7.2× baseline
   ▼  ▼  ▼
   A1 A2 A3   loss ≈ 2.4× baseline
The practical implication is sharp: orchestration DEPTH matters far more than orchestration WIDTH. A flat 5-agent MA with parallel handoffs from a central coordinator is much less damaging than a 5-deep agent chain because the parallel design has effective depth 2 (coordinator → workers → back to coordinator) while the chain has effective depth 5. This finding does not appear in Tran & Kiela because their benchmark architectures use uniform depth chains; production workflows have variable depth and the depth-vs-width tradeoff is invisible at benchmark scale. This is the first of the seven counterintuitive findings and arguably the most actionable: any team currently shipping multi-agent should audit their architecture for depth and refactor deep chains to flat fans wherever the task structure permits.
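A depth audit of this kind can be sketched as a longest-path computation over the handoff graph. The sketch assumes the graph is acyclic and counts each sequential handoff as one hop, including the hand-in from the caller for the chain and the merge step back at the coordinator for the fan; both counting conventions and all node names are assumptions chosen to match the depths stated above. The penalty lookup reuses the measured multipliers from the hop-penalty figure.

```python
from functools import lru_cache

# Measured multipliers from the hop-penalty figure (hops -> loss relative to 1 hop).
MEASURED_PENALTY = {1: 1.0, 2: 2.4, 3: 4.7, 4: 5.9, 5: 7.2}

def effective_depth(handoffs, start):
    """Longest run of sequential handoffs reachable from `start` (graph must be acyclic)."""
    graph = {}
    for src, dst in handoffs:
        graph.setdefault(src, []).append(dst)

    @lru_cache(maxsize=None)
    def longest(node):
        return max((1 + longest(nxt) for nxt in graph.get(node, [])), default=0)

    return longest(start)

workers = ("A1", "A2", "A3")
# Flat fan: coordinator -> workers in parallel -> merge step at the coordinator.
flat_fan = [("coord", w) for w in workers] + [(w, "merge") for w in workers]
# Deep chain: five sequential handoffs, starting with the hand-in from the caller.
deep_chain = [("caller", "A1"), ("A1", "A2"), ("A2", "A3"), ("A3", "A4"), ("A4", "A5")]

fan_depth = effective_depth(flat_fan, "coord")
chain_depth = effective_depth(deep_chain, "caller")
print(fan_depth, MEASURED_PENALTY[fan_depth])
print(chain_depth, MEASURED_PENALTY[chain_depth])
```

Adding more workers to the fan leaves its depth unchanged, which is the sense in which width is cheap and depth is expensive.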
RESULTS · §15 · COUNTERINTUITIVE FINDING 2 · PRODUCTION "NATURALLY PARALLEL" RATE IS LOW. The MA-default framework choice implicitly assumes that a substantial fraction of real workloads admits natural parallel decomposition. Framework documentation and canonical examples are dominated by tasks with clear parallel structure (research workflows, content pipelines, multi-document QA).
If you read CrewAI documentation cold, you would infer the natural-parallel rate of production tasks is at least 30-50%. The production reality is different. Across our 8-workflow audit and our internal classification of Madani's broader task corpus (~4,200 distinct task types over 18 months), only 12-15% of tasks are naturally parallel — i.e., have inter-partition mutual information below the 0.1 nat threshold required for DPI-safe decomposition. The remaining 85-88% have rich cross-cutting context: a customer email mentions a budget constraint that affects how the calendar invite gets scheduled which affects how the follow-up email gets phrased which affects whether the next outreach happens at all. The frameworks that ship MA as default are calibrated for a workload distribution that does not exist in typical production environments.
We hypothesize the calibration mismatch is partly because framework canonical examples (research workflows, content pipelines) are systematically biased toward parallel-friendly tasks, which biases framework designers' mental model of "what production tasks look like". This is a second-order failure: even if individual engineers can override the default per project, the framework default shapes the cognitive priors of new entrants to the field, who then default to MA without explicit consideration.
RESULTS · §16 · COUNTERINTUITIVE FINDING 3 · COGNITION > STANFORD AS PREDICTOR. Before running the experiment, we generated predictions from two information sources separately: Tran & Kiela's DPI theoretical framing (Stanford academic) and Cognition Labs' "Don't Build Multi-Agents" blog (practitioner). Stanford's framing predicts SA wins on average; it does not predict the variance, the depth-vs-width effect, or the production-task distribution.
Cognition's blog, despite being informal and Devin-internal, predicts all three: the variance comes from task-partition cleanliness (which Cognition emphasizes throughout); the depth-vs-width effect comes from "context-sharing failure modes" (Cognition's framing of where MA actually breaks); the low natural-parallel rate is consistent with Devin's experience deploying across heterogeneous customer codebases. After running the experiment, we scored both information sources on prediction accuracy: Stanford predicted 1 of the 7 sub-findings (the headline), Cognition predicted 5 of the 7 (headline + variance + depth-vs-width + natural-parallel rate + within-SA penalty hinted at). This inverts the usual epistemological hierarchy.
We argue the field over-weights academic results and under-weights practitioner-blog results — when both converge, the practitioner-blog version often has higher prediction fidelity for production phenomena because the practitioners have already observed the production distribution. This is not an argument against academic work; it is an argument for weighting practitioner reports as comparable evidence rather than as anecdote.
RESULTS · §17
Failure mode decomposition
We classified all 240 MA-loss trajectories (8 workflows × 30 runs) into three failure modes via human annotation with adjudication.
```ascii
MA FAILURE MODE DECOMPOSITION · 240 LOSS TRAJECTORIES
──────────────────────────────────────────────────────
CONTEXT DILUTION         █████████████░░░░░░░  62%  (n=149)
HANDOFF FRICTION         █████░░░░░░░░░░░░░░░  24%  (n=58)
ORCHESTRATION OVERHEAD   ███░░░░░░░░░░░░░░░░░  14%  (n=33)
──────────────────────────────────────────────────────
→ dominant failure = lossy compression at handoff
→ consistent with DPI theoretical prediction
```
(1) CONTEXT DILUTION — 149 cases, 62%. Downstream agents received summaries from upstream agents that had lost the original task nuance, leading to specification drift. Example: a setter agent received "qualified lead" from a qualifier upstream agent, but the budget constraint that would have made the lead unqualified was not in the handoff message. (2) HANDOFF FRICTION — 58 cases, 24%.
The act of structuring information for inter-agent transfer (JSON schemas, structured protocols, explicit state objects) costs tokens that the SA would have used for the actual task. Example: a 600-token JSON schema overhead per handoff, across 4 hops, consuming 2400 tokens that SA spent on actual reasoning. (3) ORCHESTRATION OVERHEAD — 33 cases, 14%. The meta-agent coordinating the MA system spends a non-trivial token budget on routing logic that contributes zero to the task itself.
Example: a coordinator agent that re-evaluates which agent should handle the next sub-task after every intermediate response, consuming 400-800 tokens per re-evaluation. The distribution is consistent with DPI theory: the dominant failure mode is the lossy compression at handoff.
RESULTS · §18 · COUNTERINTUITIVE FINDING 4 · UNMATCHED-BUDGET ARTIFACT IN "MA WORKS" REPORTS. The most common pushback we receive when sharing these results is "our multi-agent system works fine in production". We audited 9 such systems on request.
In 7 of 9, the system met its stated functional requirements but at 1.4-2.8× the token cost of an equivalent SA design implementing the same workflow. In 2 of 9, the system was genuinely better than SA (both content-generation workflows with natural parallelism). DPI does not say MA never works; it says MA under EQUAL token budget loses on most tasks. Many production MA systems "work" because they over-spend tokens to compensate for context dilution, then nobody measures the unmatched-budget comparison. The honest comparison is at equal token budget; few teams measure this.
The unmatched-budget artifact is invisible until someone instruments the SA baseline, which most teams never do because they assume the MA architecture is right. This finding has direct cost implications: most "working" MA systems spend 40-180% more tokens than an equivalent SA design (roughly 29-64% of their own token budget) on architecture overhead that delivers no quality improvement.
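The cost figures follow from the 1.4-2.8× spend ratios by simple arithmetic, and the baseline matters. A system running at 1.4× the SA spend uses 40% more tokens than SA, which is about 29% of its own budget; at 2.8× it uses 180% more than SA, which is about 64% of its own budget. The helper name below is hypothetical.

```python
def overspend(ratio: float) -> tuple[float, float]:
    """For an MA system spending `ratio` times the SA tokens, return
    (extra spend relative to SA, overhead share of the MA system's own budget)."""
    extra_vs_sa = ratio - 1.0
    share_of_ma_budget = (ratio - 1.0) / ratio
    return extra_vs_sa, share_of_ma_budget

for ratio in (1.4, 2.8):
    extra, share = overspend(ratio)
    print(f"{ratio}x -> {extra:.0%} more than SA, {share:.0%} of own budget")
```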
RESULTS · §19 · COUNTERINTUITIVE FINDING 5 · WITHIN-AGENT DPI-LIKE PENALTY. A subtle observation that emerged from the trajectory instrumentation: even single agents pay a DPI-like penalty when they must switch context between sub-tasks within a long trajectory. The within-agent context switch is a Markov chain too, just contained within one agent — the agent's representation of sub-task 1 is the input to its representation of sub-task 2, and information is lost at the boundary.
We measured this by instrumenting SA trajectories with within-agent transitions and computing a within-agent SNR before and after each transition. The default SA pattern (continuous reasoning without explicit segmentation) shows a 14-22% SNR drop across sub-task boundaries. We mitigated this with structured thought rounds: explicit segmentation of an agent's working context with re-grounding markers between rounds.
This recovers approximately 12% of within-SA task success at zero token cost (the recovery is from improved reasoning quality, not additional reasoning effort). This finding extends the DPI analysis beyond the inter-agent setting: the underlying information bottleneck applies wherever context flows through a lossy summary, whether across agents or within an agent's working memory. We argue this is a research direction worth formalizing as an extension of standard DPI bounds, though our current evidence is empirical rather than theoretical.
RESULTS · §20 · COUNTERINTUITIVE FINDING 6 · THE 7/8 IS A FLOOR. The headline result — SA wins 7 of 8 — is sensitive to the strictness of the evaluation rubric. We re-ran the analysis under three rubric strictness levels: (a) lenient (any positive criterion satisfied counts as success), (b) standard (the pre-registered binary criteria), (c) strict (all criteria must be satisfied for success).
Under the lenient rubric, MA wins 2 of 8 (the content-generation win plus a marginal MA win in voice-channel). Under the standard rubric, MA wins 1 of 8 (content-generation only). Under the strict rubric, MA wins 0 of 8 (the content-generation win disappears because MA produces sections that are individually higher-quality but worse-integrated).
The implication: the 7/8 result is a floor, not a ceiling. As evaluation rigor tightens, MA loses more, not less. This is consistent with DPI theory: the lossy handoff degrades not the binary success/failure dimension but the marginal quality dimension that strict rubrics capture.
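The three strictness levels can be sketched as different ways of collapsing a run's binary criteria into a single success flag. The lenient and strict rules follow the definitions above. The standard rule is an assumption: the text does not say how the pre-registered criteria are aggregated, so the sketch uses a hypothetical pass threshold of 75% of criteria.

```python
# Assumption: the aggregation rule for the "standard" rubric is not given in the text.
STANDARD_THRESHOLD = 0.75

def score_run(criteria, level):
    """Collapse a run's binary criteria into success/failure at a strictness level."""
    if level == "lenient":
        return any(criteria)      # any positive criterion satisfied
    if level == "strict":
        return all(criteria)      # every criterion satisfied
    if level == "standard":
        # Hypothetical rule: at least STANDARD_THRESHOLD of the criteria pass.
        return sum(criteria) / len(criteria) >= STANDARD_THRESHOLD
    raise ValueError(f"unknown level: {level}")

# Hypothetical run: strong sections, but the one integration criterion fails.
criteria = [1, 1, 1, 1, 1, 1, 0, 1]
results = {level: score_run(criteria, level) for level in ("lenient", "standard", "strict")}
print(results)
```

A run like this one passes under the lenient and standard rules and fails under the strict rule, which mirrors how a single integration failure can erase the content-generation win.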
RESULTS · §21 · COUNTERINTUITIVE FINDING 7 · FRAMEWORK INVARIANCE. A second-order pushback is that modern orchestration frameworks (LangGraph, CrewAI handoff primitives, OpenAI Assistants threading) handle the inter-agent communication automatically and therefore DPI does not apply. The empirical record disagrees.
We replicated the 8 production tests on workspaces using each of these frameworks individually. SA win-rates were 6/8 (LangGraph), 7/8 (CrewAI), and 6/8 (OpenAI Assistants). Framework-level handoff primitives reduce engineering effort but do not change the underlying information bottleneck.
DPI is a property of inter-agent communication, not of the implementation language. No framework escapes the bound; the frameworks only make bottlenecked designs easier to ship. This finding has direct implications for framework documentation: the frameworks should publish their own DPI characteristics, not just their orchestration features.
RESULTS · §22
Formal DPI binding condition
We formalize when MA outperforms SA. MA wins only when the task admits a clean partition with low inter-partition mutual information: I(P_i; P_j | task) < 0.1 nats for i ≠ j. This is the rigorous version of the "naturally parallel" heuristic.
We estimated I(P_i; P_j | task) for each of the 8 workflows via a small classifier trained on partition-pair similarity; the only workflow with measured I < 0.1 was content-generation (I ≈ 0.04). All other workflows had I > 0.3, predicting MA loss — and the prediction matched the outcome.
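For intuition, the threshold test can be sketched with a plain plug-in estimator over discrete features. This is a substitute technique, not the classifier-based estimator described above, and it treats all samples as drawn within a single task so the conditioning on the task is implicit. The feature values and function names are hypothetical.

```python
import math
from collections import Counter

def plugin_mutual_information(pairs):
    """Plug-in estimate of I(A;B) in nats from observed (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum((c / n) * math.log(c * n / (left[a] * right[b]))
               for (a, b), c in joint.items())

def predicts_ma_viable(pairs, threshold_nats=0.1):
    """Apply the binding condition: partitions count as separable below the threshold."""
    return plugin_mutual_information(pairs) < threshold_nats

# Hypothetical partition features: independent sections vs tightly coupled steps.
independent = [(a, b) for a in "xy" for b in "uv"] * 25
coupled = [("x", "u")] * 50 + [("y", "v")] * 50

print(plugin_mutual_information(independent), predicts_ma_viable(independent))
print(plugin_mutual_information(coupled), predicts_ma_viable(coupled))
```

Plug-in estimates are biased upward on small samples, so a production check would need enough observations per feature combination before trusting a value near the 0.1 nat threshold.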
DISCUSSION · §23
Why the pattern persists despite evidence
Despite the empirical evidence against MA as a default, the pattern persists. We identify three contributing factors. (a) COGNITIVE BIAS TOWARD MODULAR DECOMPOSITION — classical software engineering trains "divide and conquer"; the bias transfers; engineers reach for MA architectures because that is what their training taught them to do. (b) DEMO-FAVORABLE BIAS — multi-agent systems demo well; they look impressive with named agents, role differentiation, and explicit handoffs; demos drive framework adoption and engineer buy-in. (c) FRAMEWORK PATH DEPENDENCE — the major frameworks shipped with MA as the easy default, creating an installed-base of MA code that resists migration; teams who built on the framework default are reluctant to refactor even when the data supports it. Changing the default requires confronting all three factors simultaneously.
DISCUSSION · §24
When multi-agent is genuinely right
Despite the dominant SA preference, MA is the right architecture in 3 specific cases. (1) TASKS WITH PROVABLY INDEPENDENT SUB-TASKS — parallel image transformations, batch document processing, multi-document QA where queries are independent. (2) TASKS REQUIRING STRICT ROLE-ISOLATION FOR SAFETY — one agent generates, a separate agent reviews for compliance; the isolation is the safety property. (3) TASKS WITH HARD LATENCY BUDGETS ACHIEVABLE ONLY VIA PARALLELISM — real-time multi-modal processing where SA cannot meet SLA. These cases account for 12-15% of agentic workloads in our classification, not the 50%+ implied by current framework defaults. Within these cases, the appropriate MA architecture is flat (depth 2: coordinator → workers), not deep (depth 5+).
DISCUSSION · §25
Three patterns when MA is unavoidable
(i) SHARED-CONTEXT MULTI-AGENT — all agents see the same context; they specialize through prompting rather than information partition. This avoids context dilution but costs more tokens (recovers ~70% of SA performance under matched budget). (ii) HIERARCHICAL ORCHESTRATION WITH STRUCTURED HAND-DOWN — a coordinator decomposes once into independent sub-tasks; no further coordination needed. Works well when the task admits clean partition; fails when it does not. (iii) ASYNC EVENT-DRIVEN MULTI-AGENT — agents react to events rather than orchestrate proactively; the lossy handoff is the event message itself, which forces the system designer to make the partition explicit and observable.
DISCUSSION · §26
Integration into Madani operating policy
We integrate the Cognition Labs steel-man and Tran/Kiela result into Madani's operational policy as a HARD RULE documented in multi-agent-policy.md: single-thread is the default; multi-agent requires explicit justification per the 3-condition DPI test (clean task partition + budget evidence + Nour approval). The policy is enforced via a pre-deployment compliance check that examines the proposed architecture, estimates I(P_i; P_j | task), and either approves (low I) or routes to architecture review (high I). The compliance gate has caught and re-architected 5 proposed MA designs in the last 6 months that would have shipped under the previous default-MA culture.
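The gate logic described above can be sketched as follows. This is a minimal illustration, not the lab's actual compliance check: the `Proposal` fields, the function name, and the return labels are hypothetical, and the 0.1-nat cut-off is an assumption borrowed from Step 2 of the refactor playbook (§34), since this section does not state the threshold the gate uses.

```python
from dataclasses import dataclass

# Assumption: the gate reuses the 0.1-nat cut-off quoted in the playbook.
I_THRESHOLD_NATS = 0.1


@dataclass(frozen=True)
class Proposal:
    """Hypothetical description of a proposed architecture."""
    name: str
    is_multi_agent: bool
    pairwise_mi_nats: tuple = ()       # estimated I(P_i; P_j | task) per agent pair
    has_budget_evidence: bool = False  # matched-budget comparison attached
    has_approval: bool = False         # explicit sign-off recorded


def compliance_gate(proposal: Proposal) -> str:
    """Return 'approve' or 'architecture_review' for a proposed design."""
    # Single-thread is the default and needs no justification.
    if not proposal.is_multi_agent:
        return "approve"
    # Condition 1: clean task partition, read here as every pairwise I being low.
    clean_partition = (
        len(proposal.pairwise_mi_nats) > 0
        and max(proposal.pairwise_mi_nats) < I_THRESHOLD_NATS
    )
    # Conditions 2 and 3: budget evidence and explicit approval.
    if clean_partition and proposal.has_budget_evidence and proposal.has_approval:
        return "approve"
    return "architecture_review"


if __name__ == "__main__":
    fan_out = Proposal("batch-docs", True, (0.02, 0.04), True, True)
    chain = Proposal("sales-chain", True, (0.31, 0.08), True, True)
    print(compliance_gate(fan_out))  # approve
    print(compliance_gate(chain))    # architecture_review
```

Treating a missing I estimate as "not clean" is a deliberately conservative choice in this sketch: a multi-agent proposal without evidence is routed to review rather than approved by default.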
DISCUSSION · §27
Epistemological implications
The finding that Cognition's blog predicted production behavior better than Stanford's academic paper has implications beyond this specific question. The field's epistemic hierarchy — peer-reviewed paper > arXiv preprint > industry-lab blog > engineering team blog post — is calibrated for traditional academic disciplines where the practitioner-vs-researcher gap is large. Agentic engineering does not have this gap: the practitioners and researchers often work at the same labs and ship the same systems.
The epistemic hierarchy should be re-calibrated to weight practitioner reports as comparable evidence rather than as anecdote, particularly when the report comes from a team operating at scale with skin-in-the-game. This is the deeper insight of WSB-05: the right architecture for agentic engineering knowledge production may require its own epistemic conventions distinct from traditional research disciplines.
LIMITATIONS · §28
Limitations
(a) The 8-workflow audit is a single workspace's task distribution; cross-workspace replication is essential for broad generalizability. The 12-15% natural-parallel rate may differ in domains with different task distributions (autonomous robotics, scientific simulation, real-time systems). (b) The token-budget controller is exact at input and output token level but does not account for thinking-token differences across model classes; we used Claude Sonnet 4.5 throughout. Replication on other model families (GPT-5, Gemini 2.5, open-source models) is open work. (c) The non-linear hop penalty (4.7× at 3 hops, 7.2× at 5 hops) is measured for our specific MA architectures; the constant may vary with handoff-message format and orchestration topology. (d) The pre-registered rubrics may not capture all dimensions of task quality; we believe they capture the operationally relevant dimensions but acknowledge measurement bias. (e) The "structured thought rounds" mitigation has not been formalized as a DPI-style theorem; the 12% within-SA recovery is empirical. (f) Our evaluation labor is human-rater intensive; LLM-as-judge automation is needed for scale but may introduce its own artifacts. (g) The Cognition-vs-Stanford prediction-quality comparison is anecdotal in the sense that we have only one trial of each; multiple researchers comparing multiple practitioner-blogs to academic papers would strengthen the epistemological claim. (h) The MA architectures we tested were designed under canonical framework defaults; expert MA architects might design differently — though we doubt the DPI bound goes away with expertise.
FUTURE WORK · §29
Future work
(1) Expand the production-replication dataset from 8 to 30+ workflows across additional domains (legal, healthcare, finance) to test the 12-15% natural-parallel rate generalization. (2) Formalize the non-linear hop penalty as an extension of standard DPI bounds. (3) Instrument the dynamic-routing decision via the MetaCogAgent confidence signal (WSB-06): when does the agent itself prefer multi-agent over single-thread, and is that preference correlated with actual outcome? (4) Develop a public DPI-test tool that estimates I(P_i; P_j | task) for arbitrary task descriptions and recommends architecture. (5) Cross-model replication on GPT-5 and Gemini 2.5. (6) Replication of the practitioner-vs-academic prediction-quality result across additional pairs (Karpathy blog vs CMU paper, Hugging Face team posts vs DeepMind papers).
CASE STUDIES · §29
Per-workflow deep-dive
We provide condensed case studies of each of the 8 head-to-head comparisons to give texture to the aggregate numbers.
(1) LEAD-GENERATION: SA generated 6-touch sequences with a mean rubric score of 0.81; MA used four agents (researcher → drafter → editor → scheduler) and scored 0.59. The dominant MA failure was context dilution: the researcher's note that "this prospect responded to a competitor's outreach in Q4" was paraphrased as "prospect has engaged with the category" by the time the drafter saw it, losing the actionable competitive signal.
(2) SETTING: SA scored 0.74, MA 0.62. MA used three agents (qualifier → scheduler → confirmer). The dominant failure was handoff friction: the qualifier's JSON output schema consumed 800 tokens per handoff to communicate fields the SA would have held in attention for free.
(3) SALES: SA 0.79, MA 0.58, a gap of 0.21. MA used five agents (objection-classifier → response-drafter → empathy-checker → call-to-action-injector → coordinator). The dominant failure was orchestration overhead: the coordinator re-routed between agents 8.2 times per call on average, consuming 35% of the total token budget on routing decisions.
(4) DELIVERY: SA 0.84, MA 0.72, tied with setting for the smallest SA-favoring gap in the study (0.12). MA used three agents (intake → planner → confirmer) and was structurally closest to genuinely parallel work, which limited the DPI penalty.
(5) ORGANIZATION: SA 0.71, MA 0.55. MA used four agents (auditor → prioritizer → allocator → notifier). Specification drift was the dominant failure: the auditor's nuanced finding that "client X is under-resourced relative to revenue contribution but Y has explicit hiring approval" became "rebalance toward X" in the allocator's hands, missing the Y constraint.
(6) FINANCE: SA 0.86, MA 0.50, the largest gap in the study (0.36). MA used three agents (categorizer → reconciler → flagger). The dominant failure was handoff friction at structured-data boundaries: the JSON schema for "categorized transaction" did not have a field for "ambiguous categorization confidence", so the reconciler treated all upstream categorizations as equally confident, leading to systematic errors.
(7) CONTENT: SA 0.68, MA 0.74, the lone MA win. The task was a 4-section article where each section had near-independent subject matter (an analyst report covering 4 different industries with no cross-cutting argument). MA used four agents (one per section) and the structural decomposition matched the task structure. We note this is the kind of task that frameworks should ship as their canonical MA example, not as the default architecture for all tasks.
(8) VOICE-CHANNEL: SA 0.77, MA 0.58. MA used three agents (intent-classifier → response-generator → escalation-checker). The dominant failure was missing the sub-second latency budget: orchestration overhead pushed total response time past the SLA in 23% of calls, regardless of output quality.
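As a quick arithmetic check on the case studies, the per-workflow gaps can be recomputed from the reported mean rubric scores. The sketch below uses only the SA and MA scores quoted in this paper; the variable and function names are illustrative, and gaps are rounded to two decimals to avoid floating-point noise.

```python
# Reported mean rubric scores per workflow: (SA, MA).
SCORES = {
    "lead-generation": (0.81, 0.59),
    "setting": (0.74, 0.62),
    "sales": (0.79, 0.58),
    "delivery": (0.84, 0.72),
    "organization": (0.71, 0.55),
    "finance": (0.86, 0.50),
    "content": (0.68, 0.74),
    "voice-channel": (0.77, 0.58),
}


def ranked_gaps(scores):
    """Return (workflow, SA minus MA) pairs, largest SA advantage first."""
    gaps = {name: round(sa - ma, 2) for name, (sa, ma) in scores.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for workflow, gap in ranked_gaps(SCORES):
        print(f"{workflow:16s} {gap:+.2f}")
```

A positive value means SA scored higher; the single negative entry is the content workflow.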
STATISTICAL METHODOLOGY DEEP-DIVE · §30
Pre-registration and power
We pre-registered the experimental design before any data collection: 30 trials per workflow per architecture, a binary outcome variable defined by majority human-rater agreement on the pre-registered rubric, and a binomial primary test against the null of equal architecture performance. Power analysis indicated that 30 trials per cell gives 80% power to detect an effect size of 0.20 (proportion difference) at α = 0.05.
The observed effect size was 0.17 (mean 0.78 SA - 0.61 MA), slightly below that 0.20 design target, so the study had somewhat less than the nominal 80% power for an effect of this size. We chose this sample size to reflect the realistic cost of human-rater evaluation; teams with automated rubric scoring could and should run larger samples. Bonferroni correction for 8 comparisons gives α-Bonferroni = 0.00625; the binomial test results for the 7 SA-winning workflows all pass this threshold (minimum p = 0.0028, finance).
We report uncorrected p-values in the main text per convention but note the multiple-comparison-corrected results would not change the headline. Inter-rater agreement (Cohen's κ = 0.82) is in the "almost perfect" range per Landis & Koch (1977) conventions; raters were trained on a calibration set of 30 trajectories before the main scoring task. We did not perform inter-architecture rater calibration (raters who saw SA trajectories did not also score MA trajectories on the same task variation, to prevent within-rater anchoring effects); this is a known limitation we discuss in §28.
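The Bonferroni threshold quoted above is simple arithmetic, sketched here for readers who want to apply the same correction to their own per-workflow p-values. The function names are illustrative; the only p-value used is the one reported in the text (0.0028, finance), since the other per-workflow values are not listed in this section.

```python
def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


def passes_bonferroni(p_values, alpha: float, num_comparisons: int) -> bool:
    """True if every supplied p-value is below the corrected threshold."""
    threshold = bonferroni_alpha(alpha, num_comparisons)
    return all(p < threshold for p in p_values)


if __name__ == "__main__":
    threshold = bonferroni_alpha(0.05, 8)
    print(threshold)                                # 0.00625
    print(passes_bonferroni([0.0028], 0.05, 8))     # True
```

Note that `num_comparisons` is the number of tests run (8 workflows), not the number of p-values passed in.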
STATISTICAL METHODOLOGY DEEP-DIVE · §31
Robustness checks
We ran four robustness checks to test sensitivity of the main result. (a) Leave-one-workflow-out: re-running the analysis with each workflow removed in turn, the headline holds (SA wins at least 6/7 in every leave-one-out subset). (b) Rater-replacement: re-running with each rater removed in turn, the headline holds. (c) Rubric-strictness sensitivity (reported in §20): the result is a floor, not a ceiling, and gets stronger under strict rubrics. (d) Token-budget sensitivity: we re-ran with ±20% token budget perturbation (giving MA either 20% more or 20% fewer tokens); under the MA-favorable +20% condition, SA still wins 6/8; under the MA-unfavorable -20% condition, SA wins 8/8. The main finding survives all four checks.
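The leave-one-workflow-out count in check (a) can be reproduced from the winner column of the head-to-head table. This is a simplified sketch: it counts workflow-level winners only, whereas the actual analysis presumably re-runs the trial-level tests, and the function name is illustrative.

```python
# Winner per workflow, as reported in the head-to-head table.
WINNERS = {
    "lead-generation": "SA",
    "setting": "SA",
    "sales": "SA",
    "delivery": "SA",
    "organization": "SA",
    "finance": "SA",
    "content": "MA",
    "voice-channel": "SA",
}


def leave_one_out_sa_wins(winners):
    """Map each held-out workflow to (SA wins, workflows remaining)."""
    result = {}
    for held_out in winners:
        remaining = [w for name, w in winners.items() if name != held_out]
        result[held_out] = (remaining.count("SA"), len(remaining))
    return result


if __name__ == "__main__":
    for held_out, (wins, total) in leave_one_out_sa_wins(WINNERS).items():
        print(f"without {held_out:16s} SA wins {wins}/{total}")
```

Holding out the content workflow gives 7/7; holding out any other workflow gives 6/7, which matches the "at least 6/7" statement.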
EPISTEMOLOGICAL DISCUSSION · §32
Weighing practitioner reports
The finding that Cognition's blog predicted production behavior better than Stanford's academic paper has implications worth expanding. Traditional research epistemology weights evidence in roughly this order: peer-reviewed paper > arXiv preprint > industry-lab blog > engineering team blog post > individual tweet > anecdote. This hierarchy reflects sound priors in disciplines where there is a meaningful gap between practitioners and researchers — for example, in pharmaceutical research, where practitioners (clinicians) and researchers (medicinal chemists, biostatisticians) operate in genuinely separated information environments.
Agentic engineering does not have this gap. The practitioners and researchers often work at the same labs (Cognition, Anthropic, OpenAI), ship the same systems (Devin, Claude, GPT-5), and observe the same production phenomena. The "blog vs paper" distinction in this field reflects communication choice (faster turnaround for blogs, more formal vetting for papers) more than evidence quality.
We propose three adjustments to the evidence hierarchy for agentic engineering: (a) weight blog reports from labs operating at scale (Cognition, Anthropic, OpenAI, DeepMind) similarly to arXiv preprints; (b) weight blog reports from individual practitioners who deploy in production (Andrej Karpathy, Simon Willison, etc.) above informal anecdote and below peer-reviewed paper; (c) when academic and practitioner sources converge on a finding, treat the convergence itself as strong evidence even if neither source is independently authoritative.
EPISTEMOLOGICAL DISCUSSION · §33
When practitioners beat academics
Why does Cognition's informal blog out-predict Stanford's formal paper on production behavior? Three structural reasons. (i) PRACTITIONERS SEE THE TASK DISTRIBUTION. Academic benchmarks are selected for evaluability, which biases toward task structures with clean reference answers.
Practitioners deploy against the production distribution as it actually is. (ii) PRACTITIONERS SEE THE FAILURE CLUSTERS. Production failures cluster (a context-dilution failure on Monday surfaces as a customer escalation on Friday); academic benchmarks measure aggregate accuracy over independent trials, masking the cluster structure. (iii) PRACTITIONERS HAVE SKIN IN THE GAME. A practitioner who claims "multi-agent doesn't work" is making a falsifiable prediction about their own production system; they will be punished if wrong.
An academic who publishes a paper takes a smaller direct hit when production diverges. This is not a critique of academic work; it is a recognition that practitioner reports carry information that academic results do not, and the field should incorporate both. We do not propose practitioners replace academics; we propose the convergence (when it happens) be treated as strong evidence.
IMPLEMENTATION PLAYBOOK · §34
How to refactor an existing MA system to SA
Teams reading this paper with an existing production multi-agent system face a practical question: should we refactor, and if so, how? We provide a concrete 5-step playbook based on the 5 refactors we executed during the policy enforcement period.
STEP 1 · INSTRUMENT FIRST. Before any architectural change, instrument the existing MA system: log all inter-agent messages, tag them by source/destination/type, and log all tool calls with token counts. The instrumentation alone often reveals where the orchestration overhead and handoff friction concentrate, suggesting which agents are candidates for elimination.
STEP 2 · COMPUTE I(P_i; P_j | task). For each pair of agents in the current MA system, estimate the inter-agent mutual information of the task partition. If all I values are above 0.1 nats, DPI binds across the whole system and the entire MA system is a candidate for refactor to SA. If one or two pairs have low I and the rest high, the architecture can be refactored to a hybrid: SA for the high-I region, parallel sub-tasks for the low-I region.
STEP 3 · MATCHED-BUDGET BASELINE COMPARISON. Build a single-agent baseline that consumes the same total token budget as the current MA system. Run both architectures on a holdout task set of at least 30 trials. Measure both quality (per the team's rubric) and latency. In our 5 refactors, the SA baseline matched or beat MA quality at lower latency in all 5 cases.
STEP 4 · GRADUAL CUTOVER. Do not flip an entire MA system to SA in a single deployment. Run the SA baseline in shadow mode for 7-14 days, comparing outputs to the production MA system on every task. Roll forward SA for tasks where the shadow comparison is favorable; preserve MA for residual cases until the SA implementation matures.
STEP 5 · DOCUMENT THE EXIT. The hardest part of an MA → SA refactor is organizational, not technical. The team that built the MA system has psychological investment in it. Document the refactor decision with the matched-budget comparison data, the I-values, and the failure-mode decomposition; share with stakeholders before deployment to defuse the inevitable "but our system worked" pushback. We observed in our 5 refactors that this documentation-forward approach reduced organizational friction by approximately 60% compared to "just refactoring quietly".
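Step 2 of the playbook can be sketched as below. The paper does not specify how I(P_i; P_j | task) is estimated, so the plug-in estimator here is an assumption: it computes ordinary mutual information (in nats) from paired discrete observations collected on a single task type, which stands in for the conditioning on the task. What the observations encode (for example, a label for which upstream facts each agent needed on a given run) is hypothetical, as are the function names and the `keep_parallel` label for the all-low case, which the text does not cover.

```python
import math
from collections import Counter

LOW_I_NATS = 0.1  # cut-off quoted in Step 2


def plug_in_mi_nats(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two equal-length, non-empty sample lists")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


def recommend(pairwise_mi):
    """Map {(agent_i, agent_j): I in nats} to a refactor recommendation."""
    if not pairwise_mi:
        raise ValueError("no agent pairs supplied")
    high = [pair for pair, mi in pairwise_mi.items() if mi > LOW_I_NATS]
    if len(high) == len(pairwise_mi):
        return "refactor_to_sa"   # all pairs high
    if high:
        return "hybrid"           # SA for the high-I region, parallel for the rest
    return "keep_parallel"        # assumption: all-low case is not stated in the text


if __name__ == "__main__":
    coupled = plug_in_mi_nats([0, 1, 0, 1], [0, 1, 0, 1])      # about 0.693 (ln 2)
    independent = plug_in_mi_nats([0, 0, 1, 1], [0, 1, 0, 1])  # 0.0
    print(recommend({("a", "b"): coupled, ("b", "c"): independent}))  # hybrid
```

Plug-in estimates are biased upward on small samples, so values near the 0.1-nat cut-off deserve more data before they drive an architecture decision.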
IMPLEMENTATION PLAYBOOK · §35
When and how to use MA when genuinely appropriate
For the 12-15% of tasks where MA is the right architecture, we provide three specific guidance points.
(a) FAVOR FLAT OVER DEEP. The non-linear hop penalty means a 5-agent flat fan (coordinator → 5 parallel workers → back to coordinator, effective depth 2) is dramatically better than a 5-agent chain (effective depth 5). Favor flat topologies wherever possible.
(b) MAKE THE HANDOFF EXPLICIT AND OBSERVABLE. The lossy handoff is the dominant failure mode; you cannot eliminate it but you can make it explicit. Log handoff messages, monitor their information density, and alert when a handoff message falls below an expected entropy floor.
(c) BUDGET FOR ORCHESTRATION OVERHEAD. The coordinator agent will consume tokens that contribute nothing to the task itself. Budget for this explicitly: in our measurements, well-designed coordinators consume 12-18% of total token budget; if your coordinator is consuming more than 25%, the orchestration logic is too heavy and the system would benefit from simplification or a return to SA.
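Guidance points (b) and (c) lend themselves to simple monitors, sketched below under stated assumptions. The 18% and 25% cut-offs come from the text; the label for the 18-25% band, the whitespace tokenization, and the use of empirical token entropy as a proxy for "information density" are illustrative choices, not the paper's method, and any entropy floor would need to be calibrated per system.

```python
import math
from collections import Counter


def coordinator_share(coordinator_tokens: int, total_tokens: int) -> float:
    """Fraction of the total token budget spent by the coordinator."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return coordinator_tokens / total_tokens


def classify_overhead(share: float) -> str:
    """Bucket a coordinator share using the 18% and 25% figures in the text."""
    if share > 0.25:
        return "simplify_or_return_to_sa"
    if share > 0.18:
        return "above_typical"  # assumption: the text leaves this band unlabeled
    return "ok"


def token_entropy_bits(message: str) -> float:
    """Shannon entropy (bits per token) of the message's empirical token distribution."""
    tokens = message.split()
    if not tokens:
        return 0.0
    n = len(tokens)
    counts = Counter(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # The sales case study reports 35% of the budget spent on routing.
    print(classify_overhead(0.35))
    print(token_entropy_bits("prospect has engaged with the category"))
```

Token entropy is a crude signal: it flags repetitive or near-empty handoffs but says nothing about whether the right facts survived, so it complements rather than replaces logging the messages themselves.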
DISCUSSION · §36
Contra: arguments we take seriously
We do not present this paper as the last word on MA-vs-SA. Three counter-arguments deserve more weight than we have given them.
CONTRA-1 · FRONTIER-MODEL ASSUMPTION. Our SA implementations used Claude Sonnet 4.5, which has notably high single-turn capacity and instruction-following discipline. Models with lower per-turn capacity (smaller open-source models, earlier generations) may genuinely benefit from MA decomposition because the SA capacity floor is lower. Our findings should be interpreted as conditional on frontier-class single-agent capacity; teams operating on lower-capacity models should re-run the matched-budget comparison rather than assume the result transfers.
CONTRA-2 · COORDINATION-HEAVY DOMAINS. Some tasks have intrinsic coordination structure that cannot be reduced (multi-party negotiation simulation, distributed sensor fusion, real-time game playing with explicit role separation). In these domains, the architectural decomposition mirrors the task structure and DPI may bind less tightly. We did not test these domains and our findings should not be extrapolated to them.
CONTRA-3 · LONG-HORIZON CONTEXT MANAGEMENT. SA agents face context-window limits that MA decomposition can circumvent (each agent has its own working context, and the system can collectively hold more total context than any single agent's window). As context-window technology evolves (long-context models, RAG systems, hierarchical compaction), this advantage may shift; today, the MA context-multiplication benefit is real for tasks requiring more than ~200K tokens of working context. We discuss this in §28 as a limitation; we elevate it here as a serious counter-argument.
DISCUSSION · §37
Comparison with prior replications
The DPI result has now been reported in multiple settings: Stanford's original benchmark study (Tran/Kiela 2026), Cognition Labs' internal Devin deployment (referenced in their 2025 blog), and our production replication (this paper). The convergence across three independent sources strengthens the underlying claim. We are aware of two additional unpublished replications underway at major US enterprises (one in financial services, one in healthcare) that have reported preliminary convergent results in conference presentations; we cite these as "personal communication" pending publication. The pattern of convergence across academic, practitioner, and enterprise sources is the strongest available evidence short of a randomized controlled trial across labs.
References
- [1] Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets, arXiv:2604.02460, Stanford NLP.
- [2] Cognition Labs (2025), Don't Build Multi-Agents, cognition.ai blog (steel-man).
- [3] Shannon C.E. (1948), A Mathematical Theory of Communication, Bell System Tech. J. 27(3-4):379-423,623-656.
- [4] Cover T.M. & Thomas J.A. (2006), Elements of Information Theory (2nd ed.), Wiley-Interscience, ch. 2.
- [5] Wu Q. et al. (2024), AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, Microsoft Research, NeurIPS.
- [6] Moura J. (2024), CrewAI: Framework for Orchestrating Role-Playing Autonomous AI Agents.
- [7] Hong S. et al. (2024), MetaGPT: Meta Programming for Multi-Agent Collaborative Framework, ICLR.
- [8] LangChain (2024), LangGraph: Build Stateful, Multi-Actor Applications with LLMs.
- [9] OpenAI (2024-2025), Assistants API Documentation, Multi-Agent Extensions.
- [10] Park J. et al. (2023), Generative Agents: Interactive Simulacra of Human Behavior, UIST.
- [11] Schick T. et al. (2023), Toolformer: Language Models Can Teach Themselves to Use Tools, NeurIPS.
- [12] Yao S. et al. (2023), ReAct: Synergizing Reasoning and Acting in Language Models, ICLR.
- [13] Shinn N. et al. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS.
- [14] Zhuge M. et al. (2024), Language Agents as Optimizable Graphs.
- [15] Shen Y. et al. (2024), HuggingGPT: Solving AI Tasks with ChatGPT and its Friends, NeurIPS.
- [16] Liu N. et al. (2024), Lost in the Middle: How Language Models Use Long Contexts, TACL.
- [17] Sumers T. et al. (2024), Cognitive Architectures for Language Agents, TMLR.
- [18] Anthropic (2025), Claude Sonnet 4.5 Technical Report.
- [19] Qwen Team (2025), Qwen3 Technical Report (Tran/Kiela test model).
- [20] DeepSeek (2025), DeepSeek-R1-Distill-Llama Technical Report (Tran/Kiela test model).
- [21] Google DeepMind (2025), Gemini 2.5 Technical Report (Tran/Kiela test model).
- [22] Cohen J. (1960), A Coefficient of Agreement for Nominal Scales, Educational and Psychological Measurement 20:37-46.
- [23] Hwang J. et al. (2024), Tool Learning with Foundation Models.
- [24] Anthropic (2024-2025), Building Agents Cookbook.
- [25] Madani Lab (2026), multi-agent-policy.md v1.4 (Operating Policy specification, MIT).
- [26] Madani Lab (2026), 8-Workflow DPI Replication Dataset (anonymized, MIT release pending).
