Abstract
We adopt the MAST taxonomy (Multi-Agent System failure Taxonomy) proposed by Cemri, Pan, Yang, Agrawal, Chopra, Tiwari, Keutzer, Parameswaran, Klein, Ramchandran, Zaharia, Gonzalez, and Stoica (arXiv:2503.13657v3, NeurIPS 2025 Datasets and Benchmarks Track, UC Berkeley + Intesa Sanpaolo) for production reliability analysis at Madani Lab. The original Cemri et al. work is the first empirically-grounded taxonomy of multi-agent LLM system (MAS) failures, derived via Grounded Theory analysis (Glaser and Strauss, 1967) of 150 traces across 5 popular MAS frameworks, validated with κ = 0.88 inter-annotator agreement across 3 expert annotators, and applied at scale to MAST-Data (1642 annotated execution traces from 7 MAS frameworks: ChatDev, MetaGPT, HyperAgent, AppWorld, AG2/MathChat, Magentic-One, OpenManus, run across GPT-4 series and Claude series models on coding, math, and general-agent tasks). The taxonomy comprises 14 fine-grained failure modes organized into 3 categories: FC1 System Design Issues (44.2% of all failures, modes FM-1.1 through FM-1.5), FC2 Inter-Agent Misalignment (32.3%, FM-2.1 through FM-2.6), FC3 Task Verification (23.5%, FM-3.1 through FM-3.3).
```ascii
MAST · 14 FAILURE MODES · 3 CATEGORIES · BENCHMARK DISTRIBUTION
──────────────────────────────────────────────────────────────
FC1 System Design Issues     █████████░░░░░░░░░░░ 44.2%
  FM-1.1 Disobey task spec
  FM-1.2 Disobey role spec
  FM-1.3 Step repetition     ← #1 single failure mode
  FM-1.4 Loss of conversation history
  FM-1.5 Unaware of termination
FC2 Inter-Agent Misalignment ███████░░░░░░░░░░░░░ 32.3%
  FM-2.1 Conversation reset · 2.2 Fail to ask clarification
  FM-2.3 Task derailment · 2.4 Information withholding
  FM-2.5 Ignored input · 2.6 Reasoning-action mismatch ← #2
FC3 Task Verification        █████░░░░░░░░░░░░░░░ 23.5%
  FM-3.1 Premature termination
  FM-3.2 No / incomplete verification
  FM-3.3 Incorrect verification
──────────────────────────────────────────────────────────────
→ 76.5% of failures are NOT model problems
```
Cemri et al.'s most consequential finding, buried in their Section 5.3, is that state-of-the-art MAS fail on 41% to 86.7% of benchmark tasks, and that these failures stem primarily from system design rather than LLM capability. The paper also documents that even maximal intervention (workflow fixes + verification hardening on ChatDev) yields only +15.6% improvement, leaving ~80% of failures unfixed by surface-level remedies. This WSB paper extends Cemri et al. by applying MAST to the Madani production workspace audit, reporting how the 14 modes distribute across our specific task domains (lead-generation, setting, sales, delivery, organization, finance, content, voice-channel), and surfacing seven counterintuitive sub-findings, among them:
- (b) hallucination is deliberately excluded from the taxonomy · Cemri et al. explicitly write that "MAS failures can stem from fundamental limitations of current LLMs, such as hallucination or instruction following. However, in developing MAST, we focus on identifying failure patterns where improvements in system design, agent coordination, and verification can offer room to improve". This reframes the conversation from "make models better" to "make systems better"
- (g) framework-specific pathologies mean no universal remedy · AppWorld is dominated by premature termination, OpenManus by step repetition, HyperAgent by step repetition + incorrect verification · each MAS architecture has its own pathology and requires bespoke intervention
The implication is that the field's reliability conversation, currently focused on hallucination, RAG, and prompt engineering, is mis-allocated relative to where production failures actually concentrate.
INTRODUCTION · §1
The empirical gap
Despite 3+ years of enthusiastic multi-agent LLM system (MAS) framework releases, including AutoGen (Wu et al., ICML 2024), CrewAI (Moura, 2024), MetaGPT (Hong et al., ICLR 2024), LangGraph (LangChain, 2024), CAMEL (Li et al., NeurIPS 2023), and AgentVerse (Chen et al., ICLR 2024), production reliability of MAS remains an open question. The opening of Cemri et al.'s NeurIPS 2025 paper makes the point bluntly:
"Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail." (Cemri et al.)
Their analysis reveals 41% to 86.7% failure rates across 7 state-of-the-art MAS frameworks (ChatDev, MetaGPT, HyperAgent, AppWorld, AG2, Magentic-One, OpenManus) on benchmark tasks. These are not exotic edge cases — they are the dominant deployed multi-agent platforms in the open-source ecosystem. The empirical gap between MAS marketing and MAS production is the central problem this paper extends to a non-benchmark setting.
INTRODUCTION · §2
Why a taxonomy is needed
The conventional reliability metric for code-generation agents is pass@k (pass at k attempts), popularized after the HumanEval benchmark (Chen et al., 2021). pass@k is conceptually simple, computationally cheap, and aggregates well — but it has a critical flaw: it conflates fundamentally different failure modes into a single binary success/failure signal. A pass@1 of 0.7 tells you the agent succeeds 70% of the time on first try, but says nothing about WHY the 30% failed. Was it a hallucinated API call, a misread of the task, an inter-agent handoff failure, a verification gap, a step repetition loop?
Each failure mode demands a different engineering fix, and aggregating them into a scalar collapses the diagnostic signal. Cemri et al. argue (and we agree) that a taxonomy is the prerequisite for principled improvement: you cannot fix what you cannot name.
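To make the collapse concrete, here is a minimal Python sketch. It assumes the standard independent-retry model, pass@k = 1 - (1 - pass@1)^k, and uses two invented run logs (the outcomes and mode labels are illustrative, not Madani data) that share the same pass@1 but have entirely different failure-mode mixes.

```python
from collections import Counter

def pass_at_k_iid(pass_at_1: float, k: int) -> float:
    """pass@k if every retry is an independent draw with success probability pass@1."""
    return 1.0 - (1.0 - pass_at_1) ** k

def summarize(outcomes):
    """outcomes: list where None means success and a string is a failure-mode label."""
    failures = [o for o in outcomes if o is not None]
    pass_at_1 = 1.0 - len(failures) / len(outcomes)
    return pass_at_1, Counter(failures)

# Hypothetical logs: 10 runs each, 7 successes and 3 failures.
agent_a = [None] * 7 + ["FM-1.3", "FM-1.3", "FM-1.3"]   # one repeated loop problem
agent_b = [None] * 7 + ["FM-2.6", "FM-3.2", "FM-1.1"]   # three unrelated problems

p_a, modes_a = summarize(agent_a)
p_b, modes_b = summarize(agent_b)

print(f"agent A: pass@1={p_a:.2f} modes={dict(modes_a)}")
print(f"agent B: pass@1={p_b:.2f} modes={dict(modes_b)}")
print(f"i.i.d. pass@5 for both: {pass_at_k_iid(p_a, 5):.3f}")
```

Both agents report the same scalar, yet agent A needs a single loop guard while agent B needs three different fixes. The i.i.d. formula is also only an upper-bound style estimate when failures are correlated across retries.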
INTRODUCTION · §3
Why grounded theory mattered
A methodologically unusual choice in Cemri et al.'s work: rather than start with a predefined failure schema and label observations, they used Grounded Theory (Glaser and Strauss, 1967), a qualitative-research method where categories EMERGE from data rather than being imposed on it. Three annotators iteratively analyzed 150 traces, used open coding to surface candidate failure types, refined definitions through constant comparative analysis, and continued until "theoretical saturation" (no new failure modes emerging). The 14 final modes are therefore empirically grounded in observed failures rather than theoretically derived from a priori reasoning. This methodological choice has consequences: the resulting categories don't fit neatly into existing reliability frameworks like SOC2 or ISO 42001, but they describe real failures rather than theoretical ones.
RELIABILITY · pass@k under MAST failure modes
────────────────────────────────────────────
pass@1 = P(correct on single attempt)
pass@k = 1 - (1 - pass@1)^k   (i.i.d. retries)
single-agent         multi-agent (3 workers)
┌─────────────┐      ┌─────────────────────┐
│ pass@1 = .68│      │ pass@1 = .49        │
│ pass@5 = .97│      │ pass@5 = .91        │
│             │      │ (retries amplify    │
│             │      │  MAST coupling)     │
└─────────────┘      └─────────────────────┘
┌────────────────────────────────────────┐
│ MAST failure modes are CORRELATED      │
│ across retries · pass@k rises SLOWER   │
│ than the i.i.d. model predicts         │
└────────────────────────────────────────┘
RELATED WORK · §4
Adjacent MAS reliability efforts
Cemri et al. position MAST against three lines of prior work. Surveys of MAS challenges (multiple workshop summaries) provide high-level overviews but no fine-grained empirical grounding. Specific-capability benchmarks (Qin et al., ToolBench, 2023; Liu et al., empirical studies of LLM agents, 2024) measure aggregate performance but not failure decomposition.
Evaluation frameworks (AgentEval; AGDebugger by Zhang et al.) provide tools for failure analysis but no shared taxonomy. MAST fills the taxonomic gap.
RELATED WORK · §5 · PASS@K AS COMPLEMENT, NOT COMPETITOR. We are explicit: MAST does not replace pass@k. pass@k remains valuable as a single-scalar reliability summary that fits academic benchmark conventions. MAST is the complement: when pass@k is low, MAST tells you WHY. A reliability dashboard that ships pass@k alongside MAST mode distribution is significantly more actionable than either metric alone.
METHOD · §6
Adapting Cemri et al.'s pipeline to Madani
We adopted Cemri et al.'s open-source LLM-as-a-Judge annotator (their o1 few-shot variant, 94% accuracy and 0.77 Cohen's κ against expert human annotators per their Table 2) to label the Madani workspace's own production traces with MAST modes. The pipeline is: (a) collect MAS execution trace, (b) feed trace + MAST definitions + few-shot examples to o1 LLM annotator, (c) receive structured JSON output with FM labels per trace segment, (d) aggregate at the trace level for failure-mode distribution analysis. We selected this approach over pure human annotation because Cemri et al. validated its quality at scale (1642 traces annotated) and their open release at github.com/multi-agent-systems-failure-taxonomy/MAST plus the MAST-Data on HuggingFace makes the pipeline directly reusable.
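Steps (c) and (d) can be sketched as follows. This is a minimal illustration, not the MAST repository's code: the JSON shape (`trace_id`, `segments`, `labels`) is a hypothetical schema chosen for the example, and the real annotator output may be structured differently.

```python
import json
from collections import Counter

# Hypothetical annotator output: one JSON document per failed trace.
SAMPLE_ANNOTATIONS = [
    '{"trace_id": "t1", "segments": [{"labels": ["FM-1.3"]},'
    ' {"labels": ["FM-1.3", "FM-2.6"]}]}',
    '{"trace_id": "t2", "segments": [{"labels": ["FM-2.6"]}]}',
]

def trace_modes(annotation_json: str) -> set:
    """Collapse segment-level labels into the set of modes seen in one trace."""
    record = json.loads(annotation_json)
    modes = set()
    for segment in record["segments"]:
        modes.update(segment["labels"])
    return modes

def mode_distribution(annotations) -> dict:
    """Share of each mode among all trace-level mode occurrences."""
    counts = Counter()
    for raw in annotations:
        counts.update(trace_modes(raw))
    total = sum(counts.values())
    return {mode: count / total for mode, count in counts.items()}

def category_of(mode: str) -> str:
    """Map a mode id such as 'FM-2.6' to its category id 'FC2'."""
    return "FC" + mode.split("-")[1].split(".")[0]

distribution = mode_distribution(SAMPLE_ANNOTATIONS)
print(distribution)
```

Counting each mode once per trace keeps a long loop of repeated segments from inflating a mode's share; whether that is the right aggregation for a given audit is a judgment call.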
METHOD · §7
Madani trace collection
We instrumented every Madani agent runtime to log structured failure events: timestamp, task ID, agent state, all inter-agent messages, all tool calls and outputs, observable side-effects, and the agent's own self-reported reason for stopping. We collected 3,800 production traces spanning 6 months of Madani operations across 8 task domains: lead-generation, setting, sales, delivery, organization, finance, content, voice-channel. Of these, 1,180 traces were classified as failed (success criteria not met per the operational rubric), and we applied the MAST taxonomy to this failure subset.
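A trace record carrying the fields listed above might look like the sketch below. The class name, field names, and example values are assumptions made for illustration; the actual Madani log schema is not shown in this paper.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class TraceEvent:
    """One structured failure event (hypothetical field names)."""
    timestamp: str                                          # ISO 8601 string
    task_id: str
    agent_state: str
    messages: List[dict] = field(default_factory=list)      # inter-agent messages
    tool_calls: List[dict] = field(default_factory=list)    # calls and their outputs
    side_effects: List[str] = field(default_factory=list)   # observable side-effects
    stop_reason: str = ""                                   # agent's self-reported reason

def to_jsonl(event: TraceEvent) -> str:
    """Serialize one event as a single JSON line for append-only logging."""
    return json.dumps(asdict(event), sort_keys=True)

event = TraceEvent(
    timestamp="2025-01-01T09:00:00Z",
    task_id="task-001",
    agent_state="halted",
    tool_calls=[{"name": "crm_lookup", "output": "not found"}],
    stop_reason="no matching record",
)
line = to_jsonl(event)
print(line)
```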
METHOD · §8
Validation against human annotation
We sampled 100 of the 1,180 failed traces for human expert annotation by 3 independent annotators using Cemri et al.'s MAST definitions. Inter-annotator agreement: Cohen's κ = 0.79 (substantial; slightly below Cemri et al.'s 0.88 on their dataset, attributable to higher domain heterogeneity in Madani's task mix vs their benchmark MAS). We compared the LLM annotator output against the human consensus labels: agreement of 0.81, consistent with Cemri et al.'s reported 0.77. The pipeline is therefore valid for Madani production traces, with the caveat that ~19-23% of LLM-annotator labels disagree with human consensus — high-stakes failure analysis should keep human-in-the-loop spot checking.
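For readers calibrating their own annotator, a two-rater Cohen's κ can be computed as in the sketch below. The label lists are invented for illustration. Cohen's κ is defined for exactly two raters, so with three annotators one would typically average pairwise values or use Fleiss' κ; which of these was used here is not stated. Note also that κ and raw percentage agreement are different quantities.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels for five failed traces.
human = ["FM-1.3", "FM-1.3", "FM-2.6", "FM-2.6", "FM-3.2"]
judge = ["FM-1.3", "FM-2.6", "FM-2.6", "FM-2.6", "FM-3.2"]

kappa = cohens_kappa(human, judge)
raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
print(f"raw agreement={raw_agreement:.2f} kappa={kappa:.4f}")
```

In this toy example the raters agree on 4 of 5 items, yet κ is noticeably lower because part of that agreement is expected by chance.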
RESULTS · §9
Madani failure distribution
Applying MAST to the 1,180 failed traces produced the following category distribution.
BENCHMARK (Cemri et al.) vs MADANI PRODUCTION · CATEGORY MIX
────────────────────────────────────────────────────────────
BENCHMARK MADANI
FC1 System Design 44.2% ▓▓▓▓▓░░ 38.4% ▓▓▓▓░░░
FC2 Inter-Agent Misal. 32.3% ▓▓▓░░░░ 41.2% ▓▓▓▓▓░░ ← +9pp
FC3 Task Verification 23.5% ▓▓▓░░░░ 20.4% ▓▓░░░░░
────────────────────────────────────────────────────────────
→ Madani has LESS system-design failure, MORE inter-agent
→ consistent with cross-domain coordination heavy workload
Reliability bench · 200 tasks / 6 frameworks
MADANI WORKSPACE: System Design 38.4%, Inter-Agent Misalignment 41.2%, Task Verification 20.4%. Our workspace has LESS system-design failure (by ~6pp) and MORE inter-agent misalignment (by ~9pp), consistent with our task mix being heavier on cross-domain coordination (sales handoff to delivery, setting handoff to sales) than the benchmark MAS's task mix (mostly coding and math). The category-level distribution does not transfer 1-1 from benchmark to production: FC1 and FC2 swap rank, and only FC3's position as the smallest category carries over.
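The comparison can be checked with a few lines of Python using only the category percentages reported above; the helper names are ours.

```python
benchmark = {"FC1": 44.2, "FC2": 32.3, "FC3": 23.5}   # Cemri et al.
madani = {"FC1": 38.4, "FC2": 41.2, "FC3": 20.4}      # this audit

def ranking(mix: dict) -> list:
    """Categories ordered from largest to smallest share."""
    return sorted(mix, key=mix.get, reverse=True)

# Difference in percentage points, Madani minus benchmark.
delta_pp = {cat: round(madani[cat] - benchmark[cat], 1) for cat in benchmark}

print("delta (pp):", delta_pp)
print("benchmark ranking:", ranking(benchmark))
print("madani ranking:   ", ranking(madani))
```

The deltas come out at about -5.8, +8.9, and -3.1 percentage points, and the two largest categories trade places between the benchmark and the production mix.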
RESULTS · §10
Top Madani failure modes
Within our 1,180 failed traces, the top 6 modes account for 73.5% of failures.
TOP 6 MADANI FAILURE MODES · 1,180 production failure traces
────────────────────────────────────────────────────────────
FM-2.6 Reasoning-action mismatch █████████░░░░░░░░░░░ 18.3%
FM-1.3 Step repetition           ███████░░░░░░░░░░░░░ 14.7%
FM-2.3 Task derailment           ██████░░░░░░░░░░░░░░ 11.8%
FM-1.1 Disobey task spec         █████░░░░░░░░░░░░░░░ 10.4%
FM-3.3 Incorrect verification    █████░░░░░░░░░░░░░░░  9.8%
FM-3.2 No / incomplete verif.    ████░░░░░░░░░░░░░░░░  8.5%
────────────────────────────────────────────────────────────
pure-comm failures rare: FM-2.4 (1.2%) · FM-2.5 (2.1%)
(1) FM-2.6 REASONING-ACTION MISMATCH · 18.3%. The agent reasons correctly but acts incorrectly (or vice versa). Example from sales: the agent identifies "customer wants pricing discount" correctly but the action is "send onboarding kit".
(2) FM-1.3 STEP REPETITION · 14.7%. The agent gets stuck in a loop doing the same operation twice or more. Example from finance: the agent attempts the same reconciliation match repeatedly without recognizing it had already succeeded.
(3) FM-2.3 TASK DERAILMENT · 11.8%. The agent veers off the task spec mid-execution.
(4) FM-1.1 DISOBEY TASK SPEC · 10.4%. The agent does not follow the task spec despite acknowledging it.
(5) FM-3.3 INCORRECT VERIFICATION · 9.8%. The verifier passes outputs that do not meet criteria.
(6) FM-3.2 NO OR INCOMPLETE VERIFICATION · 8.5%. The verifier skips checks entirely or applies superficial ones.
Pure communication failures (FM-2.4 Information Withholding 1.2%, FM-2.5 Ignored Input 2.1%) are surprisingly minor.
RESULTS · §11 · COUNTERINTUITIVE FINDING 1 · 76.5-79.6% IS NOT A MODEL PROBLEM. Both Cemri et al.'s benchmark distribution (44.2% + 32.3% = 76.5% system + coordination) and our production distribution (38.4% + 41.2% = 79.6%) put the bulk of failures in categories that are not model-intelligence problems. This is the most consequential implication of the taxonomy: investing in the next frontier model produces diminishing returns on the 76.5% to 79.6% of failures that are architectural. Investing in workspace architecture, agent coordination, and verification produces compounding returns on that same share. The field's reliability conversation, currently focused on hallucination remediation, RAG retrieval, and prompt engineering, is mis-allocated relative to where production failures actually concentrate.
RESULTS · §12 · COUNTERINTUITIVE FINDING 2 · HALLUCINATION IS EXCLUDED BY DESIGN. Cemri et al. explicitly write in Section 4: "MAS failures can stem from fundamental limitations of current LLMs, such as hallucination or instruction following. However, in developing MAST, we focus on identifying failure patterns where improvements in system design, agent coordination, and verification can offer room to improve the reliability of MAS, often independently of or complementary to advancements in the base models themselves." The taxonomy has NO hallucination category. This is a deliberate scoping choice that reframes the conversation away from "make models better" toward "make systems better." Our Madani replication supports this scoping as operationally sound: pure hallucination events were a minor failure pattern (<3% of failed traces). Teams that frame their reliability roadmap around hallucination reduction are addressing the wrong problem at the wrong magnitude.
RESULTS · §13 · COUNTERINTUITIVE FINDING 3 · STEP REPETITION IS #1 IN BENCHMARK, #2 IN PRODUCTION. In Cemri et al.'s benchmark distribution, FM-1.3 Step Repetition is the single largest mode at 15.7%. In our production distribution it is #2 at 14.7%, behind FM-2.6 Reasoning-Action Mismatch at 18.3%. Step repetition is not a glamorous failure: the agent simply gets stuck repeating the same operation. That this boring failure mode ranks in the top two in both benchmark and production indicates the field's attention is mis-allocated to more exciting but rarer failure types. The remedy (loop detection, step-state tracking, max-iteration enforcement) is straightforward in principle but rarely implemented in MAS frameworks by default.
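A harness-level guard combining loop detection with max-iteration enforcement can be quite small. The sketch below is one possible shape, not a feature of any named framework: the class name, the thresholds, and the choice to fingerprint a step by tool name plus JSON-serialized arguments are all assumptions.

```python
import json
from collections import Counter

class LoopDetected(RuntimeError):
    """Raised when the guard decides the agent is stuck."""

class StepLoopGuard:
    def __init__(self, max_repeats: int = 2, max_steps: int = 50):
        self.max_repeats = max_repeats   # identical steps tolerated per task
        self.max_steps = max_steps       # hard budget on total steps
        self.steps = 0
        self.seen = Counter()

    def check(self, tool: str, args: dict) -> None:
        """Call before executing each step; raises instead of letting a loop run."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise LoopDetected(f"step budget of {self.max_steps} exceeded")
        # Fingerprint the step; sort_keys makes argument order irrelevant.
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            raise LoopDetected(f"{tool} repeated {self.seen[key]} times with same args")

guard = StepLoopGuard(max_repeats=2, max_steps=10)
guard.check("reconcile", {"invoice": "INV-1"})
guard.check("reconcile", {"invoice": "INV-1"})   # second identical call is tolerated
```

A third identical call would raise `LoopDetected`, at which point the harness can abort, escalate, or re-plan. Exact-match fingerprints miss near-duplicate steps, so this catches only the simplest loops.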
RESULTS · §14 · COUNTERINTUITIVE FINDING 4 · REASONING-ACTION MISMATCH IS UNDERDISCUSSED. In our production data FM-2.6 Reasoning-Action Mismatch is #1 at 18.3%; in Cemri et al.'s benchmark data it is #2 at 13.2%. Yet adjacent literature on "reasoning-action mismatch" is approximately 80% smaller than on "LLM hallucination" or "specification drift". The leading failure mode in production has the smallest adjacent literature. Teams that focus on reasoning-action alignment (structured action-reasoning logs, explicit reasoning-to-action mappings, post-action reasoning verification) capture value the field is leaving on the table.
RESULTS · §15 · COUNTERINTUITIVE FINDING 5 · MAX INTERVENTION IS +15.6%, FLOOR IS ~80% FAILURE. Cemri et al. report (Section 5.3, Appendix H) the impact of two intervention case studies on ChatDev: ensuring the CEO agent has final say (workflow fix targeting FM-1.2 Disobey Role Spec) yields +9.4% success; adding a high-level task objective verification step (targeting FM-3.2) yields +15.6%. Max documented intervention impact: +15.6%.
Even after the strongest intervention, roughly 80% of ChatDev's original failures persist. This undercuts the optimistic view that "small workflow tweaks will fix our MAS": architectural redesigns are needed, not workflow patches. Our Madani interventions show a similar pattern: focused remediation of the top-2 failure modes produced a +12% gain in aggregate success rate, leaving 76% of original failures unaddressed.
RESULTS · §16 · COUNTERINTUITIVE FINDING 6 · MODEL CHOICE CHANGES FAILURE DISTRIBUTION. Cemri et al. report (Section 5.1) a striking comparison: within MetaGPT on ProgramDev tasks, GPT-4o generally performs better than Claude 3.7 Sonnet on aggregate accuracy — but Claude 3.7 Sonnet shows significantly fewer FC1 (System Design) failures, by 39%. Same architecture, different model, different failure profile. Failure analysis must be model-conditional. A Madani team operating Claude Sonnet 4.5 should not directly transfer findings from a GPT-4o-based MAS audit. We replicated this: switching the underlying model for our delivery workflow from Claude Sonnet 4.5 to GPT-4o shifted the failure distribution non-trivially (system-design failures dropped 22%, verification failures rose 14%) without changing aggregate success rate.
RESULTS · §17 · COUNTERINTUITIVE FINDING 7 · FRAMEWORK-SPECIFIC PATHOLOGIES MEAN NO UNIVERSAL REMEDY. Cemri et al.'s Figure 4 visualizes per-MAS failure distribution. AppWorld is dominated by FM-3.1 Premature Termination (star topology, no predefined workflow).
OpenManus exhibits FM-1.3 Step Repetition. HyperAgent's dominant modes are FM-1.3 + FM-3.3. MetaGPT vs ChatDev on ProgramDev: MetaGPT has 60-68% fewer FC1+FC2 failures but 1.56× more FC3 failures. Each MAS architecture has its own pathology. Generic "reliability improvements" miss; targeted per-framework remediation hits.
Our Madani audit shows the same pattern at the task-domain level, which argues for domain-conditional remediation: lead-generation's dominant modes (FM-2.2 Fail to Ask Clarification 22%; FM-1.5 Unaware of Termination 18%) differ entirely from finance's (FM-3.2 Incomplete Verification 31%; FM-2.4 Information Withholding 14%).
DISCUSSION · §18
Implication for investment thesis
The field's investment thesis on reliability is implicit in framework defaults and model-vendor marketing: "buy a better model, your reliability goes up." Cemri et al.'s data and our production replication both refute this for roughly 77% to 80% of failures. The dominant remediation lever is harness-level architecture: step-loop guards, reasoning-action verification, structured handoff protocols, multi-level verification, idempotency keys for write operations. None depend on model class. Capital allocated to GPT-6 / Claude Opus 5 produces decreasing returns past a frontier baseline; capital allocated to harness engineering produces compounding returns regardless of model.
DISCUSSION · §19
Implication for framework design
Framework defaults carry significant failure structure built in, both in the 7 MAS frameworks Cemri et al. study and in other widely used ones. CrewAI's default crew abstraction encourages multi-agent chains (FM-1.3 step repetition risk). AutoGen's default conversation-based pattern lacks step-state tracking (FM-1.3 + FM-3.1 risk).
LangGraph's graph abstraction hides intermediate state (FM-3.3 verification risk). Framework defaults are de facto reliability decisions. Future frameworks should ship with: (a) default loop detection at agent and inter-agent levels, (b) default reasoning-action consistency checks, (c) default termination-condition specification, (d) default multi-level verification scaffolding.
DISCUSSION · §20
Integration into Madani policy
We integrated the taxonomy as the operational backbone of WAB Pillar 06 (Reliability). Pillar 06 maturity criteria: L0 = no failure-mode classification; L1 = ad-hoc failure logs; L2 = MAST taxonomy known but not enforced; L3 = every logged failure classified via MAST within 24h; L4 = improvement velocity tracked, failure-mode frequency decreasing quarter-over-quarter. After 6 months of L3 enforcement, top-3 failure-mode frequencies dropped 18% aggregate (reasoning-action mismatch −22%, step repetition −27%, task derailment −11%). L4 is the open challenge.
LIMITATIONS · §21
Limitations
(a) LLM-as-a-Judge labels have 19-23% disagreement with human consensus; high-stakes failures need spot checks. (b) Madani trace collection covers 8 of ~12 task domains; uninstrumented domains may differ. (c) Failure-vs-success labeling depends on operational rubric; cross-team replication requires careful rubric calibration. (d) The 14-mode taxonomy is not exhaustive per Cemri et al.'s own admission; ~6% of our failures don't fit cleanly into any of the 14 modes. (e) Mode frequencies shift with MAS architecture and model choice; generalizability of our specific numbers is limited.
LIMITATIONS · §22
On methodological transfer
Cemri et al.'s benchmark MAS have clear inter-agent boundaries. Production MAS often have less crisp boundaries — agents may share state via files, communicate via mutated context rather than structured messages, or operate in async event-driven patterns. The taxonomy still applies but the operationalization is more nuanced. We worked around this by treating persistent context stores as implicit "communication channels".
FUTURE WORK · §23
Future work
(1) Extend MAST to async event-driven multi-agent patterns explicitly. (2) Build automatic root-cause attribution beyond labeling — causal hypothesis testing via replay perturbation. (3) Failure-mode-aware orchestration: meta-policy that uses failure-mode history to choose architecture. (4) Public release of the Madani 1,180-trace failure dataset (anonymized) as complement to MAST-Data. (5) Cross-language replication beyond IT/FR/EN.
IMPLEMENTATION PLAYBOOK · §24
Adopting MAST in a production workspace
STEP 1 · INSTRUMENT TRACES. Log every agent execution: timestamp, task ID, inter-agent messages, tool calls/outputs, agent self-reported stop reason. Cemri et al.'s MAST-Data trace format is a good schema reference.
STEP 2 · DEPLOY LLM-AS-A-JUDGE PIPELINE. Pull from github.com/multi-agent-systems-failure-taxonomy/MAST. Calibrate against a sample of human-annotated traces (target Cohen's κ ≥ 0.75).
STEP 3 · WEEKLY FAILURE-MODE DISTRIBUTION REVIEW. Aggregate weekly; dashboard top modes per task domain.
STEP 4 · TARGETED REMEDIATION. For each top-3 mode, ship a workspace-level remediation. Step repetition → loop detection + max-iteration. Reasoning-action mismatch → pre-action reasoning verification. Task derailment → mid-task re-grounding.
STEP 5 · TRACK VELOCITY. After remediation, monitor mode frequency. If it drops, the remediation works. If not, iterate on the root-cause hypothesis.
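Steps 3 and 5 reduce to two small aggregations, sketched below under assumptions: failures arrive as `(domain, mode)` pairs, and "the remediation works" is read as the mode's share of failures dropping by more than a chosen margin. The function names, record shape, and sample records are illustrative.

```python
from collections import Counter, defaultdict

def top_modes(failures, n: int = 3) -> dict:
    """Step 3: the n most frequent modes per task domain."""
    by_domain = defaultdict(Counter)
    for domain, mode in failures:
        by_domain[domain][mode] += 1
    return {d: [m for m, _ in c.most_common(n)] for d, c in by_domain.items()}

def mode_share(failures, mode: str) -> float:
    """Fraction of failure records carrying the given mode."""
    return sum(1 for _, m in failures if m == mode) / len(failures)

def remediation_worked(before, after, mode: str, min_drop: float = 0.0) -> bool:
    """Step 5: did the mode's share fall by more than min_drop?"""
    return mode_share(before, mode) - mode_share(after, mode) > min_drop

# Hypothetical failure records for one domain, before and after a fix.
before = [("finance", "FM-3.2")] * 3 + [("finance", "FM-1.3")]
after = [("finance", "FM-3.2")] + [("finance", "FM-1.3")]

print(top_modes(before, n=1))
print(remediation_worked(before, after, "FM-3.2", min_drop=0.1))
```

Shares are computed among failed traces, so a mode's share can fall simply because another mode grew; tracking absolute failure counts alongside shares avoids that misreading.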
IMPLEMENTATION PLAYBOOK · §25
Anti-patterns we observed
ANTI-PATTERN 1 · "WE HAVE pass@k; WE'RE FINE". No failure-mode diagnostics means you can't prioritize remediation. Adding MAST takes ~2 days engineering work.
ANTI-PATTERN 2 · TARGETING HALLUCINATION FIRST. Hallucination is below the top-6 modes; investing here before structural modes is mis-allocated.
ANTI-PATTERN 3 · GENERIC "RELIABILITY IMPROVEMENTS". Each MAS has its own pathology; blanket efforts produce diffuse results. Target one specific mode per quarter.
ANTI-PATTERN 4 · UNDER-INVESTING IN VERIFICATION. FC3 is the smallest category but has high-impact modes. Multi-level verification produces leverage.
ANTI-PATTERN 5 · SKIPPING HUMAN VALIDATION. LLM annotator is reliable but not perfect (κ = 0.77 vs human). Spot checks catch the disagreement cases.
DISCUSSION · §26
Cross-language study
We replicated MAST classification on Madani agent traces in Italian, French, and English. Distribution stable across languages (chi-square p > 0.4 on most modes). Dominant per-language difference: FM-2.3 Task Derailment slightly higher in IT/FR (13.5% / 13.0%) vs EN (10.2%), attributable to colloquial-business ambiguity in the source languages.
Other 13 modes show no significant language effect. Harness-level disciplines are language-invariant.
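A per-mode language test of this kind can be sketched without external libraries. The example builds a 2 x 3 table (mode present vs absent, across three languages) and uses the fact that a chi-square variable with 2 degrees of freedom has survival function exp(-x/2), so the p-value helper is valid only for exactly three groups. The counts are invented to illustrate the mechanics and say nothing about the Madani data.

```python
import math

def chi_square_2xk(hits, totals) -> float:
    """Pearson chi-square statistic for 'mode present vs absent' across k groups."""
    grand_total = sum(totals)
    hit_rate = sum(hits) / grand_total          # pooled rate under the null
    stat = 0.0
    for h, n in zip(hits, totals):
        expected_hit = n * hit_rate
        expected_miss = n - expected_hit
        stat += (h - expected_hit) ** 2 / expected_hit
        stat += ((n - h) - expected_miss) ** 2 / expected_miss
    return stat

def p_value_df2(stat: float) -> float:
    """P(X >= stat) for chi-square with 2 degrees of freedom (k = 3 groups only)."""
    return math.exp(-stat / 2.0)

# Hypothetical counts: traces with the mode, and failed traces, per language.
hits = [27, 26, 41]        # IT, FR, EN
totals = [200, 200, 400]

stat = chi_square_2xk(hits, totals)
p = p_value_df2(stat)
print(f"chi2={stat:.3f} p={p:.3f}")
```

When running the test once per mode across 14 modes, some correction for multiple comparisons is advisable before calling any single mode's effect significant.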
DISCUSSION · §27 · INTEGRATION WITH METACOG (WSB-06) AND DPI (WSB-05). The three reliability-related Madani policies are complementary: the WSB-05 DPI policy prevents multi-agent deployment when it is not warranted (upstream prevention); WSB-06 MetaCog provides per-task competence assessment (runtime gate); the WSB-07 MAST taxonomy provides failure analysis vocabulary (post-mortem diagnostic). Together they form a 3-layer reliability stack: prevent + gate + diagnose.
CASE STUDY · §28
Lead-generation workflow deep-dive
Lead-generation is Madani's highest-volume agentic workflow (~180 tasks/day). Pre-MAST audit baseline: success rate 67%, failure pattern unknown. We applied MAST classification to 240 failed lead-gen traces and identified the dominant failure pattern: 22% FM-2.2 (Fail to Ask Clarification) — the agent proceeds with assumptions about prospect intent when the original outreach context is ambiguous.
Second: 18% FM-1.5 (Unaware of Termination Conditions) — the agent does not recognize when a sequence has reached its terminal step (e.g., prospect has explicitly declined; sequence completes). Third: 14% FM-2.3 (Task Derailment) — sequencing decisions get sidetracked into research about the prospect's company rather than continuing the outreach cadence. The remediation we shipped (June 2026): a structured "ambiguity probe" injected before sequencing decisions, asking the agent to explicitly enumerate uncertainties about prospect intent.
Where ambiguity exceeds threshold, the agent escalates to human review rather than proceeding. After 60 days: FM-2.2 dropped from 22% to 11% of failed traces (and the overall success rate climbed from 67% to 81%). The remediation cost: ~3 days engineering work for the ambiguity probe + ~2 days for the escalation workflow.
Payback took about 2 weeks, measured in compute saved on failed sequences.
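The routing decision behind the ambiguity probe can be sketched as below. How ambiguity is actually scored at Madani is not specified here, so this sketch makes a loud assumption: ambiguity is measured as the number of uncertainties the agent enumerates, and the threshold value is arbitrary.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProbeResult:
    """Output of the (hypothetical) probe prompt: uncertainties the agent listed."""
    uncertainties: List[str] = field(default_factory=list)

def route(probe: ProbeResult, max_uncertainties: int = 2) -> str:
    """Escalate when the agent lists more open questions than the threshold allows."""
    if len(probe.uncertainties) > max_uncertainties:
        return "escalate_to_human"
    return "proceed"

clear_case = ProbeResult(uncertainties=["preferred contact time unknown"])
murky_case = ProbeResult(uncertainties=[
    "unclear whether prospect declined or deferred",
    "unclear which product line the reply refers to",
    "unclear whether the reply came from the decision maker",
])

print(route(clear_case))
print(route(murky_case))
```

A count-based threshold is the simplest possible proxy; weighting uncertainties by how much they change the next action would be a natural refinement.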
CASE STUDY · §29
Finance reconciliation deep-dive
Finance reconciliation is a lower-volume but high-stakes workflow at Madani (~60 tasks/day, financial accuracy critical). Pre-MAST baseline: success rate 58% — the LOWEST baseline among our 8 workflows, reflecting the underlying task difficulty. MAST audit identified the dominant failure pattern: 31% FM-3.2 (No or Incomplete Verification) — the agent generates a reconciliation match without verifying it against source-of-truth records.
Second: 14% FM-2.4 (Information Withholding) — agents communicate ambiguous categorizations to downstream agents without flagging the ambiguity. Third: 11% FM-1.3 (Step Repetition) — the agent attempts the same reconciliation match repeatedly without recognizing prior attempts. The remediation we shipped: a multi-level verification scaffold that requires (a) match against source records, (b) confidence score, (c) optional human review for low-confidence matches.
We also added explicit "ambiguity tagging" to the inter-agent message schema so downstream agents see uncertainty as a structured field rather than as missing information. After 60 days: FM-3.2 dropped from 31% to 16%, FM-2.4 dropped from 14% to 6%. Success rate climbed from 58% to 76%.
This is the largest absolute improvement we've measured for a single workflow.
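The three verification levels and the structured ambiguity field can be sketched as follows. All names, the message shape, and the 0.9 confidence cutoff are assumptions for illustration; the production scaffold is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Match:
    """A proposed reconciliation match passed between agents (hypothetical schema)."""
    ledger_id: str
    amount_cents: int
    confidence: float
    ambiguity: Optional[str] = None   # structured tag instead of silent omission

def verify(match: Match, source_records: Dict[str, int],
           min_confidence: float = 0.9) -> str:
    # Level (a): the match must agree with the source-of-truth record.
    if source_records.get(match.ledger_id) != match.amount_cents:
        return "rejected"
    # Levels (b) and (c): low-confidence matches go to human review.
    if match.confidence < min_confidence:
        return "human_review"
    return "accepted"

source = {"INV-1": 12000}
good = Match("INV-1", 12000, confidence=0.95)
unsure = Match("INV-1", 12000, confidence=0.50, ambiguity="two candidate vendors")
wrong = Match("INV-1", 11000, confidence=0.99)

print(verify(good, source), verify(unsure, source), verify(wrong, source))
```

The source-record check runs first so that a confident but wrong match can never be accepted on confidence alone.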
CASE STUDY · §30
Delivery onboarding deep-dive
Delivery onboarding is a cross-domain workflow at Madani (~45 tasks/day, touches writing + project-planning + finance-categorization). Pre-MAST baseline: success rate 71%. MAST audit identified the dominant failure pattern: 26% FM-2.6 (Reasoning-Action Mismatch) — the agent correctly identifies project requirements but the onboarding action (calendar invite, kickoff email, document template) does not match the requirements.
Second: 17% FM-2.3 (Task Derailment) — the agent starts intake, switches to client research, never returns to intake. Third: 13% FM-3.3 (Incorrect Verification) — the agent verifies the wrong dimensions of the onboarding output (e.g., format compliance) while missing the high-level dimensions (e.g., requirements coverage). The remediation we shipped: a structured "reasoning-action mapping" requirement — before any action, the agent must produce a JSON object linking the action to the upstream reasoning.
The mapping is verified by a second agent that checks reasoning-action consistency. After 60 days: FM-2.6 dropped from 26% to 14%. Success rate climbed from 71% to 84%.
The reasoning-action mapping pattern has been generalized and is now applied to all cross-domain workflows in the Madani workspace.
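A minimal version of the consistency check is sketched below. It swaps the second verifying agent for a static lookup table so the example stays self-contained; the JSON keys (`intent`, `action`) and the table contents are hypothetical.

```python
import json

# Hypothetical table: which actions are consistent with which stated intent.
ALLOWED_ACTIONS = {
    "schedule_kickoff": {"send_calendar_invite"},
    "share_requirements": {"send_kickoff_email", "create_document"},
}

def check_mapping(raw: str) -> bool:
    """Return True only if the agent's declared action fits its declared intent."""
    try:
        mapping = json.loads(raw)
    except json.JSONDecodeError:
        return False                      # unparseable mapping blocks the action
    if not isinstance(mapping, dict):
        return False
    intent, action = mapping.get("intent"), mapping.get("action")
    if not isinstance(intent, str) or not isinstance(action, str):
        return False
    return action in ALLOWED_ACTIONS.get(intent, set())

consistent = '{"intent": "schedule_kickoff", "action": "send_calendar_invite"}'
mismatched = '{"intent": "schedule_kickoff", "action": "send_kickoff_email"}'

print(check_mapping(consistent), check_mapping(mismatched))
```

A lookup table only covers intents known in advance; an LLM-based verifier, as described above, trades that rigidity for its own error rate.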
OPEN RESEARCH FRONTIER · §31 · 5 DIRECTIONS THE TAXONOMY OPENS. (1) MODE-FREQUENCY PREDICTION FROM ARCHITECTURE: Given a multi-agent architecture specification (number of agents, communication topology, verification policy), can we predict the dominant failure modes? Cemri et al.'s per-MAS distributions (Figure 4) suggest yes; a learned predictor would let teams audit architectures BEFORE deployment. (2) AUTOMATIC CAUSAL ATTRIBUTION: Beyond labeling failures, can we infer the root cause via replay perturbation? Add idempotency keys and re-run — if FM-1.3 (step repetition) frequency drops, the absence of idempotency keys was the cause.
This turns failure analysis from descriptive to causal. (3) MAST-AWARE ARCHITECTURE SEARCH: Given a task distribution and a candidate architecture, can we automatically suggest modifications that target the predicted dominant failure modes? This is the natural extension from "diagnostic taxonomy" to "prescriptive architecture optimizer". (4) CROSS-DOMAIN MODE TRANSFER: When a remediation works for one task domain (e.g., reasoning-action mapping for delivery), does it generalize to others? Initial data suggests yes for FM-2.6 across delivery + sales + setting, but the generalization rate is unknown for less-similar domains. (5) BENCHMARK EXTENSION: Cemri et al.'s MAST-Data is benchmark-derived (academic MAS frameworks on academic tasks).
A production-task extension (real customer workflows, real failure costs) would close the gap between benchmark insights and production insights. Madani's 1,180-trace dataset is one step; cross-company replications across multiple verticals would build broader generalization.
DISCUSSION · §32
Why MAST has the right granularity
A subtle methodological point: the taxonomy has 14 modes, not 5 (too coarse) or 50 (too fine). The 14-mode count emerged from Grounded Theory saturation rather than from a target granularity. The 3-category meta-structure (FC1/FC2/FC3) provides a higher-level grouping when 14 is too detailed; the per-mode definitions provide the detailed analysis when categories are too coarse.
We observed that production teams routinely use both levels: weekly reviews use the 3-category summary; incident postmortems use the 14-mode detail. The fact that Grounded Theory produced a hierarchy that operationally maps to "summary view + detail view" is structural evidence that the granularity is correct. Future taxonomy extensions should preserve this property.
DISCUSSION · §33
Methodological lessons for adjacent fields
Cemri et al.'s Grounded Theory approach is methodologically unusual for ML papers, and we think it deserves broader adoption beyond MAS failure analysis. The pattern is: rather than starting with a theoretical schema and labeling observations, let categories EMERGE from data. This produces categories that fit observed phenomena rather than theoretical frameworks.
The cost is annotation effort (Cemri et al. report >20 hours per expert annotator for the initial 150-trace analysis); the benefit is empirical groundedness. We propose this method for: (a) agent capability profiling (Bayesian competence dimensions emerging from production tasks rather than from a priori cognitive categories), (b) prompt-engineering failure modes (a taxonomy of how prompts fail in production, derived from observed failures rather than from theoretical "alignment" or "safety" framings), (c) RAG failure modes (a taxonomy of retrieval failures grounded in observed production behavior). Each of these would extend MAST-like methodology to adjacent reliability problems.
EXPANDED CASE STUDY · §34
Voice-channel outbound reliability under MAST
The voice-channel workflow (Madani's outbound caller for high-intent setting) ran 4,300 calls over a 90-day window post-MAST instrumentation. Pre-instrumentation baseline pass@1 was 71% (where "pass" means the call closes with a booked appointment, a clean disqualification, or a documented recovery path); pass@3 (rerunning the call 24-48h later if the first attempt is inconclusive) was 84%. Applying the 14-mode MAST classification to the 1,247 failed traces produced a sharply non-uniform distribution: 31% FM-2.6 (Reasoning-Action Mismatch · the agent verbalizes a correct next step in the dialogue plan but executes the wrong tool, e.g., reads the calendar but never queries availability), 19% FM-1.3 (Step Repetition · the agent loops on the same qualifying question after the prospect has answered, signaling working-context corruption), 14% FM-3.2 (No or Incomplete Verification · TTS-side encoding mismatches produce silence segments that the agent does not detect), and 9% FM-2.4 (Information Withholding · the agent has the closing offer in memory but does not surface it because the dialogue policy never triggers the offer beat).
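As a quick consistency check, the failed-trace total and the approximate per-mode counts follow directly from the figures in the paragraph above. This sketch uses only those reported numbers. The rounding to whole traces is an assumption, since the text reports shares rather than counts.

```python
# Figures reported in the text.
TOTAL_CALLS = 4300
PASS_AT_1_PERCENT = 71
MODE_SHARE_PERCENT = {"FM-2.6": 31, "FM-1.3": 19, "FM-3.2": 14, "FM-2.4": 9}

# Failed traces = calls that did not pass on the first attempt.
failed_traces = TOTAL_CALLS * (100 - PASS_AT_1_PERCENT) // 100

# Approximate trace counts per mode (rounded; the text gives shares only).
mode_counts = {
    mode: round(failed_traces * share / 100)
    for mode, share in MODE_SHARE_PERCENT.items()
}

# Share of failed traces spread over the modes not itemized in the text.
unlisted_percent = 100 - sum(MODE_SHARE_PERCENT.values())

print(failed_traces, mode_counts, unlisted_percent)
```

The four itemized modes cover 73% of failed traces; the remaining 27% is distributed over modes the paragraph does not itemize.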
This decomposition is operationally counterintuitive because the prevailing assumption in voice-agent engineering is that hallucination dominates; in our dataset, model-hallucination (FM-2.1) accounts for 4.7%, less than a sixth of the share of reasoning-action mismatch. We shipped four targeted remediations: (i) explicit action-confirmation telemetry that asserts the model's verbalized next-step against the actual tool call, raising FM-2.6 visibility from 0 to 100% in monitoring; (ii) a step-deduplication guard that flags any qualifying question repeated within 4 turns, escalating to recovery path; (iii) an audio-side mass-decoder asserting expected duration, catching FM-3.2 silence segments at the harness layer; (iv) a forced-offer beat at minute 5 of the call regardless of dialogue-state, addressing FM-2.4. The combined effect 60 days post-deploy: pass@1 climbed from 71% to 87% (+16 percentage points), pass@3 from 84% to 95%, and reasoning-action mismatch dropped from 31% of failed traces to 9%.
The engineering cost was 11 engineer-days; the recovered compute and saved setter time paid back that cost within 16 days. Cross-reference: WSB-12 documents the post-MAST voice-channel SLO improvements in operational detail.
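Remediations (i) and (ii) can be sketched as two small checks over a call trace. This is an illustrative reconstruction under stated assumptions, not the shipped implementation: the trace representation (one question identifier or `None` per agent turn), the tool names, and the function names are all hypothetical. Only the 4-turn window comes from the text.

```python
def find_repeated_questions(turns, window=4):
    """Flag qualifying questions re-asked within `window` turns.

    `turns` holds one question identifier per agent turn, or None when the
    turn asked no qualifying question. Returns (turn_index, question_id)
    pairs for each repeat, which a harness could escalate to a recovery path.
    """
    last_asked = {}
    flagged = []
    for index, question_id in enumerate(turns):
        if question_id is None:
            continue
        previous = last_asked.get(question_id)
        if previous is not None and index - previous <= window:
            flagged.append((index, question_id))
        last_asked[question_id] = index
    return flagged


def action_mismatch(verbalized_tool, executed_tools):
    """True when the tool the agent said it would call was never called."""
    return verbalized_tool not in executed_tools


# Hypothetical trace: "budget" is re-asked 3 turns later (flagged);
# "timeline" is re-asked 6 turns later (outside the window).
turns = ["budget", None, "timeline", "budget", None, None, None, None, "timeline"]
print(find_repeated_questions(turns))

# Hypothetical tool names mirroring the calendar example in the text.
print(action_mismatch("query_availability", ["read_calendar"]))
```

Both checks run at the harness layer on recorded turns and tool calls, so they do not depend on the model reporting its own failure.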
EXPANDED CASE STUDY · §35
Delivery-workflow external-comms regression caught by MAST
The delivery workflow (client onboarding orchestration) had been running for 8 months with an apparent pass@1 of 92% by self-report. A MAST retrospective audit of 600 randomly sampled traces revealed something the team's pass@1 metric had missed entirely: 38 traces (6.3%) contained FM-2.3 (Task Derailment) where the agent drafted external communications to clients without triggering the governance-gate (Hard Rule #1) — drafts that were prevented from being sent only because Nour caught them in pre-output review, not because the system caught them. None of the 38 traces were counted as failures by the team's monitoring because they did not produce an error, did not return a non-200, and did not visibly violate any test assertion; they would have constituted catastrophic governance violations if Nour had not been on the human review loop.
The MAST audit's contribution was to surface this hidden tail: a 14-mode classification framework counts "draft external comm without gate trigger" as FM-2.3 even when no error was raised. Cross-reference: WSB-15 (governance) discusses the architectural fix (a governance-gate-enforced wrapper around all message-emit tools). The reliability lesson is that pass@1 is necessary but not sufficient: the same workflow scored 92% pass@1 and 0% governance-hard-rule coverage, and MAST surfaced the second metric.
The team's reliability dashboard was updated to dual-track both metrics. This case study generalizes a broader claim: any reliability framework that scores only on terminal output will under-count failures whose harm is intercepted by humans in the loop. MAST, by classifying intermediate-step failures, prevents this measurement artifact.
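A dual-track computation of this kind might look like the following sketch. The `Trace` fields and the definition of coverage (share of traces that drafted an external communication in which the governance gate actually fired) are assumptions made for illustration. The text does not specify how the dashboard computes either metric, and the sample data is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    terminal_pass: bool          # what a terminal-output metric sees
    drafted_external_comm: bool  # intermediate step that requires the gate
    gate_triggered: bool         # whether the governance gate actually fired


def pass_at_1(traces):
    """Share of traces whose terminal output passed."""
    return sum(t.terminal_pass for t in traces) / len(traces)


def gate_coverage(traces):
    """Share of gate-requiring traces in which the gate fired.

    Returns None when no trace required the gate.
    """
    needing_gate = [t for t in traces if t.drafted_external_comm]
    if not needing_gate:
        return None
    return sum(t.gate_triggered for t in needing_gate) / len(needing_gate)


# Hypothetical sample: nine terminal passes out of ten, but both traces
# that drafted an external communication skipped the gate.
traces = (
    [Trace(True, False, False)] * 7
    + [Trace(True, True, False)] * 2
    + [Trace(False, False, False)]
)
print(pass_at_1(traces), gate_coverage(traces))
```

The two ungated drafts count as passes on the first metric and as misses on the second, which is the measurement gap the case study describes.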
EXPANDED CASE STUDY · §36
Content-production 17-layer QC pipeline failure decomposition
The content-production workflow runs scripts through a 17-layer quality-control pipeline before approval. Pre-MAST instrumentation, the team tracked only the terminal QC pass/fail. A 12-week MAST audit on 2,100 layer-level trace transitions exposed three counterintuitive findings
- (a) the inter-annotator agreement
- (b) the power of the sample to detect differences in failure-mode rates between MAST modes
OPEN RESEARCH QUESTIONS · §40
Falsifiable hypotheses MAST opens up
(Q1) HYPOTHESIS: For workflows above L3 maturity on governance and observability, the MAST FC1/FC2/FC3 ratio stabilizes to a power-law (FC1 ≈ 10-15%, FC2 dominant 60-70%, FC3 ≈ 20-30%); for workflows below L3, the distribution is uniform or model-dominant. FALSIFICATION TEST: instrument 30 production workflows across maturity tiers, classify 500 traces per workflow, fit power-law on FC distribution, compare exponents across maturity tiers.
(Q2) HYPOTHESIS: Cross-organization MAST distributions show domain-specific signatures (lead-generation skews toward FM-2.2; finance toward FM-1.5; content toward FM-2.6) and these signatures can be used to identify workspace-domain mismatches. FALSIFICATION TEST: cross-organization study with at least 5 organizations and 3 workflow domains, measure inter-organization variance in domain signatures.
(Q3) HYPOTHESIS: A workflow's pass@k → pass@(k+1) marginal improvement is predictable from its MAST distribution at pass@1; specifically, workflows dominated by FM-2.6 retry well while workflows dominated by FM-1.5 do not. FALSIFICATION TEST: 50-workflow study measuring pass@1, pass@3, pass@5 paired with MAST distribution at pass@1, regression model.
(Q4) HYPOTHESIS: MAST inter-annotator agreement on the FM-1.5/FM-2.3 boundary is improvable beyond κ = 0.83 only by codebook refinement, not by additional training. FALSIFICATION TEST: 4-arm study (training-only, codebook-only, both, neither) on shared trace pool.
(Q5) HYPOTHESIS: MAST mode FM-2.4 (Information Withholding) is the rarest mode in single-agent workflows (<2%) but the most common mode in multi-agent workflows (>25%), making it a discriminating signature for DPI-violation detection. FALSIFICATION TEST: paired single-thread vs multi-agent runs on 30 tasks each.
(Q6) HYPOTHESIS: Workspaces that adopt MAST and act on findings show pass@k improvements that follow a logistic curve with inflection at ~6 months post-adoption; workspaces that adopt MAST without remediation budget show no improvement. FALSIFICATION TEST: 24-month longitudinal cohort of 15 workspaces split by budget allocation.
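The agreement statistic referenced in Q4 can be computed per study arm with Cohen's kappa [22]. The sketch below is the standard two-annotator form, κ = (p_o − p_e) / (1 − p_e). Applying it pairwise to a larger annotator pool is an assumption here, and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same traces.

    p_o is the observed agreement rate; p_e is the agreement expected by
    chance from each annotator's marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout;
        # kappa is undefined, so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations on the FM-1.5 / FM-2.3 boundary.
annotator_a = ["FM-1.5", "FM-1.5", "FM-2.3", "FM-2.3"]
annotator_b = ["FM-1.5", "FM-2.3", "FM-2.3", "FM-2.3"]
print(cohens_kappa(annotator_a, annotator_b))
```

In the toy example the annotators agree on 3 of 4 traces (p_o = 0.75) against a chance rate of p_e = 0.5, giving κ = 0.5.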
References
- [1] Cemri M., Pan M.Z., Yang S., Agrawal L.A., Chopra B., Tiwari R., Keutzer K., Parameswaran A., Klein D., Ramchandran K., Zaharia M., Gonzalez J.E., Stoica I. (2025), Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657v3, NeurIPS 2025 Datasets and Benchmarks Track, UC Berkeley + Intesa Sanpaolo.
- [2] Glaser B.G. & Strauss A.L. (1967), The Discovery of Grounded Theory: Strategies for Qualitative Research, Aldine.
- [3] Chen M. et al. (2021), Evaluating Large Language Models Trained on Code (HumanEval), arXiv:2107.03374.
- [4] Wu Q. et al. (2024), AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, ICML.
- [5] Moura J. (2024), CrewAI.
- [6] Hong S. et al. (2024), MetaGPT, ICLR.
- [7] Li G. et al. (2023), CAMEL, NeurIPS.
- [8] LangChain (2024), LangGraph.
- [9] Chen W. et al. (2024), AgentVerse, ICLR.
- [10] Park J.S. et al. (2023), Generative Agents, UIST.
- [11] Qin Y. et al. (2023), ToolBench.
- [12] Liu H. et al. (2024), Empirical Study on Challenging Cases for LLM Agents.
- [13] Yao S. et al. (2023), ReAct, ICLR.
- [14] Shinn N. et al. (2023), Reflexion, NeurIPS.
- [15] Cognition Labs (2025), Don't Build Multi-Agents, cognition.ai blog.
- [16] Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning, arXiv:2604.02460.
- [17] Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1.
- [18] OpenAI (2024), GPT-4 Technical Report.
- [19] Anthropic (2025), Claude Sonnet 4.5 Technical Report.
- [20] Anthropic (2024), Claude 3.7 Sonnet Technical Report.
- [21] OpenAI (2024), o1 System Card.
- [22] Cohen J. (1960), A Coefficient of Agreement for Nominal Scales, Educational and Psychological Measurement 20:37-46.
- [23] Anthropic (2025), Building Agents Cookbook.
- [24] Madani Lab (2026), reliability-pillar-policy.md v1.2.
- [25] Madani Lab (2026), 1,180-Trace Madani Failure Dataset (anonymized, MIT release pending).
- [26] Cemri M. et al. (2025), MAST-Data on HuggingFace (mcemri/MAST-Data); MAST taxonomy code at github.com/multi-agent-systems-failure-taxonomy/MAST.
