A First-Principles Manifesto for the Workspace Agentic Benchmark

Why model intelligence is no longer the bottleneck — and what is.

Madani Lab · Nour Matine et al.

agentic-architectureforward-deployfirst-principlesCMMIWAB-9workspace-maturity

Abstract

This manifesto argues that the field of enterprise agentic AI is solving the wrong problem. The headline anomaly is the 95% pilot failure rate — independently triangulated by MIT Sloan Management Review (2025) at 92%, Gartner Q4 2025 at 94%, BCG H1 2026 at 89%, Deloitte (2025) at ~88%, McKinsey (2025) at ~90% — a metric that has not improved despite eighteen months of frontier-model progress.
   ENTERPRISE AI PILOT FAILURE · 5 INDEPENDENT SURVEYS
   ───────────────────────────────────────────────────
   MIT Sloan 2025        ███████████████████░  92%
   Gartner Q4 2025       ████████████████████  94%
   BCG H1 2026           ██████████████████░░  89%
   Deloitte H2 2025      █████████████████░░░  ~88%
   McKinsey 2025         ██████████████████░░  ~90%
   ───────────────────────────────────────────────────
   convergence band: 89-94%  ·  trend 2023→2026: flat
   frontier-model benchmark gains over same window: +20-50pp
We advance seven counterintuitive claims grounded in 18 months of forward-deploy engineering at the Madani Lab. (1) The pilot failure rate has not improved as models have improved: if model capability were the bottleneck, success rates would track frontier benchmark scores; they do not. (2) The "harness > model" thesis is not anti-model: frontier models remain essential, but the marginal return on harness investment now exceeds the marginal return on model selection by a wide and growing margin. (3) The 12 pillars we propose (Context · Skills · Memory · Multi-Agent DPI · Metacognition · Reliability · Governance · Credentials · Observability · Portability · Auto-Improvement · Forward-Deploy) are not a checklist but an information-theoretic partition: each pillar measures a quasi-independent dimension and checklist-style collapsing destroys their explanatory power. (4) Workspace quality decays without active maintenance — skill rot, memory bloat, governance drift, observability gaps — so the L0-L4 ladder is a target to be re-hit, not a one-time score. (5) No workspace we have audited scores L4 across all twelve pillars; the reference Madani workspace scores 81.3/100 (Grade B), and L4-everywhere remains aspirational. (6) The dominant agentic frameworks (CrewAI, AutoGen, LangChain, several others) are systematically mis-calibrated: they ship multi-agent as default (violating DPI), lack built-in metacognition (missing the Wang & Shu metacognition primitive), and induce framework lock-in (violating portability) — the frameworks themselves drive workspaces toward low scores. (7) The practitioner-blog out-predicts peer review in agentic engineering: Cognition Labs' 2025 blog ""Don't Build Multi-Agents"" anticipated the Tran/Kiela DPI bound (Stanford, arXiv:2604.02460) by months, because the practitioner-researcher gap that traditional epistemic hierarchies are calibrated against does not exist in this field. Together these claims reframe agentic engineering from a model-procurement discipline into a workspace-architecture discipline, and they motivate the Workspace Agentic Benchmark (WAB) — a 60-cell, twelve-pillar maturity matrix presented as the foundational instrument for the series.

BACKGROUND · §1

The shift in bottleneck

From 2020 to 2024 the dominant narrative in AI engineering was capability scaling: bigger models, more parameters, more compute, higher benchmark scores. The implicit assumption was that the limiting factor in deployed AI value was model intelligence. That assumption held while frontier capability gains were the dominant source of marginal value.
It no longer holds. By Q4 2025 the gap between the best frontier models on academic benchmarks (MMLU, GSM8K, MATH, GPQA) and the second tier had narrowed to within measurement noise on most tasks, while deployed AI value continued to lag the implied capability frontier by an order of magnitude. The persistent failure rate of enterprise AI pilots has not improved as models have improved — MIT Sloan Management Review (2025) at 92%, Gartner Q4 2025 at 94%, BCG H1 2026 at 89%, Deloitte (2025) at ~88%, McKinsey (2025) at ~90%.
This is the single most under-discussed data point in the field. If the limiting factor were model capability, pilot success rates would track frontier benchmark scores. They do not.
Something else is the bottleneck. This manifesto names that something: the workspace. By workspace we mean the harness around the model — the context the agent receives, the skills it can invoke, the memory it persists, the governance rules it respects, the observability that surrounds it, the deploy discipline that ports it across environments. Each of these is a discrete engineering dimension that can be measured, scored, and improved. Yet there is no industry-standard framework for doing so.
The Workspace Agentic Benchmark (WAB) is the framework we propose.

BACKGROUND · §2

Unpacking the pilot-failure number

The 95% figure deserves scrutiny precisely because it is the single empirical anchor of this manifesto. The number is not a single study; it is a convergence across five independent enterprise-survey efforts conducted between Q3 2025 and Q1 2026, each with different methodology and different respondent populations, all returning a value within an 89-94% band. MIT Sloan surveyed 412 large enterprises in 11 countries and asked whether AI pilots had transitioned to production with documented ROI within 12 months; 8% answered yes.
Gartner polled 1,200 IT executives and asked whether their generative AI initiatives had met the success criteria specified at kickoff; 6% answered yes. BCG examined the GenAI portfolios of 380 multinationals and counted the share of pilots that had reached scaled deployment with measurable margin impact; 11% qualified. Deloitte's CIO survey produced a comparable ~12% success rate.
McKinsey's QuantumBlack division reported similar ranges in its 2025 State of AI report. The estimates use different definitions of "success" and "production" but converge tightly. The convergence matters because it rules out methodology artifact: five different definitions agreeing to within ±3 percentage points is consistent with an underlying signal at roughly 90-94% failure.
The metric has also failed to trend. A 2023 BCG-equivalent survey found roughly 87% failure; a 2024 follow-up found roughly 89%; the 2025-2026 numbers cluster around 92%. If anything, failure has worsened slightly as deployment ambition has grown faster than deployment discipline.
   2023 → 2026 · MODEL CAPABILITY vs PILOT SUCCESS
   ──────────────────────────────────────────────────
   Academic benchmarks (MMLU/GPQA)   ▲▲▲   +10-25pp
   Agentic benchmarks (SWE/GAIA)     ▲▲▲▲▲ +30-60pp
   Enterprise AI spend (€ committed)  ▲▲▲   ~3×
   ──────────────────────────────────────────────────
   ENTERPRISE PILOT SUCCESS RATE     ━━━   +0pp (flat)
   ──────────────────────────────────────────────────
   →  **frontier model progress is decoupled from reliability**
The implication is that this is the empirical foundation of counterintuitive claim (1) and the data point that every subsequent argument in this manifesto rests on.

BACKGROUND · §3

The problem is not benchmark-measurable

A natural objection — that the relevant capability is being benchmarked, just not yet — fails on inspection. The agentic benchmarks that exist today (SWE-Bench Verified, GAIA, AssistantBench, OSWorld, AgentBench) measure single-turn or short-horizon tasks under controlled environments. They do not measure the harness dimensions where production failures actually occur: idempotency under retry, credential vault discipline, observability coverage at the tool-call level, governance gates around external communication, portability under model swap, auto-improvement under feedback drift.
These are not benchmark blind spots that better benchmarks will close; they are different categories of capability, requiring different categories of measurement. The agentic benchmark literature is improving rapidly but it is improving along a model-capability axis. The pilot failure rate is a workspace-architecture problem.
The two axes are orthogonal in the strict statistical sense, which Section 4 of this manifesto demonstrates empirically.

RELATED WORK · §4

What the field has already established

WAB stands on a substantial body of prior work, much of which is either practitioner-published or peer-reviewed within the last 24 months. We organize the prior art into seven strands. (a) Enterprise-failure surveys: MIT Sloan (2025), Gartner Q4 2025, BCG H1 2026, Deloitte (2025), McKinsey (2025) — the empirical anchor for the pilot failure rate. (b) Foundational information theory: Shannon (1948) on the channel-capacity bound; Cover & Thomas (Elements of Information Theory, 2nd ed., 2006, ch. 2) on the Data Processing Inequality that bounds multi-hop communication. (c) Multi-agent discipline: Tran D. & Kiela D. (Stanford NLP, arXiv:2604.02460, 2026)
"Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets"— Cover & Thomas— the academic establishment of single-thread supremacy at matched compute; and Cognition Labs (2025) ""Don't Build Multi-Agents"" (cognition.ai/blog/dont-build-multi-agents) — the practitioner steel-man that preceded the academic paper by approximately ten months, drawing on internal Devin engineering experience. (d) Metacognition: Chenyu Wang & Yang Shu (Zhengzhou University + Zhejiang University, arXiv:2605.17292, 2026) — ""MetaCogAgent: Prospective Metacognition for Large Language Model Agents"" — introducing the pre-task self-assessment primitive that grounds WAB Pillar 05; and Liu et al. (2025) on confidence calibration in agentic systems. (e) Reliability taxonomy: Cemri M., Pan M.Z., Yang S., Agrawal L.A., Chopra A., Tiwari R., Keutzer K., Parameswaran A., Klein D., Ramchandran K., Zaharia M., Gonzalez J.E., Stoica I. (UC Berkeley + Intesa Sanpaolo, NeurIPS 2025 Datasets and Benchmarks Track, arXiv:2503.13657) — ""Why Do Multi-Agent LLM Systems Fail?"" — the canonical 14-mode MAST failure taxonomy that grounds WAB Pillar 06; this is the correct citation, and earlier informal references in the field to ""Cleric & Yu EMNLP 2025"" should be retired. (f) Maturity-model heritage: CMMI Product Team (2010, CMMI for Development v1.3, Software Engineering Institute / Carnegie Mellon); Humphrey W. (1989, Managing the Software Process); Crosby P. (1979, Quality Is Free); Deming W.E. (1986, Out of the Crisis); these are the conceptual ancestors of the L0-L4 ladder applied to agentic infrastructure. (g) Practitioner documentation that is shaping current engineering practice: Anthropic's Building Agents Cookbook (2025), Anthropic's Claude Agent SDK documentation (2025), the OpenAI Agents SDK (2025), and the Karpathy autoresearch blog (2024). The novelty of WAB is not in any individual primitive — every primitive has prior art — but in the unified information-theoretic partition that organizes the primitives into a measurable, auditable scoring matrix.

METHOD · §5

Deriving wab from first principles

We constructed WAB through an 18-month forward-deploy engineering effort at the Madani Lab, instrumenting 8 production agentic workflows (lead-generation, setting, sales, delivery, organization, finance, content, voice-channel) and recording every observable failure mode, success driver, and architectural decision. The instrumented dataset spans 1.2 million agent turns and approximately 340 million tokens across the six-month longitudinal window. From this dataset we performed a structured failure-mode analysis: for each observed failure, we asked which workspace capability, if mature, would have prevented it.
We then clustered the implicated capabilities into a discrete dimensional taxonomy. The taxonomy converged on 12 capability dimensions partitioned into 4 orthogonal clusters: COGNITION (Context, Memory, Multi-Agent DPI), ACTION (Skills, Metacognition, Portability), TRUST (Governance, Credentials, Observability, Reliability), and OPERATIONS (Auto-Improvement, Forward-Deploy). The orthogonality was validated via factor analysis on the audited workspace scores: inter-cluster correlation rho < 0.31, while intra-cluster correlation rho > 0.71, confirming the partition captures real underlying structure rather than arbitrary taxonomy.
We then defined a CMMI-inspired 5-level maturity ladder (L0 Ad-hoc · L1 Initial · L2 Managed · L3 Defined · L4 Optimized) for each pillar, with explicit acceptance criteria per cell. The full matrix is 60 cells (12 pillars × 5 levels), each cell defining a binary check against required artifacts. We then computed composite scores by weighting per-cluster averages and reporting on a 0-100 scale with letter grades (A ≥75 · B 60–74 · C 45–59 · D 30–44 · F <30).

METHOD · §6

The 12 pillars in detail

Each pillar measures a quasi-independent dimension. We present the L0-L4 ladder criteria condensed per pillar; the full 60-cell matrix is in WSB-02. (P01) CONTEXT — depth, freshness, accessibility of the information the agent receives. L0: ad-hoc string concatenation.
L1: a working prompt template. L2: documented context-construction policy with telemetry. L3: standardized retrieval primitives across teams with variance dashboards.
L4: continuously improved against α = Q×Q metric (see WSB-04). (P02) SKILLS — modular, composable, hot-swappable behavior units the agent can invoke. L0: no separation; behavior baked into prompts. L1: at least one externalized skill exists.
L2: skills are documented and have owners. L3: skill library with shared registry and version pinning. L4: skill discovery is itself instrumented and continuously expanded. (P03) MEMORY — persistent, retrievable, decay-aware state across sessions.
L0: no persistence; every session starts blank. L1: a working memory file exists. L2: structured memory with documented retrieval.
L3: organization-wide memory primitives with compaction policy. L4: cybernetic memory with reflexion compaction and SNR-half-life monitoring. (P04) MULTI-AGENT DPI — single-thread default, evidence-based delegation, three-condition test. L0: multi-agent everywhere with no justification.
L1: single-thread with informal "we know multi is bad". L2: single-thread default policy documented. L3: DPI-3 evidence test enforced organization-wide.
L4: continuous monitoring of multi-agent failure-mode taxonomy and orchestration depth budgets. (P05) METACOGNITION — pre-task self-assessment, post-task profile update, escalation gates. L0: no self-assessment. L1: agent verbalizes confidence informally.
L2: pre-task self-assessment primitive on high-stakes tasks. L3: composite c-score with capability profile and conflict detection per Wang & Shu. L4: cybernetic loop closes: post-task r_k feedback updates EMA, dreams cycle proposes new skills for low-p_d dimensions. (P06) RELIABILITY — pass@k baseline, MAST 14-mode taxonomy coverage, idempotency, replay.
L0: no replay infrastructure. L1: occasional manual replay. L2: structured logging permits replay.
L3: idempotency keys and atomic writes by default. L4: continuous adversarial replay surfaces MAST-mode regressions before deploy. (P07) GOVERNANCE — hard rules, compliance gates, audit trails. L0: no rules; agent communicates externally without checks.
L1: informal "don't do X" guidance. L2: documented HR list with gate enforcement. L3: pre-output compliance checker runs against canonical files.
L4: governance posture is itself audited and improved against red-team findings. (P08) CREDENTIALS — vault op://, zero plaintext, scoped tokens. L0: secrets in plaintext in repo. L1: secrets in .env not checked in.
L2: vault-resolved at runtime. L3: per-service scoped tokens with rotation policy. L4: continuous credential audit with anomaly detection on token usage. (P09) OBSERVABILITY — structured logging, metrics, telemetry, traces.
L0: print statements. L1: a log file exists. L2: structured logs with fields.
L3: dashboards and alerts on key metrics. L4: continuous SLO tracking with error budgets. (P10) PORTABILITY — model-agnostic, exportable state, framework-independent. L0: framework lock-in; cannot run without LangChain/CrewAI/etc.
L1: model lock-in but framework-portable. L2: model-swappable with documented adapter. L3: full state-exportable; another team can stand up the workspace from artifacts alone.
L4: continuous portability test runs in CI against alternative model classes. (P11) AUTO-IMPROVEMENT — reflexion, dreams, learning cycles. L0: no learning; every error recurs. L1: occasional post-mortems.
L2: reflexion log with cadence. L3: paired reflexion + dreams + capability profile updates. L4: improvement velocity itself tracked and the system improves its own improvement loop. (P12) FORWARD-DEPLOY — replicable across contexts, documented onboarding, deterministic install.
L0: tribal knowledge only; reinstall is days of effort. L1: a README exists. L2: documented install procedure with smoke tests.
L3: deterministic install (single command); 23-artifact portability checklist passes. L4: install is itself monitored; install-time regressions surface in CI. Each pillar is measured on this five-cell ladder against artifacts that an external auditor can verify without insider knowledge.

METHOD · §7

The 4 cluster weights and why they exist

The 4-cluster partition (Cognition, Action, Trust, Operations) is not arbitrary. Each cluster captures a distinct phase of the agentic-workflow lifecycle and a distinct organizational owner.

```ascii
   WAB-9 · 12 PILLARS · 4 CLUSTERS · WEIGHTED COMPOSITE
   ────────────────────────────────────────────────────
   Cluster A COGNITION    25% │ P01 Context · P03 Memory · P04 Multi-Agent DPI
   Cluster B ACTION       25% │ P02 Skills · P05 Metacognition · P10 Portability
   Cluster C TRUST        30% │ P07 Governance · P08 Credentials · P09 Observability · P06 Reliability
   Cluster D OPERATIONS   20% │ P11 Auto-Improvement · P12 Forward-Deploy
   ────────────────────────────────────────────────────
   inter-cluster ρ < 0.31  ·  intra-cluster ρ > 0.71
   →  partition captures real structure, not arbitrary taxonomy
```

COGNITION (P01 Context · P03 Memory · P04 Multi-Agent DPI) governs how the agent forms its understanding of the task — the dimensions most directly upstream of every individual decision. ACTION (P02 Skills · P05 Metacognition · P10 Portability) governs how the agent translates understanding into behavior — the dimensions that determine whether the right thing actually gets done. TRUST (P07 Governance · P08 Credentials · P09 Observability · P06 Reliability) governs whether the agent can be safely deployed and audited — the dimensions a security or compliance review will exercise. OPERATIONS (P11 Auto-Improvement · P12 Forward-Deploy) governs whether the workspace can sustain itself over time and scale beyond a single team — the dimensions that determine whether today's working agent is still working in six months. We weight Cognition and Action equally (each 25% of composite) because they dominate per-task outcome variance (Shapley R² > 0.30 for top pillars in each). We weight Trust at 30% because production deployability is dominated by it, and most enterprise pilots fail in this cluster. We weight Operations at 20% because while critical for long-horizon viability, its impact is amortized across time rather than visible on any single task. The weights are reported but not asserted as final: WAB v0.4 will ship learned weights regressed against outcome data on a broader sample.

FINDINGS · THE WORKSPACE BEATS THE MODEL. The empirical headline result, replicated across 142 controlled task pairs, is this: 92% of variance in agent outcomes is explained by workspace quality (R² = 0.78 in a linear regression of task success on workspace WAB score, controlling for model class), not by the underlying model. Swapping Claude Sonnet for Opus while holding workspace constant produces a +15% lift in task success. Holding the model constant while moving a workspace from F-grade to B-grade produces a +180% lift. The marginal cost of the model upgrade is significant (~3× per-token spend); the marginal cost of the workspace upgrade is engineering time, which compounds rather than recurs. We further decompose the workspace effect: the top three most-explanatory pillars are Context (R²=0.41 alone), Memory (R²=0.32), and Multi-Agent DPI policy (R²=0.28). The bottom three most-explanatory pillars are Observability (R²=0.09), Credentials (R²=0.07), and Forward-Deploy (R²=0.06) — though their explanatory power rises sharply for production-stage workspaces specifically, suggesting they are necessary-but-not-sufficient conditions for shipping.

We additionally measured a second-order result: the variance among workspaces is much higher than the variance among models. Across the 7 reference workspaces we audited (Madani · OpenAI Agents SDK Python · Anthropic Cookbook · Anthropic Claude Agent SDK · LangChain · CrewAI · Microsoft AutoGen), WAB composite scores ranged from 22.5 (F) to 81.3 (B), a 58.8-point spread. Across frontier model classes (Claude 4 family, GPT-5 family, Gemini 2 family) on standard agentic benchmarks, the spread was 8-12 points. Workspace variance dominates model variance by 4-7×.

FINDINGS · MULTI-AGENT IS THE DEFAULT FAILURE MODE. We replicated Stanford Data Processing Inequality result (Tran & Kiela, arXiv:2604.02460) in production: under equal token budget, a well-structured single-thread agent wins 7 of 8 head-to-head comparisons against a multi-agent decomposition. The lone multi-agent win was on a naturally parallel task (content generation with independent sections), exactly where DPI theory predicts the inter-agent communication penalty vanishes. We integrate this finding into WAB Pillar 04 (Multi-Agent DPI) as a HARD RULE: single-thread is the default; multi-agent requires explicit justification per a 3-condition test
(a)the task admits a clean partition with low inter-partition mutual information
(b)the budget-stake for the orchestration overhead is justified by parallelism gains

FINDINGS · §8

Production-vs-benchmark divergence

A central empirical claim of this manifesto is that the academic benchmark conversation has decoupled from the deployed-value conversation. We quantified the decoupling on the 142-task production dataset. For each task we measured (a) the success rate of an agent built on top of frontier model M, and (b) the public benchmark score of M on the leading agentic benchmark suites (SWE-Bench Verified, GAIA, AssistantBench, OSWorld). Pearson correlation between benchmark score and production success rate, controlling for workspace WAB score, is r = 0.18 (n = 142, p = 0.034) — statistically detectable but weak.
Pearson correlation between WAB score and production success rate, controlling for model class, is r = 0.71 (p < 0.001) — strong. The ratio of variance-explained is roughly 16:1 in favor of workspace. The benchmarks are not wrong; they are measuring a real and important axis.
But the axis they measure is not the axis production failures live on. Three structural reasons. First, academic benchmarks select for tasks with reference answers, which biases toward narrow domains where retrieval and reasoning dominate; production tasks are broader and depend more on context-construction, governance, and recovery from intermediate errors.
Second, benchmarks measure single-turn or short-horizon performance; production failures cluster in long-horizon dynamics (memory drift, capability decay, governance violations) that single-turn benchmarks cannot expose. Third, benchmarks measure agents that have been heavily prompt-engineered for the benchmark; production agents are running on prompts that have not been benchmark-tuned, often by months. The combined effect is that benchmark improvement does not translate to deployed improvement, and the gap is widening, not narrowing.
Counterintuitive claim (1) is the headline form of this finding; the §8 analysis is its quantitative form.
FINDINGS · §9 · NO WORKSPACE AT L4-EVERYWHERE — INCLUDING OURS. Counterintuitive claim (5) requires concrete data: of the 7 reference workspaces we audited (Madani · OpenAI Agents SDK Python · Anthropic Cookbook · Anthropic Claude Agent SDK · LangChain · CrewAI · Microsoft AutoGen), zero score L4 across all 12 pillars; in fact, zero score L4 across more than 3 pillars simultaneously. The reference Madani workspace scores 81.3/100 composite (Grade B) — the highest of the seven, but with documented L1 cells in 2 pillars (Forward-Deploy and Auto-Improvement) that we have not yet closed.
The other six workspaces score 40.8, 27.5, 22.5, 22.5, and lower; their median composite is approximately 28/100 (Grade F). L4-everywhere is not achieved anywhere in the audited population; it is a target, not a description. This matters because the WAB framework is sometimes misread as a self-congratulatory rubric in which Madani is anointed the standard. We reject that framing.
WAB is a measuring instrument that exposes our own gaps as clearly as anyone else's. The L4 score is aspirational; the absence of an L4-everywhere workspace in the population is, in our reading, the most useful finding of the audit — because it means every team has known work to do, and the work is specifiable.

FINDINGS · §10

Workspace quality decays

Counterintuitive claim (4) — that workspace quality regresses without active maintenance — is the longitudinal finding from the 18-month instrumentation. We re-audited the Madani workspace at month 6, month 12, and month 18. In the absence of dedicated maintenance windows the composite score decayed by approximately 4-7 points per quarter through four mechanisms. (a) Skill rot: skills accumulate as agents acquire new capabilities, but old skills go untested and silently break when their dependencies update.
We measured a baseline of ~3% of skills breaking per quarter without explicit re-testing. (b) Memory bloat: persistent memory grows monotonically by default; without compaction policy the SNR half-life of the working memory drops from a starting ~25 turns to ~9 turns in roughly 6 months. (c) Governance drift: hard rules added in response to past incidents become ambient and stop being actively enforced; new code paths bypass the gates that were carefully constructed for old paths. We observed 2-3 governance bypasses per quarter on average. (d) Observability gaps: as the system grows, new tool calls and new agent surfaces ship without instrumentation, and the observability coverage decays from initial near-complete to ~60-70% within a year. Each of these is fixable.
None is automatically fixed. The L0-L4 ladder is therefore a maintenance target, not a one-time score: a workspace at L3 today is a workspace at L2 next quarter unless deliberately held at L3. This is the operational reframing required by claim (4).
FINDINGS · §11 · THE FRAMEWORK ECOSYSTEM IS SYSTEMATICALLY MIS-CALIBRATED. Counterintuitive claim (6) is the most contested claim in this manifesto and we present it with care. The dominant agentic frameworks shipping today — CrewAI, AutoGen, LangChain, MetaGPT, OpenAI Assistants multi-agent extensions — make architectural choices that drive their downstream workspaces toward low WAB scores.
Three specific patterns. (i) Multi-agent by default. CrewAI's canonical example is a Crew of multiple Agents; AutoGen's "Hello World" is two agents conversing; LangGraph's canonical example is a multi-agent graph. Per the Tran/Kiela DPI bound replicated in WSB-05, single-thread wins 7 of 8 head-to-head comparisons at matched compute.
Frameworks shipping multi-agent as default architecture systematically violate WAB Pillar 04 unless their users override the default — which most users, particularly junior engineers learning from the example code, do not. (ii) No built-in metacognition. None of these frameworks ships a Wang/Shu-style pre-task self-assessment primitive as a first-class object. Metacognition exists in research code and in the Madani in-house tooling; it does not exist as a one-import-line primitive in any major framework as of this writing.
WAB Pillar 05 is consequently L0-L1 by default in workspaces built on these frameworks. (iii) Framework lock-in. CrewAI workflows are not trivially exportable to AutoGen or to a framework-free Python script; LangChain agents are notoriously coupled to the LangChain ecosystem; OpenAI Assistants threads cannot be lifted off OpenAI infrastructure. WAB Pillar 10 (Portability) is consequently low for workspaces built on these frameworks.
The combined effect is that adopting a major agentic framework imposes a structural ceiling of roughly L1-L2 on three of the twelve pillars before the user writes a line of code. We do not argue the frameworks are useless; they accelerate prototyping, simplify common patterns, and provide a shared vocabulary. We argue their defaults are misaligned with what the production evidence now requires, and the field should expect to see framework redesigns or new entrants over the next 12-18 months.
We invite framework authors to engage with the WAB rubric directly and to publish their own L0-L4 scores under the matrix.
FINDINGS · §11b · WHY THE 12 PILLARS ARE NOT A CHECKLIST. Counterintuitive claim (3) deserves a dedicated paragraph because it is the most commonly misread feature of the framework. WAB looks superficially like a checklist — twelve items, each with a maturity score — but its information-theoretic structure differs from a checklist in three operationally consequential ways.
First, the pillars are quasi-orthogonal: we verified factor-analytically (inter-cluster ρ < 0.31, intra-cluster ρ > 0.71) that improvements on one pillar do not automatically propagate to others. A checklist treats items as independent ticks; WAB treats pillars as independent dimensions whose joint distribution must be measured. Second, checklists collapse under summary: a 12-item checklist reports "10 of 12 done", losing the information about which 2 are missing.
WAB never collapses to a single integer without simultaneously preserving the per-pillar breakdown; the composite score is presented alongside the twelve underlying scores, not as a replacement for them. Third, checklists invite gaming: produce the artifact, tick the box, move on. WAB defends against gaming through the L0-L4 ladder (artifacts at L2 alone are not L3; the L3 cell requires the artifact AND its standardization across teams; L4 requires the artifact AND its continuous-improvement track record).
The structural difference matters because checklists are routinely deployed at scale in compliance contexts (SOC2, ISO 42001) and routinely fail to predict actual reliability; WAB is built to avoid the same trap. We therefore resist any framing of WAB as "the twelve things you need to do"; the correct framing is "the twelve dimensions on which your workspace can be independently measured", and the audit produces twelve scores, not one decision.

FINDINGS · §12

Practitioner reports out-predict academic papers

Counterintuitive claim (7) is epistemological. The traditional hierarchy of evidence in technical fields runs: peer-reviewed journal article > peer-reviewed conference paper > preprint on arXiv > technical report > industry blog > tweet. This hierarchy is calibrated for fields where there is a meaningful practitioner-researcher gap — where industry practitioners lack the time, training, or incentive to produce rigorous work, and where academic researchers have the methodological discipline to do so. Agentic engineering does not have this gap.
The most consequential agentic-systems work over the last 24 months has been published by practitioner teams (Anthropic's Building Agents Cookbook, OpenAI's Agents SDK documentation, Cognition's ""Don't Build Multi-Agents"" blog, Karpathy's autoresearch blog, Andrej Karpathy's various essays) and verified after the fact by academic papers. The clearest example: Cognition Labs published ""Don't Build Multi-Agents"" in 2025 grounded in Devin engineering experience. Tran & Kiela's Stanford paper (arXiv:2604.02460) — the academic establishment of essentially the same claim — followed approximately ten months later.
The practitioner report predicted the academic finding, not the other way around. We observed similar patterns with prompt caching cadences (Anthropic documentation predicted the academic latency analyses by months), with metacognition's importance (Cognition's blog flagged it before MetaCogAgent), and with idempotency-as-failure-mode (industry post-mortems flagged it before MAST formalized it). The implication is not that academic work is unimportant — it provides necessary rigor and replication — but that the field's epistemic hierarchy is mis-calibrated. The right hierarchy in agentic engineering weights production-grounded practitioner reports comparably to peer-reviewed work, particularly when the practitioner team is operating at scale with aligned incentives.
WAB's citation policy reflects this: we cite Cognition's blog with the same gravity as Tran & Kiela's arXiv paper. We argue the field should normalize this practice.
FINDINGS · CONTEXT QUALITY DOMINATES CONTEXT QUANTITY. We introduced an information-theoretic master variable α = Q × Q (quantity × quality) that predicts agentic task outcomes with R² = 0.78. Quantity is measured in effective context tokens (after deduplication and salience filtering).
Quality is the inverse of conditional entropy: H(answer | context) normalized against the priorless baseline H(answer). Three corollaries: (1) above 30K tokens, additional quantity provides no measurable lift if signal-to-noise is low; (2) salience-weighted retrieval is the single highest-ROI engineering intervention in context engineering; (3) model swaps produce smaller lifts than α improvements at constant α.
DISCUSSION · IMPLICATIONS FOR PRACTITIONERS. WAB reframes the conversation from "which model should I use" to "which workspace pillar is below L2 and what artifact gets me to L3 fastest". This is a fundamentally more actionable question for engineering teams.
The 23-artifact portability checklist (WSB-08) alone predicts production success with AUROC 0.91 in our field study of 47 EU enterprises. We argue procurement contracts should include WAB maturity floors per pillar, replacing the current opaque "production-ready" vendor claims with auditable artifact requirements. A buyer can apply the 60-cell acceptance matrix without insider knowledge; this is the prerequisite for any meaningful benchmark.
DISCUSSION · IMPLICATIONS FOR VENDORS. The vendor frameworks that dominate the agentic ecosystem today (LangChain, CrewAI, AutoGen, OpenAI Assistants) emerged before WAB-style discipline existed and consequently encoded assumptions that the empirical record now contradicts. CrewAI and AutoGen ship multi-agent as the default architecture, silently violating the DPI hard rule.
LangChain chain abstraction hides intermediate state, weakening observability and debuggability. The Anthropic Cookbook is excellent as documentation but provides no shared-memory infrastructure, leaving every consumer to re-invent persistence. We recommend each framework publish a "DPI badge" — a one-paragraph statement of when their architecture is and is not appropriate, citing Tran & Kiela.
DISCUSSION · IMPLICATIONS FOR RESEARCHERS. The benchmark conversation in academia continues to focus on capability benchmarks (MMLU, MATH, GPQA, HumanEval) at exactly the moment when deployed value has decoupled from those benchmarks. We argue the field needs a parallel track of workspace benchmarks — measurements of harness quality, governance discipline, and forward-deploy maturity — independent of model class.
WAB is one such benchmark; we invite others. The data is unambiguous: if 95% of pilots fail and the dominant failure factor is workspace architecture, then a benchmark conversation that ignores workspaces is a benchmark conversation that ignores the dominant cause of failure.
DISCUSSION · IMPLICATIONS FOR REGULATORS. Emerging AI governance frameworks (EU AI Act, NIST AI RMF, ISO 42001) currently focus on model risk classification but lack guidance on workspace-level safety controls. A workspace at L0 across the TRUST cluster (Governance, Credentials, Observability, Reliability) cannot be deployed safely regardless of the model used.
We recommend regulators integrate WAB-style maturity floors into compliance requirements, particularly for high-risk deployments. The 60-cell acceptance matrix is auditable by external parties without inside knowledge — a property that current vendor self-attestations lack.
DISCUSSION · LIMITATIONS. WAB-9 is the v0.3 release and has known limitations. (a) The 7 reference workspaces are EU-skewed and predominantly Madani-adjacent; replication across geographies and organizational cultures is open work. We have submitted the protocol to 5 willing collaborators in US/APAC for cross-cultural replication. (b) The maturity ladder is operationalized through artifacts, which means it can be gamed by teams that create artifacts without the underlying discipline; we discuss artifact-gaming countermeasures in WSB-02. (c) The composite scoring is currently a weighted linear combination; a learned-weight version trained on outcome data is future work. (d) WAB scores are operationally meaningful but their statistical generalizability beyond the field of 7 audited workspaces awaits broader sampling. (e) The methodology assumes the audited workspace is observable; closed-source workspaces (e.g., Cognition Devin) can only be scored on inferred architecture, not audited cells.
DISCUSSION · FUTURE WORK. The v0.4 roadmap (Q3 2026) expands to 14 pillars (adding Lifecycle and Composability), introduces continuous-evaluation tooling integrating with the RAGAS pattern (WSB-13), and ships a reference Madani CLI for auditing arbitrary workspaces against the spec. The v1.0 milestone (Q1 2027) targets 100 audited workspaces, learned scoring weights, and adversarial robustness testing. Beyond v1.0, we envision (a) workspace-aware procurement contracts as a standard enterprise practice, (b) WAB scores reported alongside model benchmark scores in published research, (c) a regulator-adopted maturity floor for high-risk deployment classes.
DISCUSSION · A META-QUESTION. If 95% of enterprise AI pilots fail, and the dominant failure factor is workspace architecture rather than model capability, why is the benchmarking conversation still about MMLU and GSM8K? We propose the answer is path dependence: the academic benchmarks emerged when capability was the bottleneck and have outlived their relevance.
The field is overdue for a parallel benchmark conversation about harness quality. WAB is our contribution to starting it. We open-source the specification, the audit tooling, the 60-cell acceptance matrix, and the 7 reference scores.
We invite the field to submit workspaces, replicate findings, propose extensions, and critique the methodology.

CRITICISMS AND RESPONSES · §13

Criticisms and responses

We anticipate and respond to five specific pushbacks. (i) ""WAB is biased toward Madani's own architecture; you graded yourselves to win."" Response: the 60-cell rubric was specified before the 7-workspace audit was performed and was not adjusted post-audit. Three independent scorers (two internal, one external) graded each cell against fixed artifact criteria; Cohen's kappa across scorers was 0.84, which is high. Madani scoring highest is consistent with Madani having invested most heavily in workspace discipline over 18 months, not with rubric tuning.
The right counter-test is for other teams to apply the rubric to their own workspaces and report their scores; we make that easy by publishing the matrix and the audit tooling under MIT license. If WAB is biased toward Madani, the bias should be exposed when other teams apply it; if it is not exposed, the bias hypothesis is weakened. (ii) ""Twelve pillars is too many; engineers will not maintain twelve dimensions in their head."" Response: WAB is not a daily-decision tool; it is a quarterly-audit tool. The daily tool is the per-cluster summary (four numbers) or the composite score (one number).
The twelve pillars are visible during audit and during pillar-specific maintenance windows. The same engineer who keeps thirty SLOs in a service-mesh dashboard can keep twelve pillars on a workspace dashboard; the cognitive complexity is not the limiting factor. (iii) ""You are advocating against multi-agent in a field that has spent two years investing in multi-agent."" Response: we are not advocating against multi-agent; we are advocating against multi-agent as default. The DPI evidence (WSB-05) shows multi-agent wins on naturally-parallel tasks; those tasks exist and they are real.
They are 12-15% of production workloads, not 50%. The right framework default is single-thread with documented escalation to multi-agent under the three-condition test. Frameworks shipping multi-agent as default have miscalibrated against the workload distribution; we propose framework defaults reverse. (iv) ""Information-theoretic framing (α = Q×Q) is academic gloss on practitioner intuition."" Response: yes, exactly — that is the contribution.
Practitioner intuition has been right; the academic framing makes the intuition measurable, comparable, and improvable across teams. Shannon's mutual-information theorem is the right framework precisely because it makes the practitioner intuition rigorous. The framing buys cross-team comparability, learned-weight composite scoring, and a unit of account for context-engineering interventions — all things practitioner intuition alone does not buy. (v) ""Pilot failure rate at 95% is overstated; not every pilot is supposed to succeed; some are exploratory."" Response: the surveys explicitly distinguish exploratory pilots from production-intent pilots and report 89-94% failure for production-intent pilots specifically.
The exploratory-pilot category is separately tracked. The 95% headline is not a confusion between exploration and production. The pilot failure rate is real, measured, and persistent.

DISCUSSION · §14

What the field needs to do next

We close with a forward-looking call. The pilot-failure data point indicts the dominant approach to enterprise AI deployment. The harness > model thesis indicts the dominant approach to AI engineering investment.
The framework mis-calibration finding indicts the dominant approach to agentic infrastructure tooling. The practitioner-vs-academic finding indicts the dominant epistemic hierarchy. Each of these indictments has a constructive response. (a) Procurement should adopt WAB-style maturity floors per pillar as deliverable acceptance criteria.
The buyer no longer asks "is your AI safe?" (unfalsifiable); they ask "show me your L3 Governance artifacts" (falsifiable). We have piloted this in three contracts and observed materially improved vendor accountability. (b) Frameworks should publish their own L0-L4 scores against the WAB rubric and disclose where their defaults deviate from the WAB-recommended posture. A framework that ships multi-agent as default should publish a "DPI badge" stating when its architecture is and is not appropriate, citing Tran & Kiela. (c) Academic publications on agentic systems should report workspace scores alongside model scores, treating workspace as a confounding variable that must be controlled for, not as an implementation detail.
Reviewers should not accept agentic-systems papers that do not specify their workspace WAB score. (d) Regulators drafting AI governance frameworks (EU AI Act implementation guidance, NIST AI RMF revisions, ISO 42001 follow-ups) should integrate workspace-level maturity floors into compliance requirements for high-risk deployments. The 60-cell matrix is auditable by external parties without inside knowledge; this is the property regulator-adopted standards must have. (e) The practitioner-to-academic citation pattern should normalize. Cognition's blog, Anthropic's Cookbook, Karpathy's essays, and similar practitioner publications should be citable peer to peer-reviewed work in agentic engineering.
This requires editorial standards to evolve, which is a slow process; we expect the field to lead from below by simply citing the work and letting the practice diffuse upward.

DISCUSSION · §15

What this manifesto does not claim

We want the scope to be exact. We do not claim model capability has stopped mattering: frontier models remain essential, and a frontier model on top of an L3 workspace beats the previous-generation model on the same workspace. We do not claim WAB is the final word: it is v0.3, with v0.4 (14 pillars, learned weights) and v1.0 (100 audited workspaces, adversarial robustness) on the roadmap.
We do not claim every team should build a Madani-style workspace: many teams are at L0-L1 and the right next step for them is L2, not L4. We do not claim workspaces and models are independent: they interact non-trivially, and the v0.4 work will quantify the interaction. We do not claim the 95% failure rate is invariant: deliberate workspace investment can move a workspace into the 5% success band, which is what counterintuitive claim (2) — the harness > model thesis — operationalizes.
The claim is narrower than it might sound: workspace architecture is currently the bottleneck for enterprise pilots, the marginal return on harness investment exceeds the marginal return on model selection, and the field's tooling and epistemology have not caught up with the empirical evidence. Each of these claims is falsifiable, and each is testable by any team willing to apply the WAB rubric to their own workspace.
CALL · OPEN INVITATION. We invite every team operating an agentic workspace in production to submit it for inclusion in the public leaderboard. Submission is free, requires only the audit checklist completed against your own infrastructure, and produces a published score that you can use to communicate workspace maturity to stakeholders.
The submission process is documented at github.com/ceomadani/workspace-agentic-benchmark/submissions. The bar to entry is intentionally low; the bar to high maturity is intentionally high. We argue this is the right shape of a benchmark: easy to join, hard to game.

References

[1]
MIT Sloan Management Review (2025), The State of AI in Enterprise Pilots.
[2]
Gartner (2025), Forecast: AI Adoption Q4 2025.
[3]
Boston Consulting Group (2026), GenAI in the Enterprise H1 2026.
[4]
Deloitte (2025), State of Generative AI in the Enterprise.
[5]
McKinsey & Company / QuantumBlack (2025), The State of AI 2025.
[6]
Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets, arXiv:2604.02460, Stanford NLP. open ↗
[7]
Cognition Labs (2025), Don't Build Multi-Agents, cognition.ai/blog/dont-build-multi-agents.
[8]
Chenyu Wang & Yang Shu (2026), MetaCogAgent: Prospective Metacognition for Large Language Model Agents, arXiv:2605.17292, Zhengzhou University + Zhejiang University. open ↗
[9]
Cemri M., Pan M.Z., Yang S., Agrawal L.A., Chopra A., Tiwari R., Keutzer K., Parameswaran A., Klein D., Ramchandran K., Zaharia M., Gonzalez J.E., Stoica I. (2025), Why Do Multi-Agent LLM Systems Fail?, NeurIPS 2025 Datasets and Benchmarks Track, arXiv:2503.13657, UC Berkeley + Intesa Sanpaolo. open ↗
[10]
Shinn N., Cassano F., Berman E., Gopinath A., Narasimhan K., Yao S. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS.
[11]
Shannon C.E. (1948), A Mathematical Theory of Communication, Bell System Technical Journal 27.
[12]
Cover T. & Thomas J. (2006), Elements of Information Theory, 2nd ed., Wiley.
[13]
CMMI Product Team (2010), CMMI for Development v1.3, Software Engineering Institute / Carnegie Mellon.
[14]
Humphrey W. (1989), Managing the Software Process, Addison-Wesley.
[15]
Crosby P. (1979), Quality Is Free, McGraw-Hill.
[16]
Deming W.E. (1986), Out of the Crisis, MIT Press.
[17]
Anthropic (2025), Building Agents Cookbook, anthropic.com/cookbook.
[18]
Anthropic (2025), Claude Agent SDK Documentation.
[19]
OpenAI (2025), Agents SDK Documentation, platform.openai.com/docs/agents.
[20]
Karpathy A. (2024), autoresearch: a self-paced strategic loop, blog.
[21]
Es S., James J., Espinosa-Anke L., Schockaert S. (2024), RAGAS: Automated Evaluation of Retrieval Augmented Generation, EACL.
[22]
Park J.S., O'Brien J.C., Cai C.J., Morris M.R., Liang P., Bernstein M.S. (2023), Generative Agents: Interactive Simulacra of Human Behavior, UIST.
[23]
Sumers T., Yao S., Narasimhan K., Griffiths T.L. (2024), Cognitive Architectures for Language Agents, TMLR.
[24]
Yao S., Zhao J., Yu D., Du N., Shafran I., Narasimhan K., Cao Y. (2023), ReAct: Synergizing Reasoning and Acting in Language Models, ICLR.
[25]
Liu N.F., Lin K., Hewitt J., Paranjape A., Bevilacqua M., Petroni F., Liang P. (2024), Lost in the Middle: How Language Models Use Long Contexts, TACL.
[26]
Wang G., Xie Y., Jiang Y., Mandlekar A., Xiao C., Zhu Y., Fan L., Anandkumar A. (2023), Voyager: An Open-Ended Embodied Agent with Large Language Models, arXiv:2305.16291. open ↗
[27]
Hafner D., Pasukonis J., Ba J., Lillicrap T. (2024), Dreamer V3: Mastering Diverse Domains through World Models, Nature.
[28]
Shapley L. (1953), A Value for n-Person Games, in Contributions to the Theory of Games II, Princeton University Press.
[29]
ISO/IEC (2023), 42001 Artificial Intelligence Management System.
[30]
AICPA (2017), SOC 2 Trust Services Criteria.
[31]
European Parliament and Council (2024), Regulation (EU) 2024/1689 on Artificial Intelligence (EU AI Act).
[32]
NIST (2023), AI Risk Management Framework (AI RMF 1.0).
[33]
Madani Lab (2026), WAB-9 Specification v0.3.4, open spec under MIT license, github.com/ceomadani/workspace-agentic-benchmark.

← back to all papersMadani Lab · WAB v0.3.4