Abstract
The proliferation of agentic system "checklists" since 2023 (Anthropic Cookbook 6-axis, OpenAI Agents SDK 9-criterion, LangChain framework's 11-component map) has produced no consensus on what dimensions actually matter, and no orthogonal partition of the design space. Each checklist solves for the constraints of the framework it ships with, not for the underlying engineering reality. This paper derives the 12-pillar / 4-cluster architecture of the Workspace Agentic Benchmark (WAB-9) from first principles and validates its dimensional structure against 142 production tasks across 7 audited workspaces. The contribution is not the production of yet another list. It is the demonstration that the right list emerges from information-theoretic partition analysis rather than from a priori reasoning, and that this method exposes structural properties of the design space invisible to checklist-style enumeration. We surface SEVEN counterintuitive sub-findings:
- (a) THE 12 PILLARS ARE AN INFORMATION-THEORETIC PARTITION, NOT A CHECKLIST: inter-cluster mutual information is below 0.31 nats while intra-cluster correlation exceeds 0.71, indicating the partition captures real structure rather than arbitrary taxonomy
- (b) PILLAR WEIGHTINGS EMERGE FROM SHAPLEY-VALUE CONTRIBUTION ANALYSIS, NOT FROM A PRIORI REASONING: the weights derived from a 142-task Shapley decomposition disagree materially with the weights teams report when asked to assign priority by intuition
- (c) COGNITION AND OPERATIONS ANTI-CORRELATE: Cluster A (Cognition) and Cluster D (Operations) scores are anti-correlated across the 7 audited workspaces (Pearson rho = -0.47), so a workspace strong on one is structurally likely to be weak on the other
- (d) MOST CHECKLIST FRAMEWORKS SILENTLY COLLAPSE 2-3 OF THE 12 PILLARS INTO SINGLE DIMENSIONS: Anthropic's 6-axis bundles Memory + Context, OpenAI's SDK bundles Skills + Tools, and LangChain's component map bundles Observability + Reliability, losing diagnostic power at the exact granularity engineers need to debug failures
- (e) NO EMPIRICAL WORKSPACE SCORES L4 ON ALL 12 PILLARS: the L4-across-all-pillars score is structurally aspirational, not because workspaces are underdeveloped but because the engineering attention required to maintain L4 on one cluster systematically degrades attention on the others
- (f) THE 4-CLUSTER GROUPING IS OPERATIONALLY MORE USEFUL THAN THE 12-PILLAR DETAIL FOR WEEKLY REVIEWS, WHILE THE 12-PILLAR DETAIL IS OPERATIONALLY MORE USEFUL FOR INCIDENT POSTMORTEMS: the two granularities serve different operating cadences and should not be conflated
- (g) FRAMEWORK DEFAULTS VIOLATE PILLAR 04 BEFORE USER CODE: the default templates of the dominant frameworks (CrewAI, AutoGen, LangChain) encode multi-agent topologies that cap Pillar 04 maturity below L3 unless teams explicitly override them
INTRODUCTION · §1
Why the existing checklists do not compose
Between 2023 and 2025 the agentic-systems field generated at least seven candidate "what matters" lists: Anthropic's Building Agents Cookbook (6 axes: tools, planning, memory, reflection, multi-step, evaluation), OpenAI's Agents SDK criteria (9 items spanning agents, runs, threads, tools, files, vector stores, function calling, structured outputs, evals), LangChain's component map (11 dimensions covering chains, agents, tools, memory, vectorstores, retrievers, chat-models, output-parsers, prompts, callbacks, document-loaders), Cognition's "What Devin needs" (8 items), Microsoft AutoGen documentation (7 items), CrewAI canonical example (5 items), MetaGPT SOP framing (6 roles). The lists do not compose: dimensions appearing in one list are absent from another, and dimensions sharing a name across lists have materially different definitions. A team trying to audit their workspace against "the field consensus" finds there is no field consensus to audit against.
The deeper problem is methodological: each list was generated by enumerating the features the authoring framework supports, not by analyzing what dimensions explain failure variance in production. The lists are framework-shaped, not failure-shaped. This is the central diagnostic gap WSB-01 closes.
INTRODUCTION · §2
Why failure analysis is the right method
The classical engineering literature on reliability-dimension selection has a clear answer to "what dimensions matter": the dimensions whose variance explains the variance of the outcome you care about. This is the same logic that guides factor selection in econometrics, feature selection in machine learning, and risk-dimension selection in financial-portfolio analysis. Applied to agentic workspaces, the outcome we care about is task success rate; the dimensions that matter are those whose maturity covaries with task success.
This is a strictly empirical question and demands a strictly empirical method. The remainder of the paper documents how that method, applied to 1,200 production agent runs from the Madani workspace, produced exactly twelve dimensions clustering into exactly four latent factors — and what that structural fact implies about how the field should architect, audit, and procure agentic systems.
"The right unit of analysis emerges from information-theoretic partition rather than a priori reasoning · arbitrary checklists collapse dimensions that should remain orthogonal."— Madani Lab · WAB-9 derivation 2026
INTRODUCTION · §3
What makes a partition good
A dimension partition is "good" by three criteria: (a) COMPLETENESS: together the dimensions explain enough of the outcome variance to be diagnostically useful (R^2 above some threshold, conventionally 0.7); (b) ORTHOGONALITY: the dimensions are sufficiently independent that an intervention on one does not silently force interventions on the others (inter-dimension correlation low); (c) PARSIMONY: the dimension count is no larger than necessary to satisfy (a) and (b), so that practitioners can hold the partition in working memory. Our 12-pillar / 4-cluster solution meets all three criteria empirically: R^2 = 0.81, inter-cluster correlation rho < 0.31, and a dimension count of 12. Twelve is parsimonious relative to the 14-18 dimension candidates we tested, whose marginal lift in explanatory power fell below the noise floor.
12 PILLARS · information-theoretic partition
───────────────────────────────────────────
01 CONTEXT        ─┐
02 SKILLS          │ what the agent KNOWS
03 MEMORY         ─┘
04 MULTI-AGENT    ─┐
05 METACOGNITION   │ how the agent THINKS
06 RELIABILITY    ─┘
07 GOVERNANCE     ─┐
08 CREDENTIALS     │ what the agent IS-ALLOWED
09 OBSERVABILITY  ─┘
10 PORTABILITY    ─┐
11 AUTO-IMPROVE    │ how the agent EVOLVES
12 FORWARD-DEPLOY ─┘
each pillar: quasi-independent dimension
collapsing into checklist → loses explanatory power
RELATED WORK · §4 · ANTHROPIC, OPENAI, AND LANGCHAIN. Anthropic's Building Agents Cookbook (2024-2025) organizes the design space around the Claude product surface: tool use, planning, memory, reflection, multi-step orchestration, evaluation. Six axes.
The list is informed by what the cookbook authors observed Claude users struggle with, and is therefore selection-biased toward problems Claude exposes (rich tool use, complex planning) and away from problems Claude obscures (credential management, structured idempotency). OpenAI's Agents SDK (2025) organizes around the Assistants API: agents, runs, threads, tools, files, vector stores, function calling, structured outputs, evals. Nine items.
The structure mirrors the API surface, which biases toward platform features OpenAI ships (file management, vector stores) and against features the platform does not natively ship (cross-vendor portability, on-prem deployment). LangChain's component map (2024) reflects the library's compositional architecture: chains, agents, tools, memory, vectorstores, retrievers, chat-models, output-parsers, prompts, callbacks, document-loaders. Eleven dimensions.
The structure mirrors the library's class hierarchy, which biases toward compositions LangChain expresses well and against properties the library does not represent (workspace-level discipline, governance, forward-deploy). All three lists are well regarded in the field but structurally incompatible: an item that appears in one list and not in another is not necessarily absent from the other framework, just absent from its documentation.
RELATED WORK · §5 · COGNITION, AUTOGEN, CREWAI, METAGPT. Practitioner sources cover similar ground from different angles. Cognition Labs' "What Devin needs" (2025 blog) emphasizes context depth, single-thread discipline, skills as composable units, and reliability instrumentation — eight items grounded in their internal Devin engineering.
Microsoft AutoGen's documentation (2024) emphasizes multi-agent conversation patterns and structured handoff protocols — seven items reflecting the framework's MA-default architecture. CrewAI's canonical example (2024) emphasizes role specification, task decomposition, hierarchical orchestration, and crew composition — five items reflecting the productized abstraction. MetaGPT (Hong et al., ICLR 2024) emphasizes the SOP metaphor: role hierarchy, structured artifacts, hand-off contracts modeled on enterprise software-team practices — six roles.
Each captures real engineering concerns but each is shaped by the framework's commercial positioning rather than by failure-mode analysis.
RELATED WORK · §6
Academic foundations
The closest academic analog to our partition method is factor analysis applied to organizational reliability dimensions — work originating in Deming's quality-management framework (1986) and operationalized for software engineering through CMMI (Humphrey 1989, SEI/CMU 2010). Factor analysis as a dimension-discovery tool has a rich history in psychometrics (Cattell 1966, Horn 1965); applied to agentic systems the method is novel but the underlying statistical machinery is standard. The Shapley-value contribution analysis we use to derive per-pillar weighting is from cooperative game theory (Shapley 1953) and has been recently adapted to machine learning feature importance (Lundberg & Lee, NIPS 2017). Our use of these methods on agentic-workspace data is to our knowledge the first published application; the methods themselves are not novel.
METHOD · §7
Dual objective and capability derivation
We begin from a stated dual objective of any agentic workspace: maximize information utility (the quality of decisions and outputs the agent produces) and minimize friction (the human and computational cost of producing each decision). We then ask what discrete capabilities the workspace must operationalize to advance both objectives simultaneously. By analyzing 1,200 production agent runs from the Madani workspace and extracting the failure-explanatory variable per run, we converge on 12 distinct capability dimensions whose absence (or low maturity) is causally implicated in at least one observed failure mode. The convergence is not pre-specified: we ran the analysis with candidate counts of 8, 10, 12, 14, 16, and 18 dimensions, and 12 produced the cleanest balance of completeness, orthogonality, and parsimony.
METHOD · §8
Factor analysis and cluster derivation
We tested the 12 dimensions for orthogonality via factor analysis with two distance metrics (cosine, Euclidean) and three clustering algorithms (K-means, Ward linkage, DBSCAN). The 12 dimensions partition cleanly into 4 latent factors (clusters) with low inter-cluster correlation (rho less than 0.31), and the partition is robust across all six metric-algorithm combinations except in one edge case (Ward linkage with Euclidean distance moves Reliability from Trust to Operations, an alternative valid grouping we discuss in §15). The 4-cluster solution dominates 3-cluster (which forces incompatible pillars together) and 5-cluster (which fragments cluster D into two single-pillar clusters with no statistical justification).
METHOD · §9 · SHAPLEY-VALUE WEIGHTING. We computed per-pillar Shapley values on the 142-task held-out validation set using the standard Shapley formulation: each pillar's contribution is its marginal R^2 lift, averaged over all orderings in which the 12 pillars can be added to the model. Exact computation requires all 12! (about 4.8 × 10^8) orderings, so we approximated it by Monte Carlo sampling with 100,000 random orderings (standard ML practice following Lundberg & Lee 2017).
Convergence diagnostics showed Shapley values stable to three decimal places at this sample size. The resulting weights are reported in §14.
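The permutation-sampling estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `monte_carlo_shapley`, `toy_r2`, and the 0.20 overlap term are hypothetical, and the toy value function stands in for "R^2 of a regression fit on a given subset of pillars". Only the three solo R^2 figures are taken from §14.

```python
import random

def monte_carlo_shapley(players, value, n_orderings=2000, seed=0):
    """Estimate Shapley values by averaging each player's marginal
    contribution over randomly sampled orderings of all players."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(n_orderings):
        rng.shuffle(order)
        coalition = frozenset()
        prev = value(coalition)
        for p in order:
            coalition = coalition | {p}
            cur = value(coalition)
            totals[p] += cur - prev  # marginal lift of p in this ordering
            prev = cur
    return {p: totals[p] / n_orderings for p in players}

# Solo R^2 figures reported in section 14; everything else is hypothetical.
SOLO = {"Context": 0.41, "Memory": 0.32, "Observability": 0.09}

def toy_r2(coalition):
    """Hypothetical stand-in for the R^2 of a model fit on `coalition`."""
    total = sum(SOLO[p] for p in coalition)
    if {"Context", "Memory"} <= coalition:
        total -= 0.20  # assumed overlap: shared explanatory power counted once
    return total

phi = monte_carlo_shapley(list(SOLO), toy_r2)
for pillar, weight in phi.items():
    print(f"{pillar}: {weight:.3f}")
```

Because the marginal contributions within each ordering telescope, the estimates always sum to the full-model value minus the empty-model value, which gives a cheap sanity check on any implementation. In a real run the value function would refit the regression on each subset, which is why the number of sampled orderings dominates the cost.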
METHOD · §10
Comparative benchmark construction
To compare WAB-9 against Anthropic's 6-axis, OpenAI's 9-criterion, and LangChain's 11-component map, we re-labeled each of the 142 production tasks against all four taxonomies independently (3 annotators per taxonomy, Cohen's kappa = 0.79 mean across taxonomies). For each taxonomy, we ran the same regression specification: dimension scores predict task outcome. The R^2 values are directly comparable because the underlying data and outcome variable are identical.
FINDINGS · §11
The 12-pillar / 4-cluster structure
The 12 pillars partition into four clusters:
- (i) CLUSTER A · COGNITION: Context (depth, freshness, accessibility), Memory (persistent, retrievable, decay-aware), Multi-Agent DPI (single-thread default, evidence-based delegation).
- (ii) CLUSTER B · ACTION: Skills (modular, composable, hot-swappable), Metacognition (pre-task self-assessment, post-task update), Portability (model-agnostic, exportable state).
- (iii) CLUSTER C · TRUST: Governance (hard rules, compliance gates, audit), Credentials (vault op://, zero plaintext), Observability (structured logging, metrics, telemetry), Reliability (pass@k + MAST taxonomy per Cemri et al. 2025, idempotency, replay).
- (iv) CLUSTER D · OPERATIONS: Auto-Improvement (reflexion, dreams, learning cycles), Forward-Deploy (replicable across contexts, documented onboarding, deterministic install).
Cross-pillar audit · 18 months Madani
The cluster labels (Cognition, Action, Trust, Operations) reflect the functional role of the dimensions within each cluster: Cognition pillars determine what the agent knows; Action pillars determine what the agent can do; Trust pillars determine whether the agent can be deployed safely; Operations pillars determine whether the workspace can be maintained and improved over time.
FINDINGS · §12
Headline empirical validation
The 12-dimension solution achieves R^2 = 0.81 in predicting task outcome variance, significantly outperforming Anthropic's 6-axis (R^2 = 0.54), OpenAI Agents SDK 9-criterion (R^2 = 0.62), and the LangChain 11-component map (R^2 = 0.49). Adding a 13th dimension (we tested 18 candidates including "model selection", "prompt engineering", "tool ecosystem maturity") produced no measurable lift in explanatory power (delta R^2 < 0.01) while introducing collinearity with existing pillars. The 4-cluster partition is robust under perturbation: random re-shuffling of pillar-to-cluster assignment degrades R^2 by 0.15 on average, confirming the clusters capture real structure rather than arbitrary partition.
FINDINGS · §13 · COUNTERINTUITIVE FINDING 1 · THE PARTITION IS INFORMATION-THEORETIC, NOT CHECKLIST-DERIVED. The conventional view of "what dimensions matter" treats the dimensions as a flat checklist: each item is either present or absent, and the system is the union of items. Our partition is structurally different.
The 12 pillars are an information-theoretic partition: within-cluster dependence is high (intra-cluster Pearson rho > 0.71, average rho = 0.78) and between-cluster mutual information is low (inter-cluster MI < 0.31 nats). This is not a property a checklist would have; checklists do not select for information-orthogonality. The implication is operational: interventions targeted at one cluster reliably move other pillars within that cluster (because they share latent structure) while leaving other clusters unchanged (because the dependency graph between clusters is sparse). A team can ship a Memory improvement and observe Context improve alongside; the same team will not observe Governance improve, because the latent factor that drives Memory and Context does not drive Governance.
This is empirically convenient but it required factor analysis to discover; no checklist would surface the property.
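The section reports dependence both as Pearson correlation and as mutual information in nats. One way to relate the two scales is the closed form for a bivariate Gaussian, I = -0.5 ln(1 - rho^2). The sketch below applies it; the Gaussian assumption and the function name `gaussian_mi_nats` are ours for illustration, and the paper does not state how its MI figures were estimated.

```python
import math

def gaussian_mi_nats(rho):
    """Mutual information (nats) of a bivariate Gaussian with correlation rho.

    Assumes joint normality; for other distributions this is only a rough guide.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return -0.5 * math.log(1.0 - rho * rho)

# Correlation levels quoted in the text.
for rho in (0.31, 0.71, 0.78):
    print(f"rho = {rho:.2f} -> I = {gaussian_mi_nats(rho):.3f} nats")
```

Under this assumption, mutual information grows slowly at low correlation and steeply as rho approaches 1, so thresholds stated in nats and thresholds stated in rho are not interchangeable.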
FINDINGS · §14 · COUNTERINTUITIVE FINDING 2 · SHAPLEY WEIGHTS DIFFER FROM INTUITION. We decomposed the composite R^2 = 0.81 into per-pillar contributions via Shapley value analysis on the 142-task dataset. Top-3 (most explanatory) pillars: Context (R^2 = 0.41 alone), Memory (R^2 = 0.32), Multi-Agent DPI policy (R^2 = 0.28). Bottom-3 (least explanatory) pillars: Observability (R^2 = 0.09), Credentials (R^2 = 0.07), Forward-Deploy (R^2 = 0.06).
These low-R^2 pillars are not unimportant: they are necessary-but-not-sufficient. A workspace with weak Observability cannot debug production failures; a workspace with weak Credentials cannot pass enterprise security review; a workspace with weak Forward-Deploy cannot ship. They simply contribute less to task-by-task outcome variance because they operate at the deployment-readiness layer rather than the per-task execution layer.
The counterintuitive finding is the gap between Shapley weights and intuited priority: when we surveyed 18 senior agentic engineers and asked them to rank the 12 pillars by expected importance, the rank correlation with Shapley weights was only Spearman rho = 0.34. Engineers systematically over-weight Tools/Skills (rank 3-4 in intuition, rank 6 in Shapley) and systematically under-weight Memory (rank 7 in intuition, rank 2 in Shapley). The intuited importance reflects framework salience (Tools are visible in every API call; Memory is invisible until it fails), not empirical explanatory power.
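The comparison between intuited and Shapley-derived rankings rests on Spearman's rank correlation, which for tie-free rankings reduces to 1 - 6 * sum(d^2) / (n (n^2 - 1)). A minimal sketch follows; `spearman_rho` and the four-pillar rankings are hypothetical illustrations, not the survey data.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items.

    rank_a, rank_b: dicts mapping item -> rank (1 = most important).
    """
    if rank_a.keys() != rank_b.keys():
        raise ValueError("both rankings must cover the same items")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    d2 = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical four-pillar example (not the survey data).
intuition = {"Skills": 1, "Context": 2, "Observability": 3, "Memory": 4}
shapley = {"Context": 1, "Memory": 2, "Skills": 3, "Observability": 4}

print(f"Spearman rho = {spearman_rho(intuition, shapley):.2f}")
```

With survey data that contains tied ranks, the shortcut formula no longer holds and the correlation should be computed on average ranks instead.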
FINDINGS · §15
Cluster-level correlation structure
Within-cluster correlation was uniformly high (rho > 0.71): improvements in Context tend to co-occur with improvements in Memory and Multi-Agent DPI (all Cluster A). Between-cluster correlation was uniformly low (rho < 0.31): improvements in Trust-cluster pillars rarely predict improvements in Action-cluster pillars. This structural property has practical consequence: teams can specialize cluster-by-cluster (one engineer owns Cognition, another owns Trust, etc.) without coordination overhead, because the dependency graph between clusters is sparse.
We observed this empirically at Madani: cluster ownership has remained stable across 18 months with minimal cross-cluster intervention required. We note for §28 that the inter-cluster decoupling is a feature of the design space, not an accident — clusters that did not decouple cleanly (we tested several alternative groupings) produced both worse R^2 and worse ownership stability.
FINDINGS · §16 · COUNTERINTUITIVE FINDING 3 · COGNITION AND OPERATIONS ANTI-CORRELATE. The most surprising structural finding emerged from cross-workspace scoring: Cluster A (Cognition) and Cluster D (Operations) scores are anti-correlated across the 7 audited workspaces with Pearson rho = -0.47 (p = 0.03). Workspaces with high Cognition scores (deep context engineering, rich memory systems, disciplined multi-agent policy) tend to score low on Operations (weak auto-improvement loops, fragile forward-deploy pipelines).
Workspaces with high Operations scores show the opposite pattern. This is initially counterintuitive because both clusters are valuable; one would naively expect them to co-vary positively. Three plausible mechanisms: (a) ENGINEERING-ATTENTION TRADEOFF — the same senior engineers who can architect deep Cognition pillars are also the ones who could architect Operations infrastructure; they cannot do both simultaneously, and teams accumulate debt in whichever cluster they de-prioritize; (b) PHASE OF MATURITY — early-stage workspaces invest first in Cognition (because that drives demos and product feel) and only later in Operations (because Operations matters only at scale); (c) CULTURAL SELECTION — research-oriented teams gravitate toward Cognition (the interesting work); engineering-oriented teams gravitate toward Operations (the discipline work); few teams have both cultures co-existing. The implication is sharp: a workspace scoring high on Cluster A is structurally likely to be weak on Cluster D, and vice versa.
This is invisible to checklist analysis (which would treat the clusters as independent) and only visible to cluster-level cross-workspace comparison. We discuss organizational interventions to counteract the anti-correlation in §24.
FINDINGS · §17 · COUNTERINTUITIVE FINDING 4 · CHECKLISTS COLLAPSE PILLARS, LOSING DIAGNOSTIC POWER. The 12-pillar / 4-cluster structure makes visible an under-recognized property of the existing checklists: they silently collapse 2-3 of the 12 pillars into single dimensions, losing the diagnostic granularity engineers need to debug failures. Three examples. (a) ANTHROPIC'S 6-AXIS treats "memory" as a single axis spanning both Context (immediate working memory) and Memory (persistent retrievable memory).
These are different dimensions in our factor analysis with within-Cognition correlation only rho = 0.69. A team using the Anthropic checklist cannot distinguish a context-window problem (the agent did not have the right information loaded into immediate working context) from a memory-recall problem (the agent had the right information stored but failed to retrieve it). The two problems demand different remediation. (b) OPENAI'S 9-CRITERION treats "tools" and "function-calling" as separate items but bundles Skills (the agent's composable capability library) into the Tools axis.
Skills and Tools are structurally different in our cluster analysis: Tools are atomic API endpoints; Skills are higher-level capability units that compose multiple tools with workspace-specific business logic. A team using the OpenAI list cannot distinguish "we have the right tools but no skill abstraction" from "we have rich skills but underdeveloped tools". (c) LANGCHAIN'S 11-COMPONENT MAP folds Observability into "callbacks" and Reliability into "callbacks + chains", losing the granularity needed to diagnose monitoring gaps separately from idempotency gaps. The diagnostic-power loss is invisible until a team tries to debug a specific failure mode and discovers the checklist does not name the failure mode at the right granularity.
We argue this is the single biggest practical defect of checklist-style enumeration: it under-fits failure-mode space at the granularity that matters for remediation.
"Each pillar measures a quasi-independent dimension · pairwise mutual information across clusters stays below 0.31 nats while within-cluster correlation exceeds 0.71."— WSB-02 · 12-Pillar architecture
FINDINGS · §18 · COUNTERINTUITIVE FINDING 5 · NO WORKSPACE SCORES L4 ON ALL 12 PILLARS. Across the 7 workspaces we audited (Madani Workspace, OpenAI Agents SDK Python, Anthropic Cookbook, Anthropic Claude Agent SDK, LangChain, CrewAI, Microsoft AutoGen), no single workspace scores L4 (Optimized) on all 12 pillars. The closest is the Madani Workspace at L3.4 average with L4 on 4 of 12 pillars; the next closest is the OpenAI Agents SDK at L2.1 average with L4 on 1 of 12.
The L4-across-all-pillars score is structurally aspirational. Three mechanisms drive this. (a) ATTENTION CONSTRAINT: the engineering attention required to maintain L4 discipline on one pillar (e.g., L4 Reliability requires per-task MAST classification within 24h plus quarterly improvement-velocity tracking) systematically degrades attention on other pillars. (b) PRINCIPLED DECAY: L3-L4 capabilities require active maintenance; without it they decay to L2 within ~6 months (cross-reference WSB-02 §3); the engineering investment to maintain L4 on 12 pillars simultaneously exceeds any team's available attention budget. (c) NEGATIVE INTER-PILLAR FEEDBACK: some L4 capabilities create friction that degrades other capabilities (e.g., strict L4 Governance increases time-to-ship, which degrades the iteration cadence needed for L4 Auto-Improvement). The implication is that L4-across-all is not a realistic target; the realistic target is L3-or-L4 per pillar with explicit acknowledgment of which pillars are L2 and why. We propose this acknowledgment becomes a required field in any WAB self-attestation: "this workspace is L3+ on 8 of 12 pillars; the 4 pillars at L2 are X, Y, Z, W; the strategic reason is R." This shifts the conversation from aspirational claims to operational tradeoffs.
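The proposed self-attestation field could be generated mechanically from per-pillar levels, which makes the stated reason impossible to omit. The sketch below is one hypothetical rendering: `self_attestation`, the example levels, and the example reason are illustrative assumptions, not part of WAB-9.

```python
def self_attestation(levels, reason):
    """Render an attestation sentence from per-pillar maturity levels (0-4).

    Raises ValueError when pillars sit below L3 and no reason is given.
    """
    strong = [p for p, lvl in levels.items() if lvl >= 3]
    weak = {p: lvl for p, lvl in levels.items() if lvl < 3}
    text = f"this workspace is L3+ on {len(strong)} of {len(levels)} pillars"
    if weak:
        if not reason.strip():
            raise ValueError("pillars below L3 require a stated strategic reason")
        listed = ", ".join(f"{p} (L{lvl})" for p, lvl in sorted(weak.items()))
        text += f"; below L3: {listed}; the strategic reason is {reason}"
    return text + "."

# Hypothetical scores for illustration only.
levels = {
    "Context": 4, "Skills": 3, "Memory": 4, "Multi-Agent": 3,
    "Metacognition": 3, "Reliability": 4, "Governance": 3, "Credentials": 3,
    "Observability": 2, "Portability": 2, "Auto-Improve": 2, "Forward-Deploy": 2,
}
statement = self_attestation(levels, "deployment tooling is deferred until after pilot")
print(statement)
```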
FINDINGS · §19 · COUNTERINTUITIVE FINDING 6 · 4-CLUSTER FOR WEEKLY, 12-PILLAR FOR POSTMORTEM. The 12-pillar and 4-cluster granularities serve different operating cadences and should not be conflated. We observed this empirically at Madani over 18 months of weekly operations reviews and incident postmortems. WEEKLY REVIEWS use the 4-cluster summary ("Cluster A trending up, Cluster D trending flat, Clusters B and C stable"), which is the right granularity for a 30-minute leadership review where the decision is portfolio-level reallocation of engineering attention. The 12-pillar detail is too fine: presenting 12 trend lines to a leadership meeting buries the signal in noise. INCIDENT POSTMORTEMS use the 12-pillar detail ("this incident was a Pillar 06 Reliability failure specifically in the FM-1.3 step-repetition mode"), which is the right granularity for diagnosing a specific failure and assigning a specific remediation.
The 4-cluster summary is too coarse: "Cluster C had a failure" does not tell the engineer which pillar to fix. The operational implication is that any tooling built on top of WAB-9 should expose both granularities and let the user choose: dashboards default to cluster-level summary; postmortem templates default to pillar-level detail. Frameworks that ship only one granularity will be misused at the cadence they do not fit.
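A tool that exposes both granularities can keep pillar-level scores as the single source of truth and derive the cluster view from them. The sketch below uses the cluster membership listed in §11; the function names and the choice of a simple mean as the roll-up are assumptions for illustration, since the paper does not specify how cluster scores are aggregated.

```python
# Cluster membership as listed in section 11.
CLUSTERS = {
    "A Cognition": ["Context", "Memory", "Multi-Agent DPI"],
    "B Action": ["Skills", "Metacognition", "Portability"],
    "C Trust": ["Governance", "Credentials", "Observability", "Reliability"],
    "D Operations": ["Auto-Improvement", "Forward-Deploy"],
}

def cluster_view(pillar_levels):
    """Weekly-review granularity: mean pillar level per cluster (assumed roll-up)."""
    return {
        cluster: sum(pillar_levels[p] for p in pillars) / len(pillars)
        for cluster, pillars in CLUSTERS.items()
    }

def pillar_view(pillar_levels, cluster):
    """Postmortem granularity: the individual pillar levels inside one cluster."""
    return {p: pillar_levels[p] for p in CLUSTERS[cluster]}

# Hypothetical scores: everything at L3 except a Reliability regression.
demo_levels = {p: 3 for pillars in CLUSTERS.values() for p in pillars}
demo_levels["Reliability"] = 1

print(cluster_view(demo_levels))             # dashboard default
print(pillar_view(demo_levels, "C Trust"))   # postmortem drill-down
```

The example shows why the two views answer different questions: the cluster view flags that Trust has dipped, and only the pillar view identifies Reliability as the cause.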
FINDINGS · §20 · COUNTERINTUITIVE FINDING 7 · FRAMEWORK DEFAULTS VIOLATE PILLAR 04 BEFORE USER CODE. The most consequential finding for framework procurement: the dominant agentic frameworks (CrewAI, AutoGen, LangChain) ship default templates that encode multi-agent topologies in violation of DPI evidence (WSB-05). The defaults are encoded in the canonical examples shipped with the framework, the boilerplate generated by the framework's create-new-project commands, and the documentation tutorials that new users follow.
Any team adopting these frameworks inherits a Pillar 04 maturity ceiling below L3 unless they explicitly override the defaults — and most teams do not, because the defaults appear authoritative. We audited 14 publicly shared agentic projects (GitHub stars > 100) built on each of these three frameworks; 11 of 14 used the framework's default multi-agent topology unchanged, and all 11 inherited Pillar 04 maturity at L1 or L2. The framework default is therefore a workspace reliability decision, made by the framework author on behalf of the workspace author, with no documentation or pushback.
This is structurally invisible in marketing material: the framework is sold as a productivity tool, but it is operating as a reliability ceiling. The implication for procurement is that "we are building on framework X" is a Pillar 04 disclosure: it commits the team to a maturity ceiling unless they explicitly invest in an override. Most teams do not know this; the field's procurement conversation is uninformed.
DISCUSSION · §21
Vendor framework comparison
We argue earlier frameworks under-fit production reality because they emerged from internal vendor priorities rather than failure-driven empirical analysis. Anthropic's checklist is shaped by the Claude product surface; OpenAI's SDK by their Assistants API design constraints; LangChain's by what their library was built to compose. WAB-9 is vendor-neutral because the empirical method that derived it is vendor-neutral: failure modes observed in production are agnostic to the model class, the framework, or the deployment context.
The 4-cluster partition has additional explanatory power: it isolates which dimensions cluster causally (interventions on one pillar in cluster A typically affect other pillars in cluster A, but rarely cross cluster boundaries), which has implications for ROI prioritization. A workspace at L1 across cluster C (Trust) cannot reach production safely regardless of how mature its cluster A (Cognition) is; conversely, a high-cluster-C workspace with weak cluster A produces safe but useless agents. The procurement implication: cluster-floor requirements ("cluster C must be at L3 or above") are more informative procurement criteria than pillar-level requirements ("Reliability must be at L3"), because the cluster-floor language captures the structural property that pillars co-vary within clusters but not across them.
DISCUSSION · §22
Implications for procurement
Enterprise AI procurement contracts today rely on opaque vendor self-attestations ("production-grade", "enterprise-ready"). The 12-pillar / 4-cluster structure enables a different procurement pattern: cluster-level maturity floors. A buyer can specify "Cluster C must be at L3 or above" without prescribing implementation details, and any external auditor can verify compliance against the 60-cell acceptance matrix (WSB-02).
We have piloted this approach in 3 enterprise contracts and observed materially improved vendor accountability. The buyer no longer asks "is your AI safe?" (unfalsifiable); they ask "show me your L3 Governance artifacts" (falsifiable). The shift from unfalsifiable claims to falsifiable artifacts is the single most important change WAB-style discipline brings to procurement.
Beyond cluster-level floors, the procurement contract can also specify acceptable cluster-pair trade-offs: "the workspace may score L2 on Cluster D if and only if it scores L4 on Cluster A and L3 on Cluster C." This kind of contingent specification is exactly the conversation that becomes possible when the dimensions are partitioned into orthogonal clusters with empirical scoring.
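A contingent specification of this kind is straightforward to encode as an executable check. The sketch below is one reading of the example clause: `meets_contract` is a hypothetical helper, cluster levels are assumed to be integers 0-4, and the default L3 floor on Cluster D outside the exception is our assumption rather than something the text states.

```python
def meets_contract(cluster_levels, floor=3):
    """Check cluster-level floors with one contingent exception.

    Rules encoded (an illustrative reading of the example clause):
      - Cluster C must be at `floor` or above.
      - Cluster D must be at `floor` or above, except that L2 is accepted
        when Cluster A is at L4 and Cluster C is at L3 or above.
    """
    a = cluster_levels["A"]
    c = cluster_levels["C"]
    d = cluster_levels["D"]
    if c < floor:
        return False
    if d >= floor:
        return True
    return d == 2 and a >= 4 and c >= 3

# Hypothetical vendor submissions.
print(meets_contract({"A": 4, "B": 3, "C": 3, "D": 2}))  # exception applies
print(meets_contract({"A": 3, "B": 3, "C": 3, "D": 2}))  # exception not earned
```

Expressing the clause as code forces the buyer to settle ambiguities, such as whether Cluster B carries a floor at all, before the contract is signed.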
DISCUSSION · §23
Implications for framework design
We argue future agentic frameworks should be designed against the 12-pillar architecture rather than against ad-hoc vendor checklists. This produces three concrete design constraints: (a) the framework must support all 12 pillars (or explicitly disclaim which pillars it does not support); (b) the framework should not silently bundle multiple pillars (e.g., LangChain's "memory" abstraction conflates Memory + Context, making per-pillar optimization difficult); (c) the framework should publish reference L0-L4 scores for typical deployments using it. Frameworks that meet these constraints (in our reading: none currently shipping in mainstream use) would dramatically reduce the architectural debt the field is currently accumulating.
We propose a fourth design constraint specifically motivated by Finding 7: (d) the framework's default templates and create-new-project boilerplate must encode Pillar 04 (Multi-Agent DPI) at L3 or above, defaulting to single-thread architecture and requiring explicit user opt-in for multi-agent topologies. This shifts the framework's silent reliability decision into an explicit one.
DISCUSSION · §24
The anti-correlation problem
Finding 3 (Cluster A and Cluster D anti-correlate) is operationally inconvenient: a team that excels at Cognition is structurally likely to be weak at Operations, and vice versa. The implication is that achieving balanced cluster scores requires either (a) a team large enough to staff both clusters independently, or (b) explicit organizational mechanisms to counteract the natural pull. We propose three such mechanisms based on our 18-month operational experience.
MECHANISM 1 · CLUSTER OWNERSHIP ROTATION. Engineers rotate cluster ownership quarterly; the rotation forces them to context-switch from Cognition to Operations periodically, building dual fluency. The cost is short-term productivity dip during transitions; the benefit is long-term elimination of the anti-correlation.
MECHANISM 2 · CROSS-CLUSTER REVIEW. Every Cognition-cluster decision (architecture, retrieval strategy, memory design) gets review from the Operations-cluster owner before deployment, and vice versa. The reviewer's mandate is to flag where the proposed change degrades the other cluster.
MECHANISM 3 · CLUSTER-PAIR METRICS. The dashboard exposes not just per-cluster scores but cluster-pair correlation metrics; a team whose Cluster A is improving while Cluster D is degrading sees an explicit warning.
We have implemented all three mechanisms at Madani; the cluster-pair Pearson rho moved from -0.51 at month 0 to -0.18 at month 12, a measurable reduction in anti-correlation though not full elimination.
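The cluster-pair metric in Mechanism 3 can be sketched as a Pearson correlation over two score histories plus a direction check. The names `pearson` and `cluster_pair_warning`, the -0.3 threshold, and the sample series are hypothetical; the paper does not describe the dashboard's actual rule.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)

def cluster_pair_warning(a_scores, d_scores, rho_threshold=-0.3):
    """Return a warning string when the two clusters move in opposite
    directions and are negatively correlated beyond the threshold."""
    rho = pearson(a_scores, d_scores)
    a_delta = a_scores[-1] - a_scores[0]
    d_delta = d_scores[-1] - d_scores[0]
    if rho < rho_threshold and a_delta * d_delta < 0:
        rising = "A" if a_delta > 0 else "D"
        falling = "D" if rising == "A" else "A"
        return (f"warning: Cluster {rising} improving while "
                f"Cluster {falling} degrades (rho = {rho:.2f})")
    return None

# Hypothetical quarterly cluster scores.
cognition = [2.0, 2.4, 2.9, 3.3]
operations = [3.0, 2.8, 2.5, 2.1]
print(cluster_pair_warning(cognition, operations))
```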
DISCUSSION · §25
When the 12-pillar framework does not apply
We are explicit about scope. The 12-pillar derivation is grounded in knowledge-work agentic deployments (lead generation, customer service, content production, financial analysis, code generation). The framework may not transfer cleanly to: (a) AUTONOMOUS ROBOTICS — physical agents have sensor-actuator-latency dimensions that knowledge-work agents do not; the partition probably requires additional pillars. (b) SCIENTIFIC SIMULATION — agentic systems that perform scientific computation have correctness-verification dimensions that dominate other concerns; the cluster structure probably re-partitions. (c) REAL-TIME GAME-PLAYING AGENTS — strategic-reasoning dimensions specific to game theory probably require dedicated pillars. We hypothesize the core 12 pillars (or close variants) generalize across most knowledge-work agentic deployments but explicitly do not claim universal applicability.
DISCUSSION · §26
Comparison with CMMI
The closest historical analog to WAB-9 is the Capability Maturity Model Integration (CMMI) developed at Carnegie Mellon SEI for software process maturity. CMMI organizes software-engineering capability into process areas (22 in CMMI-DEV v1.3) grouped into 4 categories (Project Management, Process Management, Engineering, Support). The structural analogy to WAB-9's 12-pillar / 4-cluster architecture is striking.
The key differences: (a) CMMI's process areas were specified a priori from software-engineering theory; WAB-9's pillars were derived empirically from failure data — the methodology is different. (b) CMMI uses a 5-level maturity ladder (Initial, Managed, Defined, Quantitatively Managed, Optimizing); WAB-9 uses 5 levels too (L0-L4) but the operationalization differs (WSB-02). (c) CMMI assumes a relatively stable software-engineering process landscape; agentic systems evolve fast enough that the underlying capability definitions themselves shift, requiring framework updates more frequently than CMMI experienced. We treat CMMI as an inspiration for the maturity-ladder approach but explicitly do not treat WAB-9 as "CMMI for AI" — the methodologies diverge in ways that matter for how the framework should be applied.
LIMITATIONS · §27
Limitations
(a) The 12-pillar derivation is grounded in the Madani failure dataset; a workspace operating in a substantially different domain (e.g., autonomous robotics, scientific simulation) might require additional pillars or different cluster boundaries. We hypothesize the core 12 are stable across most knowledge-work agentic deployments but cannot prove it without broader replication. (b) The factor-analysis validation is sensitive to the choice of distance metric and clustering algorithm; we reported results from K-means with cosine distance, which produced the cleanest 4-cluster solution, but Ward linkage produced a marginally different partition (still 4 clusters, but with Reliability moving from Trust to Operations). The 4-cluster structure is robust; the exact pillar-to-cluster assignment in 1-2 edge cases has alternative valid configurations. (c) The R^2 = 0.81 result is from a single workspace's dataset; cross-workspace generalizability is the v0.4 priority. (d) The Shapley-value weights are stable at our sample size but may shift modestly with larger datasets; we will release updated weights when the multi-workspace replication study completes. (e) The anti-correlation finding (rho = -0.47) is based on 7 workspaces; a sample this small cannot support a definitive statistical conclusion, and the effect could attenuate with larger samples. We treat it as a directional finding warranting further investigation, not a confirmed law.
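Limitation (e) can be made concrete with a rough confidence interval for rho = -0.47 at n = 7. The sketch below uses the standard Fisher z-transformation; it assumes the 7 workspaces are the 7 observations behind the correlation and that the scores are approximately bivariate normal, neither of which the analysis above establishes, and the approximation itself is coarse at so small an n.

```python
from math import atanh, sqrt, tanh


def fisher_ci(rho, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation.

    Uses the Fisher z-transformation: z = atanh(rho), SE = 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("Fisher interval needs n > 3")
    z = atanh(rho)
    se = 1.0 / sqrt(n - 3)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


# rho and n are taken from limitation (e); everything else is an assumption.
low, high = fisher_ci(-0.47, 7)
```

Under these assumptions the interval runs from roughly -0.90 to roughly 0.44, so it includes zero. That is consistent with treating the anti-correlation as directional rather than confirmed.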
FUTURE WORK · §28
Future work
The v0.4 expansion is to 14 pillars (adding Lifecycle and Composability) based on emergent failure modes from 6 additional months of instrumentation. We also plan to ship a learned-weight composite scoring model that replaces the current uniform cluster averaging with regression-fitted weights against task outcomes. A multi-workspace replication study (target: 25 workspaces by Q4 2026) will validate the cross-domain stability of the 12-pillar structure.
Three further directions: (1) CROSS-LANGUAGE VALIDATION — current analysis is English-only on tasks scored against Italian/French/English production data; we plan a Mandarin-and-Arabic extension. (2) AUTOMATED PARTITION DISCOVERY — given a new workspace's failure data, can we automatically discover whether the 12-pillar structure applies or whether a different partition is warranted? This is a meta-method extension. (3) PILLAR-INTERACTION MODELING — beyond cluster-level correlation, are there specific pillar-pair interactions (e.g., L3 Memory + L2 Context produces specific failure modes that L2 Memory + L3 Context does not)? Modeling these interactions explicitly would extend the framework from additive to interactive.
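Direction (3) can be stated as a change of model form. The notation below is an illustrative assumption, not the WAB-9 scoring definition: x_i is the maturity level of pillar i, w_i is its weight, and \phi_{ij} is a hypothetical pairwise term defined over level pairs.

```latex
S_{\text{additive}} = \sum_{i=1}^{12} w_i \, x_i
\qquad\text{versus}\qquad
S_{\text{interactive}} = \sum_{i=1}^{12} w_i \, x_i
  + \sum_{1 \le i < j \le 12} \phi_{ij}(x_i, x_j)
```

A plain product term w_{ij} x_i x_j would not be enough for the example given, because it is symmetric in the two levels: L3 Memory + L2 Context and L2 Memory + L3 Context would receive the same interaction value. Distinguishing them requires a level-specific term such as \phi_{ij}, for instance a 5 x 5 table per pillar pair.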
CASE STUDIES · §29
Cross-workspace comparison deep-dive
We provide condensed case studies of the 7 audited workspaces to give texture to the aggregate cluster scores.
- (1) MADANI WORKSPACE: L3.4 average, L4 on Context + Memory + Multi-Agent DPI + Reliability, L2 on Forward-Deploy and Auto-Improvement (the Operations cluster gap reflecting Finding 3).
- (2) OPENAI AGENTS SDK PYTHON: L2.1 average, L4 on Tools (their core competency), L1 on Memory and Forward-Deploy.
- (3) ANTHROPIC COOKBOOK: L1.8 average, L3 on Skills and Context, L1 on Governance and Auto-Improvement.
- (4) ANTHROPIC CLAUDE AGENT SDK: L1.5 average, L3 on Tools, L1 on Memory and Reliability.
- (5) LANGCHAIN: L1.4 average, L3 on Tools (via the LangChain Hub), L1 on Multi-Agent DPI (the framework defaults violate Pillar 04).
- (6) CREWAI: L1.3 average, L2 on Skills and Tools, L1 on Multi-Agent DPI.
- (7) MICROSOFT AUTOGEN: L1.2 average, L2 on Tools, L1 on Multi-Agent DPI (worst Pillar 04 score in the audit, reflecting their MA-default architecture).
The pattern across the 7 workspaces is consistent: each framework has a 1-3 pillar peak reflecting its design priorities and 6-9 pillars at L1 reflecting the dimensions the framework was not built for. No framework dominates; the Madani Workspace's higher average reflects its in-house-tooling approach rather than any individual framework superiority.
CASE STUDIES · §30
Domain-specific pillar profiles
Within the Madani Workspace audit, we further decomposed the per-domain pillar profile across 8 sub-domains: lead-generation, setting, sales, delivery, organization, finance, content, voice-channel. The profile is materially different per domain. LEAD-GENERATION is highest on Skills (L4, given the extensive cold-outreach toolkit) and lowest on Memory (L2, given the high-volume short-lived prospect interactions).
SETTING is highest on Multi-Agent DPI (L4, given the rigorous single-thread enforcement) and lowest on Auto-Improvement (L2, given the short feedback cycles). SALES is highest on Reliability (L4, given the MAST taxonomy enforcement) and lowest on Forward-Deploy (L2, given the bespoke-per-deal nature). DELIVERY is highest on Governance (L4, given the contract-driven discipline) and lowest on Skills (L3, given the project-specific tooling).
FINANCE is highest on Credentials (L4, given the strict bank-API security) and lowest on Context (L2, given the typically short transactional context windows). CONTENT is highest on Context (L4, given the deep brand/voice context engineering) and lowest on Observability (L2, given the difficulty of measuring creative output). VOICE-CHANNEL is highest on Reliability (L4, given the strict sub-second SLA enforcement) and lowest on Memory (L2, given the typically single-call interactions).
The pattern is consistent with Finding 5 (no domain scores L4 on all 12 pillars) and Finding 3 (the Cognition-cluster strong domains tend to be weaker on Operations-cluster pillars).
IMPLEMENTATION PLAYBOOK · §31
Applying WSB-01 in a new workspace
Teams reading this paper for the first time face a practical question: where to start? We provide a 5-step playbook based on the 7-workspace audit experience.
STEP 1 · SELF-SCORE AGAINST WAB-9. Use the 60-cell acceptance matrix (WSB-02) to self-score the workspace against all 12 pillars at all 5 maturity levels. Be conservative: when in doubt between L1 and L2, score L1. The self-score takes 2-4 hours for a team familiar with their workspace; it takes longer for teams who have not previously thought about their workspace at this granularity.
STEP 2 · IDENTIFY THE WEAKEST CLUSTER. Average the per-pillar scores within each cluster and identify the cluster with the lowest average. This is the cluster whose remediation produces the largest immediate lift.
STEP 3 · WITHIN THE WEAKEST CLUSTER, IDENTIFY THE HIGHEST-SHAPLEY-WEIGHT PILLAR. Use Table 2 (Shapley weights, this paper) to prioritize within the cluster. Within Cluster A, Context is the highest-Shapley pillar; within Cluster C, Reliability is the highest.
STEP 4 · DRAFT A SPECIFIC L1-TO-L2 ARTIFACT CHECKLIST. Reference the WSB-02 acceptance matrix for the specific artifacts required at L2 for the target pillar. Draft a project plan to ship those artifacts.
STEP 5 · MEASURE AND ITERATE. After the L1-to-L2 transition, re-score against WAB-9 and compare. The expected lift is +0.3 to +0.5 on the cluster average; if the lift is smaller, the L2 artifacts are likely surface-level and require deeper engineering. We have observed teams complete this loop in 4-6 weeks for the first cluster; subsequent clusters take longer because the easy wins are concentrated in the first cluster addressed.
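Steps 2, 3, and 5 of the playbook reduce to a few lines of arithmetic, sketched below. The pillar-to-cluster map, the self-scores, and the weights are hypothetical placeholders chosen only so the example runs; they are not the Table 2 values or the full WAB-9 partition. The 0.3 threshold is the lower bound of the expected lift stated in Step 5.

```python
from statistics import mean

# Hypothetical, partial pillar-to-cluster map (illustration only).
CLUSTERS = {
    "A": ["Context", "Memory"],
    "C": ["Reliability", "Credentials"],
    "D": ["Forward-Deploy", "Auto-Improvement"],
}

# Hypothetical self-scores on the L0-L4 ladder (Step 1 output).
SCORES = {
    "Context": 1, "Memory": 1,
    "Reliability": 3, "Credentials": 2,
    "Forward-Deploy": 2, "Auto-Improvement": 1,
}

# Hypothetical Shapley-style weights; NOT the values from Table 2.
WEIGHTS = {
    "Context": 0.14, "Memory": 0.11,
    "Reliability": 0.12, "Credentials": 0.06,
    "Forward-Deploy": 0.05, "Auto-Improvement": 0.07,
}


def weakest_cluster(scores, clusters):
    """Step 2: return the cluster with the lowest average and all averages."""
    averages = {c: mean(scores[p] for p in ps) for c, ps in clusters.items()}
    return min(averages, key=averages.get), averages


def priority_pillar(cluster, clusters, weights):
    """Step 3: highest-weight pillar inside the chosen cluster."""
    return max(clusters[cluster], key=lambda p: weights[p])


def lift_is_sufficient(before, after, minimum=0.3):
    """Step 5: did the cluster average rise by at least the expected lift?"""
    return (after - before) >= minimum - 1e-9  # tolerance for float rounding


cluster, averages = weakest_cluster(SCORES, CLUSTERS)
target = priority_pillar(cluster, CLUSTERS, WEIGHTS)
```

With these placeholder numbers the weakest cluster is A and the pillar to remediate first is Context. Swapping in a real self-score and the published weights is the only change needed to apply the same logic to an actual workspace.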
IMPLEMENTATION PLAYBOOK · §32
Anti-patterns we observed
ANTI-PATTERN 1 · ENUMERATION WITHOUT INTERVENTION. Teams complete the self-score, file it in a document, and never act on it. The score is only valuable if it changes behavior.
ANTI-PATTERN 2 · CHASING THE L4 BADGE. Teams attempt to reach L4 on every pillar simultaneously and exhaust engineering capacity. Per Finding 5, L4-across-all is structurally aspirational; teams that pursue it pay the anti-correlation tax (Finding 3) and lose attention on Operations.
ANTI-PATTERN 3 · IGNORING SHAPLEY WEIGHTS. Teams remediate the pillar that is "easiest" rather than the pillar with the highest Shapley weight. This produces visible improvements that do not move the composite score.
ANTI-PATTERN 4 · OVER-WEIGHTING TOOLS. Per Finding 2, engineers systematically over-weight Tools/Skills relative to Shapley-weighted importance. Teams that follow intuition rather than data invest disproportionately in tool integrations and under-invest in Memory and Multi-Agent DPI.
ANTI-PATTERN 5 · ACCEPTING FRAMEWORK DEFAULTS. Per Finding 7, framework defaults silently violate Pillar 04. Teams that do not explicitly override the defaults inherit a Pillar 04 ceiling below L3.
References
- [1]Anthropic (2024-2025), Building Agents Cookbook (https://github.com/anthropics/anthropic-cookbook).
- [2]OpenAI (2025), Agents SDK Documentation (https://platform.openai.com/docs/assistants).
- [3]LangChain (2024), Framework Architecture Guide (https://python.langchain.com/docs/get_started/introduction).
- [4]Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets, arXiv:2604.02460, Stanford NLP.
- [5]Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1.
- [6]Cemri M., Pan M.Z., Yang S., Agrawal L.A., Chopra B., Tiwari R., Keutzer K., Parameswaran A., Klein D., Ramchandran K., Zaharia M., Gonzalez J.E., Stoica I. (2025), Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657v3, NeurIPS 2025 Datasets and Benchmarks Track, UC Berkeley + Intesa Sanpaolo.
- [7]Shapley L.S. (1953), A Value for n-Person Games, in Kuhn H.W. & Tucker A.W. (eds.), Contributions to the Theory of Games, vol. II, Princeton University Press.
- [8]Lundberg S.M. & Lee S.-I. (2017), A Unified Approach to Interpreting Model Predictions, NeurIPS 30.
- [9]Ward J.H. (1963), Hierarchical Grouping to Optimize an Objective Function, Journal of the American Statistical Association 58(301):236-244.
- [10]Cattell R.B. (1966), The Scree Test for the Number of Factors, Multivariate Behavioral Research 1:245-276.
- [11]Horn J.L. (1965), A Rationale and Test for the Number of Factors in Factor Analysis, Psychometrika 30:179-185.
- [12]Humphrey W.S. (1989), Managing the Software Process, Addison-Wesley.
- [13]CMMI Product Team (2010), CMMI for Development, Version 1.3, Carnegie Mellon SEI.
- [14]Deming W.E. (1986), Out of the Crisis, MIT Press.
- [15]Cognition Labs (2025), Don't Build Multi-Agents (cognition.ai blog, steel-man).
- [16]Wu Q. et al. (2024), AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, ICML.
- [17]Moura J. (2024), CrewAI: Framework for Orchestrating Role-Playing Autonomous AI Agents.
- [18]Hong S. et al. (2024), MetaGPT: Meta Programming for Multi-Agent Collaborative Framework, ICLR.
- [19]LangChain (2024), LangGraph: Build Stateful, Multi-Actor Applications with LLMs.
- [20]Shinn N. et al. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS.
- [21]Park J.S. et al. (2023), Generative Agents: Interactive Simulacra of Human Behavior, UIST.
- [22]Cohen J. (1960), A Coefficient of Agreement for Nominal Scales, Educational and Psychological Measurement 20:37-46.
- [23]Anthropic (2025), Claude Agent SDK Documentation.
- [24]Madani Lab (2026), WAB-9 Specification v0.3.4 (MIT-licensed, 12-pillar / 4-cluster reference).
- [25]Madani Lab (2026), WAB-9 142-Task Validation Dataset (anonymized, MIT release pending).
