Abstract
The software industry learned in the 1990s, through the CMMI (Capability Maturity Model Integration) effort at Carnegie Mellon's Software Engineering Institute, that organizational capability is a continuum and that binary "is your team ready" certifications produce more harm than insight. CMMI's 5-level ladder (Initial, Managed, Defined, Quantitatively Managed, Optimizing) became the standard precisely because it forced organizations to confront the gap between "we have a process" and "we measure and improve that process". As of 2025, the AI agentic infrastructure space is in the equivalent of pre-CMMI software: vendor checkmarks like "production-ready" and "enterprise-grade" abound, but they are not measurable, not auditable, and not predictive of failure. This paper adapts the CMMI ladder to agentic workspaces and produces a 60-cell acceptance matrix (12 pillars × 5 levels) that replaces marketing claims with operational evidence. The transfer is not mechanical: CMMI was developed for classical software-engineering processes (requirements management, project planning, configuration management) and several of its assumptions break for agentic infrastructure (faster feedback cycle, smaller team sizes, qualitatively different optimization dimension). We surface seven counterintuitive sub-findings that emerge from the transfer:
- (b) The L0-L4 ladder has 5 levels, not the 6 of standard CMMI v2.0: we collapse L5 "Optimizing" into L4 "Managed" because the optimization dimension in agentic systems is qualitatively different from classical software optimization, making the L4/L5 distinction operationally meaningless at our current measurement resolution
- (c) Workspace maturity decays without active maintenance: L3 workspaces left ungoverned regress to L2 within ~6 months, a finding without parallel in classical CMMI, where maturity decay was rare
- (d) L4 is achievable on specific pillars, but no workspace we have audited achieves L4 across all 12 pillars: the aspirational floor is a structural property of attention budgets, not a measurement artifact
- (e) Maturity correlates with team size in unexpected non-monotonic ways: the L2-to-L3 transition is hardest at 3-5 engineers, not at the larger team sizes you'd expect; the explanation is sociological, not technical
- (f) The maturity ladder is biased toward observable artifacts: teams that pass the artifact criteria without the underlying discipline can game the score, and we observed this in 2 of 7 audited workspaces
- (g) CMMI's "Defined" L3 to "Managed" L4 transition in classical software took 2-3 years for typical orgs, but in agentic systems we observe 6-9 months when discipline is enforced, suggesting agentic infrastructure may be MORE structured than legacy software for maturity-ladder purposes when the workspace is designed for it from the start
The implication is that the maturity-ladder framework transfers but requires several deliberate adaptations to be operationally useful for agentic systems.
INTRODUCTION · §1
Why maturity ladders work
The software-engineering community's experience with CMMI from 1991 to ~2010 is one of the clearest demonstrations that ordinal capability ladders produce better engineering outcomes than binary certifications. The mechanism is well-documented: a binary certification ("we are SOC2 compliant") provides no improvement gradient — the team is either past the bar or below it, and once past it there is no incentive to advance further. An ordinal ladder ("we are at L2 and aiming for L3") provides explicit next-step guidance and creates organizational momentum.
The CMMI literature documents this effect across hundreds of orgs (Humphrey 1989, Paulk 1995, Crosby 1979 for the philosophical antecedent). We argue agentic infrastructure inherits the same need: the field's current state of "production-ready" claims is precisely the binary-certification dead end the software-engineering field outgrew. A maturity ladder is the right tool.
INTRODUCTION · §2
Why CMMI is the right reference
There are several candidate maturity-ladder references in the engineering literature: CMMI (SEI/CMU), the SPICE framework (ISO/IEC 15504), DevOps maturity models (multiple authors 2015-2020), and various enterprise-architecture maturity ladders. We chose CMMI as the reference because
- (b)the process-area selection from CMMI-DEV v1.3's classical software-engineering process areas to the 12 WAB pillars from WSB-01
- (c)the time-to-maturity expectations from CMMI's 2-3-year transitions to faster cycles consistent with the agentic infrastructure tempo
RELATED WORK · §4
Classical CMMI
CMMI emerged at Carnegie Mellon's Software Engineering Institute in 1991, originally as CMM. It became CMMI in 2000 with the integration of multiple disciplines; CMMI-DEV v1.3 followed in 2010 and CMMI v2.0 in 2018. The original 5 levels (Initial, Repeatable/Managed, Defined, Managed/Quantitatively Managed, Optimizing) defined maturity as progressive operational discipline: an L1 organization can deliver software but inconsistently; an L5 organization continuously improves its delivery process via quantitative measurement. CMMI was adopted broadly through the 2000s in defense, aerospace, and large enterprise software.
Its empirical record is well-documented: organizations advancing one CMMI level typically observe 15-25% improvement in on-time delivery, 10-20% reduction in defect rates, and 5-15% improvement in productivity (per the SEI's own assessment studies, see Gibson et al. 2006 for a meta-analysis). The framework was criticized for documentation overhead and bureaucratization risks, particularly when applied to smaller orgs or non-traditional software contexts.
RELATED WORK · §5
DevOps and cloud maturity models
Between 2015 and 2020 the DevOps and cloud-engineering communities produced multiple maturity ladders adapted to faster-feedback software (Microsoft DevOps Maturity Model, Google's Site Reliability Engineering maturity model, Atlassian's DevOps Maturity Assessment, AWS Well-Architected Framework's maturity dimension). These models share with CMMI the ordinal-ladder structure but emphasize automation, observability, and feedback-loop tightness over process documentation. We borrow from this lineage the emphasis on artifact-based verification of live operational data rather than process documentation, which fits agentic workspaces better than classical CMMI's document-heavy approach.
RELATED WORK · §6
SOC2 and ISO 42001
SOC2 (AICPA 2017) and ISO 42001 (ISO/IEC 2023) are the two most prominent compliance frameworks adjacent to agentic infrastructure. SOC2 covers security and operational controls (the "Trust Services Criteria"); ISO 42001 covers AI management system processes. Neither addresses workspace-architecture maturity at the pillar level.
A workspace can be SOC2-certified and ISO-42001-conformant while still being L0 across cluster A (Cognition) — the audits do not assess agentic capability dimensions. WAB fills this gap. We treat SOC2 and ISO 42001 as complementary (a workspace can and should pursue all three) rather than as competitors.
See §20 for further integration discussion.
METHOD · §7
Level definitions grounded in CMMI attributes
We anchor the maturity definition per level to four orthogonal CMMI-derived attributes: (a) whether the capability exists at all, (b) whether it is documented, (c) whether it is measured, (d) whether it is continuously improved against feedback. For each WAB pillar, we then specify what concrete artifacts must be present at each maturity level. The artifacts are deliberately objective (a file exists or doesn't, a metric is logged or isn't, a review meeting happens on cadence or doesn't) so that an external auditor can apply the matrix without subjective judgment. This is the central methodological commitment: the matrix is auditable by an outsider, not just self-reported.
METHOD · §8
The 60-cell acceptance matrix
We constructed the matrix as a 12 × 5 grid (12 WAB pillars from WSB-01, 5 maturity levels L0-L4). Each cell specifies the minimum artifact set required to claim that pillar at that level. We piloted the matrix on 7 production workspaces (Madani as reference, plus OpenAI Agents SDK Python, Anthropic Cookbook, Anthropic Claude Agent SDK, LangChain, CrewAI, Microsoft AutoGen) and validated inter-rater reliability across 3 independent scorers (Cohen's kappa = 0.84, in the "almost perfect" range per Landis & Koch 1977). The pilot run uncovered approximately 5% of cells with judgment-call edge cases that required qualitative discussion to resolve; we documented the resolutions and updated the matrix specification to remove ambiguity.
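The inter-rater figure above can be reproduced in outline as follows. This is a minimal sketch, not the authors' scoring pipeline: Cohen's kappa is defined for two raters, and the text does not say how the three-scorer case was aggregated, so averaging the pairwise kappas is an assumption here, as are the function names. Levels are treated as nominal labels; a weighted kappa would be a reasonable alternative for ordinal L0-L4 scores.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters who scored the same cells (nominal labels)."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must score the same non-empty set of cells")
    n = len(a)
    # Observed agreement: fraction of cells where both raters gave the same level.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(raters):
    """Average kappa over all rater pairs (an assumed aggregation for 3+ raters)."""
    pairs = list(combinations(raters, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Each rater is passed as a flat list of 60 level scores in a fixed cell order, so that position i refers to the same pillar-level cell for every rater.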
METHOD · §9
Why 5 levels not 6
Classical CMMI has 5 levels; CMMI v2.0 has 6 levels (renaming and adjusting). We use 5 levels (L0-L4) and explicitly collapse the L4-equivalent "Managed/Quantitatively Managed" and the L5-equivalent "Optimizing" of classical CMMI into a single L4 "Optimized" level. The reason is that in agentic infrastructure, the operational difference between "quantitatively managed" (we measure improvement) and "optimizing" (we continuously improve) is not measurable at our current resolution.
In classical software the difference is meaningful because the feedback cycle is long enough to distinguish measurement from improvement. In agentic infrastructure the feedback cycle is short enough (per-task feedback vs per-release feedback) that the two activities collapse: a team that measures L3+ pillars and continues to operate the workspace is implicitly improving. Splitting L4 into two levels would produce a distinction without an operational difference.
We may revisit this in v0.4 if the field's measurement capabilities mature enough to support the distinction.
METHOD · §10
Artifact specification rules
Each cell's artifact specification follows three rules, which emerged from the §17 anti-gaming analysis.
- RULE 1 · LIVE DATA. At L2 and above, artifacts must reference data from the last 7 days, not historical snapshots. This prevents the artifact from being a one-time deliverable.
- RULE 2 · CROSS-ARTIFACT REFERENCE. At L3 and above, artifacts must reference each other (a metric dashboard must link to the standards document that defines the metric; a standards document must link to the variance-tracking dashboard that monitors it). This makes coordinated faking of multiple artifacts harder.
- RULE 3 · TIMESTAMP REQUIREMENT. At L4, the improvement-velocity metric must show movement over the last quarter. A stale "we improved" claim from 12 months ago does not satisfy L4.
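The three rules lend themselves to a mechanical check. The sketch below is illustrative only: the `Artifact` record, its field names, and the 90-day reading of "the last quarter" are assumptions, not part of the matrix specification. Rule 2 is checked in one direction (the artifact links to at least one other known artifact); a stricter auditor would also verify the back-link.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Artifact:
    name: str
    level: int                 # maturity level the artifact is claimed for (0-4)
    latest_data_at: datetime   # timestamp of the newest data the artifact references
    references: set = field(default_factory=set)  # names of artifacts it links to


def rule_violations(artifact, known_names, now, last_improvement_at=None):
    """Return the list of rule violations for one artifact (empty list = passes)."""
    violations = []
    # RULE 1: at L2 and above, referenced data must be from the last 7 days.
    if artifact.level >= 2 and now - artifact.latest_data_at > timedelta(days=7):
        violations.append("RULE 1: referenced data is older than 7 days")
    # RULE 2: at L3 and above, the artifact must link to another known artifact.
    others = set(known_names) - {artifact.name}
    if artifact.level >= 3 and not (artifact.references & others):
        violations.append("RULE 2: no link to another known artifact")
    # RULE 3: at L4, improvement must have moved within the last quarter
    # (approximated as 90 days, which is an assumption).
    if artifact.level >= 4 and (
        last_improvement_at is None
        or now - last_improvement_at > timedelta(days=90)
    ):
        violations.append("RULE 3: no improvement movement in the last quarter")
    return violations
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check reproducible when an audit is replayed later.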
FINDINGS · §11
The 5 levels operationalized
The 5 levels are operationalized as:
- L0 · AD-HOC: no defined process; the capability either doesn't exist or exists only as tribal knowledge in one engineer's head. No artifacts.
- L1 · INITIAL: the capability exists and works for at least one happy path, but is undocumented, untracked, and re-invented every time it's needed. Required artifact: a working example.
- L2 · MANAGED: the capability is documented, has owners, and produces telemetry. Performance is measurable but not actively managed. Required artifacts: documentation + owner field + at least one metric emitted to a structured log.
- L3 · DEFINED: the capability is standardized organization-wide. All teams use the same primitives. Variation is tracked and exceptions are explicit. Required artifacts: standards document + acceptance test suite + variance-tracking dashboard.
- L4 · OPTIMIZED: the capability is continuously improved via a closed feedback loop. The team can answer "how have we made this better in the last quarter?" with quantitative data. Required artifacts: improvement-velocity metric + retrospective cadence + reflexion log.
[Figure: Maturity audit · 6 workspaces]
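The level definitions above can be read as a lookup from the artifacts present to the highest level a pillar may claim. The following is a minimal sketch under two assumptions that the text implies but does not state outright: levels are cumulative (a pillar cannot claim L3 while missing an L2 artifact), and the artifact keys shown are hypothetical identifiers chosen for illustration.

```python
# Hypothetical artifact keys, one set per level, taken from the level definitions.
REQUIRED = {
    1: {"working_example"},
    2: {"documentation", "owner_field", "structured_metric"},
    3: {"standards_document", "acceptance_test_suite", "variance_dashboard"},
    4: {"improvement_velocity_metric", "retrospective_cadence", "reflexion_log"},
}


def pillar_level(present):
    """Highest level whose artifacts, and those of every lower level, are all present."""
    present = set(present)
    level = 0  # L0: no artifacts required
    for lvl in sorted(REQUIRED):
        if REQUIRED[lvl] <= present:
            level = lvl
        else:
            break  # a missing artifact at this level caps the score (assumed cumulative)
    return level
```

For example, a pillar with a working example and all three L3 artifacts, but no documentation or owner field, scores L1 under this reading rather than L3.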
Reference scores from 7 audited workspaces: Madani Workspace L2.8 average (B 81.25 composite), OpenAI Agents SDK L1.6 (D 40.8), Anthropic Cookbook L1.3 (F 27.5), Anthropic Claude Agent SDK L1.2 (F 22.5), LangChain L1.2 (F 22.5), CrewAI L1.1 (F), Microsoft AutoGen L1.0 (F). The distribution skews heavily toward L1: the median capability across all 7 workspaces × 12 pillars is L1.4. The picture is clear: the industry is at the equivalent of late-1980s software engineering — capable but undisciplined.
FINDINGS · §12 · COUNTERINTUITIVE FINDING 1 · METHODOLOGICAL AGGRESSIVENESS OF THE TRANSFER. CMMI was developed by the SEI at Carnegie Mellon in 1991 for software process maturity, refined through 19 years of practitioner experience, and validated empirically across hundreds of large enterprises. The methodological care embedded in CMMI reflects this long evolution.
Our transfer to agentic infrastructure is, by comparison, methodologically aggressive: we are applying a framework refined over 19 years to a domain that did not exist 5 years ago. We are explicit about this aggressiveness. The transfer is justified because the structural analogy is real (both classical software and agentic systems are engineered processes with capability dimensions that admit ordinal-ladder organization) but the operational adaptations are not free.
Each assumption embedded in CMMI requires explicit re-examination for the agentic context: the process-area selection, the artifact specifications, the time-to-maturity expectations, the audit cadence. We document the adaptations explicitly in §9-§10 above and the limitations in §22.
FINDINGS · §13 · COUNTERINTUITIVE FINDING 2 · WHY 5 LEVELS NOT 6. The L0-L4 ladder has 5 levels, not the 6 of standard CMMI v2.0. We collapse L5 "Optimizing" into L4 "Managed" because the optimization dimension in agentic systems is qualitatively different from classical software optimization.
In classical software, "optimization" means refining the process itself (better requirements gathering, tighter project planning, lower defect injection rates). The feedback cycle is per-release, typically 3-6 months, so a team has time to measure a baseline and then deliberately improve it. In agentic infrastructure, the feedback cycle is per-task (seconds to hours), and the "optimization" dimension is partly automated via reflexion loops, dreams, and capability-profile updates.
The distinction between "we measure" (L4 Quantitatively Managed) and "we improve" (L5 Optimizing) collapses because the measurement and improvement run on the same fast loop. Splitting the levels would produce two artifact requirements that practically coincide. We may revisit this in v0.4 if the field's measurement capabilities mature enough to support the distinction; for now, the 5-level structure produces cleaner audits.
"Maturity is not a one-shot certification · it is a sustained discipline that decays without active maintenance and must be re-measured periodically."— Madani Lab · workspace maturity audit 2026
FINDINGS · §14 · COUNTERINTUITIVE FINDING 3 · WORKSPACE MATURITY DECAYS. The most operationally consequential finding emerged from longitudinal observation: workspace maturity decays without active maintenance. L3 workspaces left ungoverned regress to L2 within approximately 6 months.
The decay is observable on specific artifacts: the standards documents go stale (referring to APIs that have changed), the variance-tracking dashboards stop being checked (metrics drift outside acceptable ranges without alerting), the acceptance tests fail silently (the test runner is no longer triggered on PR). This decay is qualitatively unlike classical CMMI experience, where maturity decay was rare once organizations had invested in process discipline. The mechanism appears to be that agentic infrastructure changes faster than classical software (model updates, framework releases, new skill additions) and the artifacts that operationalize L3 maturity were calibrated for the workspace state at the time of artifact creation.
Without active maintenance, the artifacts become misaligned with the live system, and the team unconsciously stops trusting them, which causes the L3 discipline to erode. The operational implication: WAB maturity at L3+ requires a quarterly re-audit cycle, not just an initial certification. We have institutionalized this at Madani (every quarter, every L3+ pillar gets re-audited against current state); the decay rate dropped from observable to negligible after this cadence was established.
FINDINGS · §15 · COUNTERINTUITIVE FINDING 4 · L4-ACROSS-ALL IS ASPIRATIONAL. Across the 7 audited workspaces, no single workspace achieves L4 on all 12 pillars. The Madani Workspace, the highest-scoring, achieves L4 on 4 of 12 pillars (Context, Memory, Multi-Agent DPI, Reliability) and L2 on the Operations cluster (Auto-Improvement, Forward-Deploy).
The OpenAI Agents SDK achieves L4 on 1 of 12 (Tools). All other workspaces achieve L4 on zero or one pillar. This is not a measurement artifact: the floor is structural.
The mechanism is engineering-attention budget. Maintaining L4 discipline on one pillar requires explicit ongoing work (quarterly re-audits, improvement-velocity tracking, reflexion-log curation). The attention required to maintain L4 on 12 pillars exceeds what any team has available. Per Finding 5, the L2-to-L3 transition is sociologically hardest at 3-5 engineers (the team size that has neither the structure of a large org nor the bandwidth of an unconstrained startup); the L3-to-L4 transition is harder still and increasingly an attention-budget problem rather than a competence problem.
We propose teams aim for L3+ on the highest-Shapley pillars (per WSB-01) and explicitly maintain L2 on the others, with documented reasons. This is the realistic target, not the aspirational L4-across-all.
FINDINGS · §16 · COUNTERINTUITIVE FINDING 5 · NON-MONOTONIC TEAM-SIZE EFFECT. Maturity correlates with team size in unexpected non-monotonic ways. The L2-to-L3 transition is hardest at 3-5 engineers, not at the larger team sizes you'd expect.
Below 3 engineers, the team is small enough that the entire system fits in one person's head and L3 standardization is overkill — the team operates effectively at L2 with informal coordination. Above 5 engineers (roughly), the team has enough specialization and organizational structure to support L3 discipline naturally; the dedicated tooling and standards documents have an audience that benefits from them. Between 3 and 5 engineers, the team is large enough that L2 informal coordination breaks down (the inevitable miscommunication produces visible defects) but small enough that no one has bandwidth to author the L3 standards documents and acceptance test suites.
Teams at this size oscillate between L2 (work as a tight pair) and L3 (formalize) without successfully transitioning, often for 6-12 months. The mechanism is sociological: the L3 work requires sustained attention from at least one engineer dedicated to discipline-engineering, which a 3-5-person team cannot easily afford. Workspaces that successfully cross 3-5-engineer size while maintaining L3 discipline typically had a founder-engineer who was disposed toward documentation work; workspaces that did not had to grow past 6-7 engineers before the transition completed.
FINDINGS · §17 · COUNTERINTUITIVE FINDING 6 · ARTIFACT-GAMING RISK. The chief operational risk of any artifact-based maturity ladder is artifact-gaming: teams produce the required artifacts without the underlying discipline. We observed this in 2 of 7 audited workspaces (the artifacts existed but were not used in practice).
In one workspace, the variance-tracking dashboard existed and was populated, but no one looked at it; metrics drifted out of acceptable ranges without any action. In the other workspace, the standards document existed but had been written 18 months ago and no longer described how the team actually operated. Both workspaces had passed L3 in their self-attestation but failed when the live-data and cross-artifact-reference rules (§10) were applied strictly.
We counter artifact-gaming with three mechanisms: (a) artifacts must reference live data (e.g., a metrics dashboard must show metrics from the last 7 days, not historical snapshots); (b) at L3+ levels, the artifacts must reference each other (an artifact in isolation is suspect, an artifact that quotes another artifact is harder to fake); (c) a 6-month re-audit cycle catches stale artifacts that have decayed since first scoring. These mechanisms reduce gaming but do not eliminate it; we discuss residual risk in §23.
FINDINGS · §18 · COUNTERINTUITIVE FINDING 7 · FAST TRANSITIONS IN AGENTIC INFRASTRUCTURE. CMMI's "Defined" L3 to "Managed" L4 transition in classical software took 2-3 years for typical organizations. In agentic systems we observe 6-9 months when discipline is enforced, suggesting agentic infrastructure may be MORE structured than legacy software for maturity-ladder purposes — when the workspace is designed for it from the start.
Three mechanisms drive the speed-up. (a) FASTER FEEDBACK CYCLES. Per-task feedback (seconds to hours) lets a team iterate on their maturity practices faster than per-release feedback (3-6 months) in classical software. (b) MORE STRUCTURED OUTPUT. Agentic system outputs (structured tool calls, JSON responses, telemetry events) are more amenable to automated measurement than classical software outputs (defect counts, feature complete dates) which require human judgment. (c) SMALLER TEAMS.
Many agentic workspaces are run by 2-5-person teams, which means the L3-to-L4 organizational alignment happens within a single team's Slack channel rather than across multiple departments. The implication is that the practitioner expectation of "L3 takes years" is calibrated to classical software and is wrong for agentic systems; a well-designed agentic workspace can reach L3 in 2-4 months and L4 on selected pillars in another 4-6 months.
DISCUSSION · §19
Implications for procurement
Enterprise procurement contracts can require WAB maturity floors per pillar as deliverable acceptance criteria. We have piloted this in 3 contracts and observed materially improved vendor accountability: the buyer no longer asks "is your AI safe?" (unfalsifiable); they ask "show me your L3 Governance artifacts" (falsifiable). The shift from unfalsifiable claims to falsifiable artifacts is the central practical contribution of WAB to enterprise AI procurement.
Beyond per-pillar floors, the procurement contract can specify cluster-level floors and pillar-pair contingencies (per WSB-01 §22). A typical contract pattern:
"Cluster C must be at L3 across all 4 pillars; Cluster A must be at L3 on at least 2 of 3 pillars; the remaining cluster B and D pillars must be at L2 minimum with documented improvement plan."
This contract pattern is auditable, falsifiable, and verifiable by external parties.
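Because the quoted pattern is stated in terms of per-pillar levels, an acceptance check can be written directly against audit scores. The sketch below assumes a hypothetical `clusters` mapping from cluster name to pillar identifiers (the actual pillar-to-cluster assignment is defined in WSB-01, not here) and checks only the level floors; the "documented improvement plan" clause still needs a human reviewer.

```python
def meets_contract(scores, clusters):
    """Check the example contract pattern against per-pillar maturity levels.

    scores:   pillar id -> audited level (0-4)
    clusters: cluster name ("A".."D") -> list of pillar ids (hypothetical layout)
    """
    # Cluster C: L3 on every pillar.
    c_ok = all(scores[p] >= 3 for p in clusters["C"])
    # Cluster A: L3 on at least 2 pillars.
    a_ok = sum(scores[p] >= 3 for p in clusters["A"]) >= 2
    # Clusters B and D: L2 minimum on every pillar.
    rest_ok = all(scores[p] >= 2 for name in ("B", "D") for p in clusters[name])
    return {
        "cluster_C": c_ok,
        "cluster_A": a_ok,
        "clusters_B_D": rest_ok,
        "accepted": c_ok and a_ok and rest_ok,
    }
```

Returning the per-clause results alongside the overall verdict lets the buyer point at the specific floor a vendor missed.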
DISCUSSION · §20
Comparison with SOC2 and ISO 42001
WAB is complementary to existing compliance frameworks, not competitive. SOC2 covers security and operational controls; ISO 42001 covers AI management system processes. Neither addresses workspace-architecture maturity at the pillar level.
A workspace can be SOC2-certified and ISO-42001-conformant while still being L0 across cluster A (Cognition) — the audits do not assess agentic capability dimensions. WAB fills this gap. We recommend teams pursue all three frameworks in parallel rather than treating them as alternatives: SOC2 for security controls, ISO 42001 for AI management process, WAB for workspace architectural maturity.
The three frameworks address non-overlapping dimensions of the same overall question (is this agentic system safe to deploy?), and high-trust deployments will increasingly require evidence on all three dimensions.
DISCUSSION · §21
Three implications
Three implications. First, the L0-L4 ladder reframes the question from "is my workspace production-ready" (binary, marketing-friendly) to "which specific pillars are below L2 and what concrete artifacts must I add to move them up" (continuous, engineering-friendly). Second, the 60-cell acceptance matrix is auditable: an external party can apply it without insider knowledge, which is a prerequisite for any meaningful benchmark.
Third, the matrix produces a natural improvement roadmap: the next-best-action for any workspace is the lowest-maturity pillar with the highest deployment criticality. We close by discussing the analogy and disanalogy with software CMMI: the underlying philosophy translates, but agentic workspaces have a faster feedback cycle than 1990s waterfall software (per-task feedback vs per-release feedback), which means L4 (continuous improvement) is operationally cheaper to reach than it was in classical CMMI — provided the workspace was designed for it.
DISCUSSION · §22
Limitations
(a) The 60-cell acceptance matrix is opinionated about what counts as L2/L3/L4 artifacts; alternative specifications could produce different scores. We argue our specification is grounded in production evidence but acknowledge the specification itself is a normative choice. (b) The inter-rater reliability (kappa = 0.84) is high but not perfect; ~5% of cells have judgment-call edge cases that require qualitative discussion. (c) Cross-cultural validity is untested: all 7 audited workspaces are EU/IT or US-rooted; APAC and LatAm workspaces may exhibit different artifact patterns. (d) The decay finding (Finding 3) is based on observation across 12 months at one workspace plus shorter observation at the other 6; longer-horizon decay dynamics may differ. (e) The team-size non-monotonicity (Finding 5) is sociologically intuitive but the sample size is too small for definitive statistical claim. (f) The fast-transition finding (Finding 7) compares against classical CMMI literature, which is itself heterogeneous in its definitional precision; the comparison is directional rather than rigorously controlled.
DISCUSSION · §23
Residual artifact-gaming risk
The three anti-gaming mechanisms (live data, cross-artifact reference, 6-month re-audit) reduce but do not eliminate the gaming risk. Sophisticated adversarial gaming could in principle produce live-data dashboards that report fictional metrics, cross-referenced standards documents that describe a system other than the actual production one, and re-audit responses that paper over the divergence. We have not observed this level of sophistication in our 7-workspace audit, but a procurement context with high financial stakes could incentivize it.
The ultimate safeguard against sophisticated gaming is independent third-party audit with sample-tracing: the auditor follows specific tasks through the system and verifies that the artifacts describe the system the tasks actually flow through. This is expensive but unavoidable for high-stakes deployments.
DISCUSSION · §24
Integration with WSB-01
WSB-01 derives the 12 pillars and 4 clusters; WSB-02 (this paper) operationalizes them through the maturity ladder. The two papers should be read together. The cluster-level structure of WSB-01 has implications for how the 60-cell matrix is used in practice: per-cluster averaging produces the cluster-level scores used in weekly reviews (per WSB-01 §19); per-pillar scoring produces the detailed view used in postmortems. The Shapley weights of WSB-01 §14 inform the prioritization decisions of WSB-02: when a team must choose which pillar to advance from L2 to L3, the highest-Shapley pillar within the weakest cluster is the right target. The two papers together produce a complete framework: WSB-01 for "what dimensions" and WSB-02 for "how to advance the dimensions."
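The prioritization rule described here (the highest-Shapley pillar within the weakest cluster) can be sketched as follows. The pillar names, cluster layout, and Shapley weights passed in are placeholders supplied by the caller; the real values come from WSB-01 and are not reproduced in this paper. Restricting candidates to pillars currently at L2 follows the "advance from L2 to L3" framing and is otherwise an assumption.

```python
from statistics import mean


def next_target(scores, clusters, shapley, from_level=2):
    """Pick the pillar to advance next.

    scores:   pillar -> current level
    clusters: cluster name -> list of pillars
    shapley:  pillar -> Shapley weight (placeholder values supplied by the caller)
    Returns (weakest_cluster, pillar), or (weakest_cluster, None) if no pillar in
    that cluster currently sits at from_level.
    """
    # Per-cluster average, as used for the cluster-level scores.
    cluster_avg = {c: mean(scores[p] for p in ps) for c, ps in clusters.items()}
    weakest = min(cluster_avg, key=cluster_avg.get)
    # Candidates are the pillars ready for the from_level -> from_level + 1 step.
    candidates = [p for p in clusters[weakest] if scores[p] == from_level]
    if not candidates:
        return weakest, None
    return weakest, max(candidates, key=lambda p: shapley[p])
```

Ties between clusters or pillars resolve to the first one encountered; a team applying this in practice would likely want an explicit tie-break.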
"A maturity ladder is meaningful only if the rungs are operationalized with concrete artifacts · without artifact requirements the levels collapse to subjective self-rating."— WSB-04 · Maturity L0-L4 framework
DISCUSSION · §25
Decay mitigation in practice
Finding 3 (workspace maturity decays without maintenance) has shaped our operational practice at Madani. We institutionalized three mitigation mechanisms. MECHANISM 1 · QUARTERLY RE-AUDIT.
Every quarter, every L3+ pillar is re-audited against current state. The audit takes ~3 hours per pillar per quarter, so the upper bound with all 12 pillars at L3+ is 12 × ~3 hours = ~36 hours per quarter, or roughly 4.5 eight-hour days of one engineer's time. The cost is non-trivial but the alternative is undetected decay.
MECHANISM 2 · ARTIFACT REFRESH HOOKS. The CI pipeline includes hooks that fail the build if certain artifacts are stale (e.g., a standards document that has not been updated in 6 months blocks deployment of pillars that depend on it). This automates the staleness detection.
MECHANISM 3 · DECAY METRICS. The dashboard exposes per-pillar decay metrics (time since last refresh, time since last audit, time since last metric outside acceptable range without action). Teams can spot early decay patterns before they cause production failures.
After implementing these mechanisms, observable decay at Madani dropped from monthly to negligible over 12 months.
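An artifact refresh hook of the kind described in Mechanism 2 could look like the sketch below. It is not the Madani CI configuration: the function names are hypothetical, "6 months" is approximated as 183 days, and last-updated timestamps are passed in explicitly (for instance, taken from version-control history) because file modification times in a fresh CI checkout usually reflect checkout time rather than the last real edit.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=183)  # "6 months", approximated in days (assumption)


def stale_artifacts(last_updated, now, max_age=MAX_AGE):
    """Return the sorted names of artifacts whose last update is older than max_age."""
    return sorted(name for name, ts in last_updated.items() if now - ts > max_age)


def ci_exit_code(last_updated, now, max_age=MAX_AGE):
    """Return 1 (fail the build) if any artifact is stale, else 0."""
    stale = stale_artifacts(last_updated, now, max_age)
    for name in stale:
        print(f"STALE: {name} last updated {last_updated[name]:%Y-%m-%d}")
    return 1 if stale else 0
```

A CI step would call `ci_exit_code` and pass its result to `sys.exit`, so that a stale standards document blocks the deployment that depends on it. The same timestamps can feed the "time since last refresh" decay metric in Mechanism 3.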
DISCUSSION · §26
When maturity ladders do not apply
We are explicit about scope. The L0-L4 ladder applies to workspaces engaged in ongoing operation against task-completion outcomes. The framework does not apply cleanly to: (a) ONE-OFF RESEARCH PROTOTYPES — a research notebook that runs once to produce a paper figure has no continuous operation and the maturity ladder concept does not apply. (b) ZERO-MAINTENANCE DEPLOYMENTS — a static prompt deployed once and never updated does not have a maturity ladder meaningfully different from L1. (c) HUMAN-IN-THE-LOOP HIGH-INTERVENTION WORKFLOWS — when humans are checking every output, the workspace's maturity matters less because the human verification compensates for low maturity. The framework is calibrated for the typical case: continuously operating agentic workspaces where humans cannot verify every output.
FUTURE WORK · §27
Future work
(1) Expand the matrix to 70 cells (14 pillars × 5 levels) aligned with the WSB-01 v0.4 14-pillar expansion. (2) Introduce a learned-weight composite scoring model that replaces the current uniform averaging with regression-fitted weights against task outcomes. (3) Multi-workspace replication study (target: 25 workspaces by Q4 2026) to validate the maturity-ladder structure across diverse organizational contexts. (4) Cross-language audit validation: current audits are in English; we plan IT/FR/ES replication. (5) Empirical study of the team-size non-monotonicity (Finding 5) at larger sample size (target: 50+ workspaces of varying team sizes). (6) Investigation of decay mechanisms beyond Madani's environment to test whether the 6-month decay-to-L2 half-life generalizes or is specific to fast-evolving environments.
CASE STUDIES · §28
Madani workspace maturity trajectory
We document the Madani Workspace's L2-to-L3 transition over 18 months as a case study of what the maturity transition looks like in practice. MONTH 0 · BASELINE L1.8. Most pillars at L1, with Context and Memory at L2 (the engineering team's natural strength).
MONTH 0-6 · L2 CONSOLIDATION. We invested in documentation, owner-field assignment, and telemetry emission across all 12 pillars. Average moved to L2.2 by month 6.
MONTH 6-12 · L3 PUSH ON COGNITION CLUSTER. We invested in standards documents, acceptance test suites, and variance-tracking dashboards for the 3 Cognition pillars. Cognition cluster moved to L3.4 average by month 12; non-Cognition clusters remained at L2.1-L2.4.
MONTH 12-18 · L4 PUSH ON HIGHEST-SHAPLEY PILLARS. We added improvement-velocity metrics, retrospective cadences, and reflexion logs to the 4 highest-Shapley pillars (Context, Memory, Multi-Agent DPI, Reliability). These pillars reached L4 by month 18; the other 8 pillars remained at L2.4-L3.1.
The total engineering investment over 18 months was approximately 1.5 FTE-quarters (roughly 4.5 months of one engineer's time, spread across multiple engineers' part-time contributions). The lift in task success rate over the same period: from 0.62 to 0.83.
CASE STUDIES · §29
Anthropic Cookbook audit
Anthropic's Building Agents Cookbook scored L1.3 average in our audit. The cookbook has L3 on Skills and Context (Anthropic's clear strengths) and L1 on Governance, Credentials, and Auto-Improvement. The gap reflects the cookbook's positioning: it is a learning resource, not an operational workspace, and many of the disciplines that levels L2-L4 measure (telemetry, variance tracking, reflexion logs) are out of scope for cookbook content.
We do not interpret the L1.3 score as a criticism of Anthropic's product; we interpret it as evidence that the cookbook serves a different purpose than the workspaces it is meant to teach. Teams who use the cookbook to bootstrap their own workspace should expect to add operational disciplines that the cookbook itself does not model.
CASE STUDIES · §30
Microsoft AutoGen audit
AutoGen scored L1.0 average, the lowest in the audit. The framework has L2 on Tools and L1 across all other pillars. The dominant gap is Multi-Agent DPI (L1), reflecting AutoGen's MA-default architecture, which violates the DPI evidence (WSB-05).
The framework also has L1 on Memory, Auto-Improvement, and Forward-Deploy, the same "demo-strong, production-weak" pattern we observed in several frameworks. As with the Anthropic Cookbook, we interpret the L1.0 score as evidence that AutoGen is a framework for building prototypes rather than for shipping production workspaces; teams using AutoGen in production should expect to invest substantially in operational disciplines the framework does not provide.
IMPLEMENTATION PLAYBOOK · §31
Applying the 60-cell matrix
Teams reading this paper face a practical question: how to apply the matrix in their workspace. We provide a 5-step playbook based on the Madani case study (§28).
STEP 1 · BASELINE AUDIT. Apply the 60-cell matrix to the workspace's current state. Be conservative: when in doubt between L1 and L2, score L1. Document specific evidence for each cell. The audit takes 2-4 hours for a familiar team.
STEP 2 · IDENTIFY THE LOWEST-AVG CLUSTER. Compute per-cluster averages. The cluster with the lowest average is the first remediation target (largest immediate lift potential).
STEP 3 · WITHIN THE TARGET CLUSTER, PICK THE HIGHEST-SHAPLEY PILLAR (per WSB-01). Within Cognition cluster, that's Context; within Trust cluster, that's Reliability; etc.
STEP 4 · DRAFT A SPECIFIC L1-TO-L2 ARTIFACT CHECKLIST. Reference the 60-cell matrix for the specific artifacts required. Draft a project plan to ship those artifacts.
STEP 5 · MEASURE AND ITERATE. Re-score the workspace after the L1-to-L2 transition. The expected lift is +0.3 to +0.5 on the cluster average. If the lift is smaller, the artifacts are likely surface-level and require deeper engineering.
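Steps 2, 3, and 5 are mechanical enough to sketch in code. The version below is a minimal illustration, not the audit tooling itself: the cluster membership, the pillar scores, and the Shapley values are placeholder assumptions chosen only so that Context and Reliability come out as the highest-Shapley pillars of their clusters, as stated in Step 3.

```python
from statistics import mean
from typing import Mapping, Sequence


def cluster_averages(scores: Mapping[str, float],
                     clusters: Mapping[str, Sequence[str]]) -> dict[str, float]:
    """Step 2: uniform average of pillar levels within each cluster."""
    return {name: mean(scores[p] for p in pillars)
            for name, pillars in clusters.items()}


def pick_target(scores: Mapping[str, float],
                clusters: Mapping[str, Sequence[str]],
                shapley: Mapping[str, float]) -> tuple[str, str]:
    """Steps 2-3: lowest-average cluster, then its highest-Shapley pillar."""
    averages = cluster_averages(scores, clusters)
    target_cluster = min(averages, key=averages.get)
    target_pillar = max(clusters[target_cluster], key=lambda p: shapley[p])
    return target_cluster, target_pillar


def lift_meets_expectation(before: float, after: float,
                           expected_min: float = 0.3) -> bool:
    """Step 5: did the cluster average rise by at least the expected lift?
    A small tolerance absorbs floating-point rounding."""
    return (after - before) >= expected_min - 1e-9


# Placeholder data: cluster membership, levels, and Shapley values are assumed.
clusters = {
    "Cognition": ["Context", "Memory", "Skills"],
    "Trust": ["Reliability", "Governance", "Observability"],
}
scores = {"Context": 2, "Memory": 2, "Skills": 1,
          "Reliability": 1, "Governance": 1, "Observability": 2}
shapley = {"Context": 0.20, "Memory": 0.15, "Skills": 0.05,
           "Reliability": 0.18, "Governance": 0.07, "Observability": 0.06}

target = pick_target(scores, clusters, shapley)  # ("Trust", "Reliability")
```

In this example the Trust cluster averages about 1.33 against about 1.67 for Cognition, so Trust is the remediation target and Reliability is the pillar to start with. After the L1-to-L2 work, `lift_meets_expectation` compares the re-scored cluster average against the +0.3 lower bound from Step 5.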
CASE STUDIES · §31a · LANGCHAIN AUDIT. LangChain scored L1.2 average in our audit. The framework has L3 on Tools (via the rich LangChain Hub of integrations) and L1 on most other pillars.
The pattern reflects LangChain's positioning as a library-first composable framework: it gives developers powerful Tools and primitives but offers no opinionated guidance on workspace-level discipline. The dominant gap is Multi-Agent DPI (L1), reflecting LangGraph's default graph patterns which violate DPI evidence (WSB-05) by encouraging multi-agent topologies as the natural example. Memory is L1 (the framework provides memory primitives but no discipline around when to use which); Reliability is L1 (no idempotency keys, no replay infrastructure); Forward-Deploy is L1 (deployment is left entirely to the consuming team).
A team building on LangChain in production should expect to add operational disciplines that fill 8+ of the 12 pillar gaps, which is a non-trivial engineering investment.
CASE STUDIES · §31b · CREWAI AUDIT. CrewAI scored L1.1 average. The framework has L2 on Skills (the crew/role abstraction is well-developed) and L2 on Tools, with L1 across all other 10 pillars.
The dominant gaps mirror those of AutoGen and LangChain: Multi-Agent DPI L1 (the crew abstraction is itself an MA-default), Memory L1, Reliability L1, Auto-Improvement L1. The framework is the strongest demo-friendly choice among the three (the crew abstraction is intuitive and shippable for prototypes) but offers the least production-ready discipline. Teams using CrewAI for production deployments often discover that the easy initial demo masks deep operational debt that becomes expensive to retire later.
CASE STUDIES · §31c · ANTHROPIC CLAUDE AGENT SDK AUDIT. The Claude Agent SDK scored L1.2 average. The SDK has L3 on Tools (Claude's native tool use is well-engineered) and L1 on most other pillars.
The pattern is consistent with the Anthropic Cookbook audit: the product surface emphasizes model and tool capabilities while leaving workspace-level discipline to the consuming team. The L1 score on Memory (no opinionated memory primitives), Reliability (no idempotency or replay), and Forward-Deploy (no deployment scaffolding) reflects design choices to keep the SDK minimal. Teams building on the SDK in production should expect to add Memory and Reliability infrastructure as their first L1-to-L2 investment.
CASE STUDIES · §31d · OPENAI AGENTS SDK AUDIT. The OpenAI Agents SDK (Python) scored L1.6 average — the second-highest among non-Madani workspaces. The SDK has L4 on Tools (the function-calling and tool-use story is mature) and L2 on multiple pillars including Skills, Observability, and Governance (the platform provides telemetry and policy infrastructure).
The L1 scores on Memory, Forward-Deploy, and Auto-Improvement reflect the SDK's coverage gaps. The relatively high overall score reflects OpenAI's investment in platform infrastructure that supports L2 maturity by default. Teams building on the SDK can reach L3 on selected pillars faster than teams starting from less-developed frameworks, though the SDK's platform lock-in creates Portability concerns (L1).
DISCUSSION · §31e · THE PLATFORM VS LIBRARY TRADEOFF. The case study comparison surfaces a structural tradeoff between platform-style frameworks (OpenAI Agents SDK) and library-style frameworks (LangChain, CrewAI, AutoGen). Platform frameworks provide L2 baseline maturity on Trust-cluster pillars (Governance, Observability, Credentials) because the platform infrastructure handles these dimensions.
Library frameworks leave these dimensions to the consuming team. The tradeoff: platform frameworks lock the team into the platform's specific decisions (which constrains Portability) while library frameworks require more engineering work but preserve flexibility. There is no universally right answer, but teams should be aware that the choice has direct maturity-score implications.
A team that wants to reach L3+ on the Trust cluster fast may benefit from a platform framework; a team that wants Portability above L2 may prefer a library framework with more engineering investment.
IMPLEMENTATION PLAYBOOK · §32
Anti-patterns we observed
ANTI-PATTERN 1 · TARGETING L4 ON EVERY PILLAR. Per Finding 4, this is structurally aspirational and the attempt exhausts engineering capacity. Target L3 on highest-Shapley pillars; explicitly maintain L2 on others.
ANTI-PATTERN 2 · ATTESTATION WITHOUT EVIDENCE. Teams self-attest without producing the live-data artifacts. The audit fails when the strict live-data rule is applied.
ANTI-PATTERN 3 · STALE ARTIFACTS. Standards documents written 18 months ago; dashboards that were created and never updated. The 6-month re-audit cycle catches these, but only if the cycle is institutionalized.
ANTI-PATTERN 4 · IGNORING DECAY. Teams that achieve L3 and then "celebrate completion" and stop maintaining the artifacts. Per Finding 3, this leads to regression to L2 within 6 months. The L3 status is ongoing work, not a milestone.
ANTI-PATTERN 5 · TREATING WAB AS COMPETITOR TO SOC2/ISO. The frameworks address non-overlapping dimensions; high-trust deployments need all three.
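Anti-patterns 2 and 3 lend themselves to an automated check during the re-audit. The sketch below is one possible shape for it, under stated assumptions: the `CellEvidence` record, its field names, the flag strings, and the 182-day approximation of the 6-month cycle are all hypothetical and not taken from the WAB specification.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed approximation of the 6-month re-audit cycle.
REAUDIT_WINDOW = timedelta(days=182)


@dataclass
class CellEvidence:
    pillar: str
    claimed_level: int
    artifact: Optional[str]        # location of the live-data artifact, if any
    last_updated: Optional[date]   # when the artifact last changed


def audit_flags(cell: CellEvidence, today: date) -> list[str]:
    """Return the anti-patterns a single matrix cell exhibits."""
    if not cell.artifact:
        # Anti-pattern 2: a claimed level with no artifact behind it.
        return ["attestation-without-evidence"]
    if cell.last_updated is None or today - cell.last_updated > REAUDIT_WINDOW:
        # Anti-pattern 3: the artifact exists but has not been maintained.
        return ["stale-artifact"]
    return []


today = date(2026, 1, 1)
cells = [
    CellEvidence("Governance", 3, None, None),
    CellEvidence("Observability", 3, "dashboards/variance", date(2024, 7, 1)),
    CellEvidence("Context", 3, "standards/context.md", date(2025, 12, 1)),
]
report = {c.pillar: audit_flags(c, today) for c in cells}
```

Here Governance is flagged for attestation without evidence, Observability for a stale artifact, and Context passes. A check like this only helps if it runs on the institutionalized re-audit cadence described under Anti-pattern 3.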
References
- [1] CMMI Product Team (2010), CMMI for Development, Version 1.3, Carnegie Mellon Software Engineering Institute.
- [2] Humphrey W.S. (1989), Managing the Software Process, Addison-Wesley.
- [3] Paulk M.C. et al. (1995), The Capability Maturity Model: Guidelines for Improving the Software Process, Addison-Wesley.
- [4] Crosby P.B. (1979), Quality Is Free: The Art of Making Quality Certain, McGraw-Hill.
- [5] Deming W.E. (1986), Out of the Crisis, MIT Press.
- [6] ISO/IEC (2023), 42001 Artificial Intelligence Management System.
- [7] AICPA (2017), SOC 2 Trust Services Criteria.
- [8] Gibson D.L., Goldenson D.R., Kost K. (2006), Performance Results of CMMI-Based Process Improvement, SEI Technical Report CMU/SEI-2006-TR-004.
- [9] Landis J.R. & Koch G.G. (1977), The Measurement of Observer Agreement for Categorical Data, Biometrics 33:159-174.
- [10] Cohen J. (1960), A Coefficient of Agreement for Nominal Scales, Educational and Psychological Measurement 20:37-46.
- [11] Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets, arXiv:2604.02460, Stanford NLP.
- [12] Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1.
- [13] Cemri M., Pan M.Z., Yang S., Agrawal L.A., Chopra B., Tiwari R., Keutzer K., Parameswaran A., Klein D., Ramchandran K., Zaharia M., Gonzalez J.E., Stoica I. (2025), Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657v3, NeurIPS 2025 Datasets and Benchmarks Track.
- [14] CMMI Institute (2018), CMMI Model V2.0, ISACA.
- [15] ISO/IEC (2004), 15504 Software Process Assessment (SPICE).
- [16] Forsgren N., Humble J., Kim G. (2018), Accelerate: The Science of Lean Software and DevOps, IT Revolution Press.
- [17] Beyer B. et al. (2016), Site Reliability Engineering: How Google Runs Production Systems, O'Reilly.
- [18] Curtis B., Hefley B., Miller S. (2002), People CMM: A Framework for Human Capital Management, Addison-Wesley.
- [19] Anthropic (2024-2025), Building Agents Cookbook.
- [20] OpenAI (2025), Agents SDK Documentation.
- [21] LangChain (2024), Framework Architecture Guide.
- [22] Wu Q. et al. (2024), AutoGen, ICML.
- [23] Moura J. (2024), CrewAI.
- [24] Madani Lab (2026), WAB Acceptance Matrix v0.3.4 (60 cells, MIT-licensed).
- [25] Madani Lab (2026), WAB-9 Specification v0.3.4.
