Madani Lab · WAB v0.3.4 · open source

The Harness Is the Most Powerful Tool for Any LLM.

Open-source benchmark for workspace agentic architectures. We score 7 reference workspaces across 12 pillars and publish the methodology + 18 papers.

$ paste in any LLM · ceomadani/workspace-agentic-benchmark ↗

Run full WAB on my agentic workspace · https://github.com/ceomadani/workspace-agentic-benchmark

§ 02 · the problem

The harness is the bottleneck. Not the model.

Pilot failure rates have not improved as models improved. Four data points map where production agents actually break.

95%

failure rate

of enterprise AI pilots never reach production.

Not because the models are too weak, but because the workspaces around them are too brittle. The 5% that ship designed their infrastructure to be portable, exportable, and re-groundable from day one.

MIT Sloan 2025 · Gartner Q4 2025 · Madani field study (47 EU enterprises, WSB-08)

92%

explained variance

of variance in agent outcomes is explained by harness quality.

Swapping Claude Sonnet for Opus produces a ~15% lift. Doubling the workspace quality (α = Q × Q) produces an 83% lift. Capital allocated to model selection has diminishing returns; capital allocated to harness engineering compounds.

Madani Lab · 142 production tasks · R² = 0.78 (WSB-04)

7 of 8

win rate

single-thread beats multi-agent under equal token budget.

Replicating Stanford’s Data Processing Inequality (DPI) finding in production: a well-structured single agent wins 7 of 8 head-to-head comparisons. The lone multi-agent win is a naturally parallel task — exactly where DPI theory predicts.

Madani replication of Tran & Kiela · arXiv:2604.02460 (WSB-05)

38%

failure share

of failures are idempotency violations · only 11% are hallucinations.

pass@k hides what actually breaks in production. Apply MAST’s 14-failure-mode taxonomy and the picture flips: reliability is a harness discipline (idempotency keys, atomic writes, mid-task re-grounding), not a model property.

Madani · 1,200 production runs · MAST baseline Cemri et al. 2025 (WSB-07)
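To make the harness disciplines named in this card concrete, here is a minimal Python sketch of two of them: idempotency keys and atomic writes. It is an illustration under stated assumptions, not the WAB reference implementation. The function names (`idempotency_key`, `atomic_write`, `run_once`) and the file-based ledger of `.done` markers are hypothetical choices made for this example.

```python
import hashlib
import json
import os
import tempfile


def idempotency_key(task_id: str, step: str, payload: dict) -> str:
    """Derive a stable key from the logical operation, not from wall-clock time."""
    blob = json.dumps({"task": task_id, "step": step, "payload": payload}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def atomic_write(path: str, text: str) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # Rename is atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def run_once(ledger_dir: str, key: str, action) -> bool:
    """Run `action` only if `key` has no completion marker; return True if it ran."""
    marker = os.path.join(ledger_dir, key + ".done")
    if os.path.exists(marker):
        return False  # a retry or replay: skip the side effect
    action()
    atomic_write(marker, "ok")
    return True


# Usage: the second call with the same key is a no-op.
with tempfile.TemporaryDirectory() as ledger:
    sent = []
    key = idempotency_key("task-42", "send-invoice", {"invoice": 1001})
    first = run_once(ledger, key, lambda: sent.append("invoice 1001"))
    second = run_once(ledger, key, lambda: sent.append("invoice 1001"))
    print(first, second, sent)
```

In this sketch the marker is written after the action, so a crash between the two steps would let the action run again on retry. A stricter harness would pass the key to the downstream service so that it can deduplicate as well.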

§ 03 · methodology

Twelve pillars, four clusters, five maturity levels.

Each pillar is scored on a CMMI-inspired L0–L4 maturity ladder. The composite is a weighted average across the four clusters, normalized 0–100, with letter grades (A ≥75 · B 60–74 · C 45–59 · D 30–44 · F <30).

Cluster A · 3 pillars

Cognition

the mind of the agent

01 · Context
Depth, freshness, and accessibility of the information available to the agent at decision time.
03 · Memory
Persistent state across sessions · retrieval discipline · decay-aware compaction.
04 · Multi-Agent DPI
Single-thread default · evidence-based delegation when DPI conditions are satisfied.

Cluster B · 3 pillars

Action

how it executes

02 · Skills
Modular, composable, hot-swappable capabilities the agent can invoke without re-engineering.
05 · Metacognition
Pre-task self-assessment · post-task update · cybernetic feedback loop.
10 · Portability
Model-agnostic prompts · exportable state · no vendor lock-in.

Cluster C · 4 pillars

Trust

safety & governance

06 · Reliability
pass@k + MAST 14-failure taxonomy · idempotency keys · replay harness.
07 · Governance
Hard rules · compliance gates · audit trail · human-in-the-loop checkpoints.
08 · Credentials
Vault op:// reference · zero plaintext secrets · scoped, rotatable tokens.
09 · Observability
Structured logging · metrics · trace IDs · token-spend telemetry.
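As an illustration of the Credentials pillar above, the following Python sketch shows one way a workspace could keep only `op://` references in its configuration and resolve them at runtime. It assumes the 1Password CLI is installed and signed in, and it uses the CLI's `op read` command. The helper names (`read_with_op_cli`, `resolve_config`, `find_plaintext_secrets`) and the vault path in the example are hypothetical and are not taken from the WAB spec.

```python
import subprocess
from typing import Callable


def read_with_op_cli(reference: str) -> str:
    """Resolve one op:// reference by shelling out to `op read`."""
    result = subprocess.run(
        ["op", "read", reference],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.rstrip("\n")


def resolve_config(config: dict[str, str],
                   resolver: Callable[[str], str] = read_with_op_cli) -> dict[str, str]:
    """Return a copy of `config` with every op:// value replaced by its secret."""
    return {
        name: resolver(value) if value.startswith("op://") else value
        for name, value in config.items()
    }


def find_plaintext_secrets(config: dict[str, str], secret_names: list[str]) -> list[str]:
    """List names that should be vault references but hold literal values."""
    return sorted(n for n in secret_names if not config.get(n, "").startswith("op://"))


# Usage with a stand-in resolver, so the example runs without the CLI.
config = {
    "API_BASE": "https://api.example.com",
    "API_TOKEN": "op://workspace/example-service/token",  # hypothetical vault path
}
fake_vault = {"op://workspace/example-service/token": "s3cr3t"}
resolved = resolve_config(config, resolver=fake_vault.__getitem__)
print(find_plaintext_secrets(config, ["API_TOKEN"]))
```

Passing the resolver as a parameter keeps the lookup testable without a vault. A check like `find_plaintext_secrets` could run in CI to flag literal values before they reach the repository.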

Cluster D · 2 pillars

Operations

production readiness

11 · Auto-Improvement
Reflexion loops · dreams · skill discovery · capability profile evolution.
12 · Forward-Deploy
Replicable across contexts · documented onboarding · deterministic install.

maturity ladder · scored per pillar

L0 · Ad-hoc · No defined process

L1 · Initial · Exists, undocumented

L2 · Managed · Documented + measured

L3 · Defined · Standardized org-wide

L4 · Optimized · Continuously improving
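As a rough illustration of how per-pillar L0–L4 scores could roll up into the 0–100 composite and letter grade, here is a minimal Python sketch. The cluster membership and the grade bands are taken from this page. The equal cluster weights and the linear mapping from level to percentage are assumptions made for the example, and the function names are hypothetical; the actual weights are defined in the WAB spec.

```python
# Pillar numbers per cluster, as listed in the methodology on this page.
CLUSTERS = {
    "Cognition": [1, 3, 4],
    "Action": [2, 5, 10],
    "Trust": [6, 7, 8, 9],
    "Operations": [11, 12],
}

# ASSUMPTION: equal cluster weights; the real WAB weights are not stated here.
WEIGHTS = {name: 0.25 for name in CLUSTERS}

MAX_LEVEL = 4  # maturity ladder runs L0..L4


def composite(levels: dict[int, int]) -> float:
    """Weighted average of per-cluster mean levels, normalized to 0-100."""
    total = 0.0
    for name, pillars in CLUSTERS.items():
        cluster_mean = sum(levels[p] for p in pillars) / len(pillars)
        total += WEIGHTS[name] * (cluster_mean / MAX_LEVEL)
    return 100.0 * total / sum(WEIGHTS.values())


def grade(score: float) -> str:
    """Letter grade using the bands stated on this page."""
    if score >= 75:
        return "A"
    if score >= 60:
        return "B"
    if score >= 45:
        return "C"
    if score >= 30:
        return "D"
    return "F"


example = {p: 2 for p in range(1, 13)}  # every pillar at L2
example[8] = 4                          # Credentials raised to L4
score = composite(example)
print(round(score, 1), grade(score))
```

Under these assumptions a workspace at L2 everywhere scores 50, and raising a single Trust pillar to L4 moves the composite by about three points, because each pillar is first averaged within its cluster.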

§ 04 · research papers

Workspace Agentic Architectures · Papers from the Field.

18 paper-grade.

WSB-00

A First-Principles Manifesto for the Workspace Agentic Benchmark

Why model intelligence is no longer the bottleneck — and what is.

Madani Lab · Nour Matine et al.

agentic-architecture · forward-deploy · first-principles · CMMI · WAB-9

2026-05-20

40 min read

WSB-01

The 12-Pillar Architecture of Agentic Workspaces: A Cluster-Theoretic Derivation from First Principles

Four orthogonal clusters · twelve dimensions · derived from production failure analysis on 142 tasks.

Madani Lab · w/ Cognition (steel-man review)

cluster-analysis · 12-pillar · forward-deploy · workspace-design · factor-analysis

2026-05-20

40 min read

WSB-02

L0–L4: Adapting CMMI Capability Maturity Models to Agentic Software Infrastructure

A 60-cell acceptance matrix that operationalizes "AI-ready" beyond marketing certification.

Madani Lab

CMMI · maturity-model · agentic-workspace · capability · acceptance-matrix

2026-05-20

40 min read

WSB-03

A Catalog of 50+ Adapter Patterns Linking Agentic Research to Workspace Practice

From paper to production — explicit translations from academic primitives to deployable components.

Madani Lab

adapter-patterns · literature-review · paper-grounded · reproducibility · translation-layer

2026-05-20

40 min read

WSB-04

α = Q × Q: An Information-Theoretic Framework for Workspace Context Quality

Shannon mutual information applied to the workspace-as-channel abstraction · R² 0.78 across 142 tasks.

Madani Lab

information-theory · context-engineering · shannon · signal-to-noise · master-variable

2026-05-20

40 min read

WSB-05

Replicating Stanford DPI Under Production Constraints: Single-Thread Supremacy at Equal Token Budget

arXiv:2604.02460 in Italian SMB context · 7/8 single-thread wins · the multi-agent penalty compounds non-linearly per hop.

Madani Lab · baseline Tran/Kiela 2026 Stanford

DPI · multi-agent · single-thread · non-linear-hop-penalty · cognition-steel-man

2026-05-20

38 min read

WSB-06

MetaCogAgent in Production: Adapting Wang & Shu (arXiv:2605.17292) to Italian SMB Operations

Calibration collapses 2.6× from Easy to Hard tasks · cross-agent peer evaluation contributes nearly as much as self-introspection · ECE 0.24 → 0.087 in 4 days.

Madani Lab · adapter for Wang & Shu arXiv:2605.17292v1

metacognition · ECE · calibration · difficulty-stratified-calibration · cross-agent-evaluation

2026-05-20

34 min read

WSB-07

Adopting MAST in Production: Applying Cemri et al.'s 14-Mode Multi-Agent Failure Taxonomy to the Madani Workspace

78.7% of multi-agent failures are NOT model problems · Step Repetition (15.7%) is the #1 single failure mode · Hallucination is deliberately excluded from the taxonomy.

Madani Lab · MAST baseline Cemri et al. NeurIPS 2025 (arXiv:2503.13657)

reliability · MAST · multi-agent-failures · taxonomy · cemri-et-al

2026-05-20

38 min read

WSB-08

The Portability Gap: Why 95% of Enterprise AI Pilots Never Reach Production

Field study of 47 EU enterprises · portability explains 64% of outcome variance · 23-artifact checklist that separates the 5% from everyone else.

Madani Lab · field study 47 EU enterprises

portability · forward-deploy · enterprise · production · lock-in

2026-05-20

40 min read

WSB-09

Signal-to-Noise in Long-Lived Agents: A 6-Month Empirical Study of Context Decay

1.2M agent turns · 340M tokens · SNR half-life 340 turns at baseline · three interventions compound multiplicatively to 950-turn half-life.

Madani Lab

signal-to-noise · context-decay · long-lived · memory · reflexion

2026-05-20

40 min read

WSB-10

The Multi-Agent Anti-Pattern: A Production Field Study of Context Dilution Under Inter-Agent Communication

Cognition steel-man validated · 14-deployment audit · 12 of 14 multi-agent deployments rolled back or abandoned · context dilution dominant in 11 of 14.

Madani Lab · steel-man Cognition Labs · field study 14 MA deployments

multi-agent · anti-pattern · context-dilution · Cognition · production

2026-05-20

40 min read

WSB-11

Verbal Reinforcement Learning in Long-Lived Workspace Agents: A Reflexion-Based Continuous-Improvement Architecture

Adapting Shinn et al. (NeurIPS 2023, arXiv:2303.11366) from short-horizon benchmarks to multi-month production lifecycles · 17pp task-success lift sustained over 12 months.

Madani Lab · adapter for Shinn et al. NeurIPS 2023 (arXiv:2303.11366)

reflexion · verbal-RL · continuous-improvement · cybernetic-loop · long-lived

2026-05-20

40 min read

WSB-12

Cache-Aware Loop Cadences: Prompt Cache TTL as a First-Class Workspace Decision Variable

The 270s vs 1200s decision · why 5-minute cache windows reshape every autonomous loop architecture · 87% cost reduction with zero accuracy impact.

Madani Lab · empirical study Anthropic prompt cache · 24 production loops · 6 months

prompt-caching · autonomous-loops · cost-optimization · TTL · cache-aware

2026-05-20

40 min read

WSB-13

Automated Retrieval Evaluation in Production Agentic Workspaces: Adapting RAGAS to Long-Lived Agents

From benchmark-time RAGAS to continuously-running retrieval QA · why every long-lived agent needs a CI for its memory · the recall drift axis Es et al. did not measure.

Madani Lab · adapter for Es, James, Espinosa-Anke, Schockaert EACL 2024 (arXiv:2309.15217)

RAGAS · retrieval · continuous-eval · automated-eval · production

2026-05-20

40 min read

WSB-14

Self-Paced Autonomous Research Loops: Composite 4-Axis Scoring and Adaptive Sleep Cadences for Strategic Knowledge Acquisition

Adapting Karpathy autoresearch from individual experimentation to durable workspace skill · 6 months of production runs · 7 counterintuitive findings on composite scoring and adaptive sleep.

Madani Lab · adapter for Karpathy autoresearch 2024 · 47 production projects · 6 months

autoresearch · self-paced · composite-scoring · autonomous-loops · git-backed

2026-05-20

40 min read

WSB-15

Governance as Code: Hard Rules, Compliance Gates, and Audit Trail in Agentic Workspace Architecture

How to encode "never do X" rules so that production agents respect them under all conditions including adversarial · 41,302 gate decisions · zero observed violations · 7 counterintuitive takes.

Madani Lab · Constitutional AI lineage · 6 months production · 41,302 decisions

governance · hard-rules · compliance-gates · audit-trail · prompt-injection

2026-05-20

40 min read

WSB-16

Credentials Hygiene at Scale: The op:// Vault Pattern for Zero-Plaintext Agentic Workspaces

23 services · zero secrets in repo · runtime resolution via 1Password CLI · 12 months production · 7 counterintuitive takes on credentials at scale.

Madani Lab · 23 services · 12 months production · zero plaintext incidents

credentials · op-uri · 1Password · vault · zero-plaintext

2026-05-20

40 min read

WSB-17

The Skill System Architecture: Modular Agentic Capabilities at Scale (42 Active Skills in Production)

Why "skills" beat "tools" and "abilities" as the unit of agentic capability composition · 42 active skills · power-law usage · 7 counterintuitive takes on skill design.

Madani Lab · 42 active skills · 12 months production · power-law usage

skill-system · modular-capabilities · hot-swap · composability · agent-architecture

2026-05-20

40 min read

WSB-18

Most Agents Have No Memory. The Ones That Do, Treat It as One Bucket. Five Tiers Separate a System from a Goldfish.

Why agent memory needs five separate tiers — semantic, episodic, procedural, personalized, environment-dynamics — and why collapsing them is the silent reason production agents fail at week three.

Madani Lab · iter-39 5-tier audit · 102 personalized files · 13 daily reflexions

memory-architecture · 5-tier · reflexion · voyager · CoALA

2026-05-23

40 min read

WSB-19

Diagnostic Excellence Without Apply Is Theater. A Five-Layer Decision Engine Lets an Agent Auto-Promote Workspace Changes Without Polling the Operator.

Curator, Dreams, Reflexion produced 50 proposals per run and applied zero. The five-layer engine — PP gates · alpha gates · LLM-behavior gates · snapshot · log — closes the gap by codifying when a machine decision is safer than a human one.

Madani Lab · iter-39 auto-promote rollout · 42 actions applied 24/05 · 196 corrections detected · 50 proposals/run

auto-promote · decision-engine · curator · dreams · reflexion

2026-05-24

35 min read

WSB-20

Exit 0 Is Not Working. A Self-Improving Harness Can Run Green Every Night and Learn Nothing. Here Is the Three-Layer Closure That Fixes It.

A 7-day audit of an agentic harness whose authentication, learning, and governance loops all measured themselves failing and changed nothing — and the keystone fix that turned monitoring theater into a closed cybernetic loop. Dreams EXTRACT 0→5 · review 0/5→5/5 · violations W22:178→W25:14.

Madani Lab · iter harness-health 2026-06-13→19 · auth keystone + reinforcement loop + governance radar

self-improving-agent · cybernetic-loop · reflexion · reinforcement-learning · governance

2026-06-19

32 min read

WSB-21

A Prompt Is a Wish; a Contract Is a Calculus. Every Constant in /goal(P) Is a Citation, Not a Stylistic Choice.

The full agentic operating contract — gates, gap metric, composed retrieval operator, convergence loop, cross-goal write-back — derived parameter-by-parameter from primary literature, in two forms, with a copy-and-run block. 32 citations.

Madani Lab · /goal(P) operating contract · 32 primary citations

agentic-contract · prompt-engineering · retrieval · RAG · hybrid-retrieval

2026-06-24

30 min read

§ 05 · contribute

Submit a workspace. Read the spec. Open a PR.

The benchmark is open · the methodology is open · the audit tooling is open. Three ways to engage.

01 · github

Read the spec

WAB v0.3.4 · 12-pillar architecture · L0–L4 maturity · audit matrix · all in the public repo.

ceomadani/workspace-agentic-benchmark ↗

02 · pull request

Submit a score

Audit your workspace against the 60-cell matrix · open a PR to /workspaces/{slug}.md · we review within 7 days.

open a PR ↗

03 · email

Collaborate

Paper drafts · cross-replication studies · enterprise procurement · MAST-style audits · we read everything.

lab@madani.agency

Madani Lab · WAB v0.3.4 · MIT License · open spec · open audit tooling