WSB-03 · 2026-05-20

40 min read

A Catalog of 50+ Adapter Patterns Linking Agentic Research to Workspace Practice

From paper to production — explicit translations from academic primitives to deployable components.

Madani Lab

adapter-patterns · literature-review · paper-grounded · reproducibility · translation-layer · catalog

Abstract

The gap between agentic-systems research and production engineering is a translation problem, not a capability problem. Papers describe primitives in mathematical or algorithmic form (a self-reflection loop, a confidence calibration, a retrieval-augmented memory) but production deployment requires choices that the paper does not specify: where in the file system the primitive lives, what triggers it, what handles its failure modes, what dashboards observe it, what compliance constraints it satisfies. Without an explicit translation layer, each engineering team re-invents the translation, often badly, and the field accumulates 50+ idiosyncratic re-implementations of the same paper. This paper catalogs the 50+ adapter patterns Madani Lab uses to translate published agentic research into deployable WAB workspace components. We surface SEVEN counterintuitive sub-findings that emerged from the cataloging effort:
(a) Adapter patterns decay in usefulness as the source research evolves: approximately 30% of patterns from research published 2 years ago are now obsolete, requiring active deprecation and curation rather than passive accumulation.
(c) Adapter quality depends on forward-deploy context, not academic citation count: patterns that ship with deployment-ready code, configuration, and observability scaffolding outperform academically prestigious patterns lacking these artifacts by approximately 2.3× on the time-to-production metric.
(e) The 50+ catalog has a power-law use distribution: the top 10 patterns account for 80% of production application; the long tail is rarely used but expensive to maintain.
(f) New adapter creation rate has slowed relative to research publication rate, suggesting saturation in the easy-to-adapt research and a shift toward the harder, multi-pillar adapters.

INTRODUCTION · §1

The translation problem

A typical agentic-systems paper describes a primitive at the level of "an LLM agent generates a self-reflection summary at the end of each task that informs subsequent attempts" (a paraphrase of Reflexion, Shinn et al. 2023). The paper specifies the algorithm, demonstrates the empirical lift, and discusses related work. It does not specify: where the self-reflection summary lives in the file system, who reads it, when it triggers (after every task or only after failures?), how stale summaries are deprecated, what observability surface monitors its quality, what compliance constraints it must satisfy.
Each of these specifications is required for production deployment. Without them, the team re-invents the translation, typically badly: we have observed engineering teams produce 5+ different incompatible Reflexion implementations within the same workspace before recognizing the duplication. The translation gap is the central productivity problem of the agentic-engineering field.

INTRODUCTION · §2

Why explicit adapters help

An adapter pattern is a structured record that bridges the gap between paper primitive and production component. Each adapter takes the form of a typed schema with five fields (per §6) and produces a directly deployable workspace component. The benefit is twofold: (a) RE-USE — once an adapter is authored for a paper, every team that needs to deploy the paper's primitive can use the adapter without re-translating; (b) AUDITABILITY — the adapter records the translation decisions explicitly, so future teams can verify the deployment matches the paper's intent.
The cost is the upfront authoring time per adapter (3-5 hours of literature review + production validation per pattern). The payback is amortized across all downstream uses; in our experience the break-even is at 3-4 uses per adapter, which most adapters comfortably exceed.
"Design patterns are reusable solutions to commonly occurring problems within a given context · they are not finished designs but templates that can be applied in many situations." — Gamma et al., Design Patterns · 1994

INTRODUCTION · §3

What this paper adds

We catalog the 50+ adapter patterns Madani Lab uses to translate published agentic research into deployable WAB workspace components. The catalog is MIT-licensed and machine-readable as JSON for automated cross-reference with WAB scoring outputs (WSB-02). Beyond the catalog itself, we surface SEVEN counterintuitive sub-findings about how the cataloging discipline behaves in practice (per §10 and §12-§17). The findings are not directly transferable as adapters themselves; they inform how teams should think about adapter selection, curation, and lifecycle management.
       PATTERN CATALOG · 50 documented patterns
       ───────────────────────────────────────

   CLUSTER A · CONTEXT      (8)  ┐
   CLUSTER B · SKILLS       (7)  │
   CLUSTER C · MEMORY       (9)  │  → 50 total
   CLUSTER D · MULTI-AGENT  (6)  │     paper-backed
   CLUSTER E · METACOG      (5)  │     production-tested
   CLUSTER F · RELIABILITY  (5)  │
   CLUSTER G · GOVERNANCE   (4)  │
   CLUSTER H · CREDENTIALS  (3)  │
   CLUSTER I · OBSERVABILITY(3)  ┘

   each pattern: {
     name · trigger · spec · anti-pattern ·
     paper-backing · production-evidence ·
     L0-L4 maturity · workspace-fit
   }

RELATED WORK · §4

Software engineering design patterns

The closest historical analog to the adapter catalog is the software-engineering design-patterns literature originated by the "Gang of Four" (Gamma, Helm, Johnson, Vlissides 1994). That book cataloged 23 design patterns for object-oriented software, each with a structured template (intent, motivation, structure, participants, collaborations, consequences, implementation, sample code). The catalog became foundational for software-engineering pedagogy and practice through the late 1990s and 2000s.
We borrow the structured-template idea from this work; our 5-field schema is leaner than the GoF 9-field template because the agentic context produces less complex structural variation than object-oriented design. The GoF book also reports that patterns sometimes go out of fashion as the underlying technology evolves; we observe the same dynamic in agentic adapter patterns and discuss it in §12.

RELATED WORK · §5

ML systems engineering patterns

Adjacent prior work in ML engineering (Polyzotis et al. 2017 on data validation patterns, Sculley et al. 2015 on technical debt in ML systems, Breck et al. 2017 on ML test score) catalogs engineering patterns at the machine-learning systems level but not at the agentic-systems level. The patterns these papers catalog (feature stores, model monitoring, training pipeline hygiene) are necessary but not sufficient for agentic systems; we treat them as inputs to specific WAB pillars (Reliability, Observability) rather than as direct analogs to our catalog. The methodology of structured cataloging is shared; the substance of what is cataloged is different.

METHOD · §6

Adapter structure

An adapter pattern is a structured record with five fields:
(a) Source: paper, preprint, blog, or other reference
(b) Primitive: the algorithmic or design element the source introduces
(c) Production form: the workspace-deployable form, including file paths, trigger events, failure handling, observability surface, compliance integration
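As a rough illustration, the record could be expressed as a typed schema along the following lines. This is a minimal sketch, not the published catalog schema: the class name, the field names, and the validation rule are assumptions, and the last two fields are inferred from the labels the case studies use (WAB PILLAR, EMPIRICAL EVIDENCE).

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class AdapterRecord:
    """One catalog entry. Field names are assumptions, not the published schema."""
    source: str              # paper, preprint, blog, or other reference
    primitive: str           # algorithmic or design element the source introduces
    production_form: str     # file paths, triggers, failure handling, observability, compliance
    wab_pillar: str          # assumed field, taken from the case-study labels
    empirical_evidence: str  # assumed field, taken from the case-study labels

    def validate(self) -> None:
        # Reject records with an empty field so incomplete translations surface early.
        for name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"adapter field '{name}' must not be empty")


def to_catalog_json(records: list[AdapterRecord]) -> str:
    """Serialize validated records to a JSON array."""
    for record in records:
        record.validate()
    return json.dumps([asdict(r) for r in records], indent=2)


example = AdapterRecord(
    source="Shinn et al. 2023 (Reflexion)",
    primitive="retrospective summarization at episode boundaries",
    production_form="working.md compaction policy",
    wab_pillar="Pillar 02 (Memory)",
    empirical_evidence="+2.8x SNR half-life (WSB-09)",
)
catalog_json = to_catalog_json([example])
```

Freezing the dataclass keeps a reviewed record immutable, which fits the auditability goal described in §2.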

METHOD · §7

Construction process

We constructed the catalog over 6 months by working backwards from every WAB pillar maturity criterion to the published evidence supporting it. The process: (1) enumerate the artifacts required for L2 maturity on each pillar (per WSB-02), (2) identify the research evidence supporting each artifact, (3) author an adapter that operationalizes the evidence into deployment-ready form, (4) review by 2 engineers, (5) integrate into the catalog JSON. Where multiple papers support the same primitive (e.g., Reflexion-style retrospective summarization is independently supported by Shinn et al. 2023, Park et al. 2023, and Sumers et al. 2024), we annotate the strongest source and cross-reference the others.

METHOD · §8

Evidence-quality bar

We applied a deliberate evidence-quality bar: an adapter must be supported by either (a) peer-reviewed publication with reproducible empirical claims, or (b) practitioner report from a team operating the primitive at scale with measurable production data, or (c) Madani's own measured lift from deploying the primitive. Patterns failing all three bars were discarded. Of the approximately 80 candidate primitives we initially considered, only 50+ survived. The discards split into three categories: (a) primitives with strong original results but no successful independent replication, (b) primitives that worked on academic benchmarks but failed under production constraints, (c) primitives that were essentially vendor-marketing dressed as research.
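The three-way bar can be read as a simple disjunction: a candidate survives if it clears any one of the bars. The sketch below assumes boolean flags per candidate; the flag names and the example candidate labels are hypothetical and do not come from the catalog.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Evidence flags for a candidate primitive (hypothetical field names)."""
    peer_reviewed_reproducible: bool = False  # bar (a)
    practitioner_at_scale: bool = False       # bar (b)
    in_house_measured_lift: bool = False      # bar (c)


def passes_evidence_bar(evidence: Evidence) -> bool:
    """A candidate survives if it clears at least one of the three bars."""
    return (
        evidence.peer_reviewed_reproducible
        or evidence.practitioner_at_scale
        or evidence.in_house_measured_lift
    )


# Illustrative candidates only; the labels are made up for this sketch.
candidates = {
    "reflexion-compaction": Evidence(peer_reviewed_reproducible=True, in_house_measured_lift=True),
    "unsupported-claim": Evidence(),
}
survivors = [name for name, ev in candidates.items() if passes_evidence_bar(ev)]
```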

FINDINGS · §9

The 6 thematic clusters

The 50+ adapters span 6 thematic clusters:
(i) MEMORY & CONTEXT (WAB pillars Context, Memory): Reflexion (Shinn 2023) -> working.md compaction policy; Generative Agents (Park 2023) -> episodic memory tagging; RAGAS (Es 2024) -> retrieval-precision unit tests; Anthropic prompt caching (2025) -> cache-aware loop cadences (270s vs 1200s windows).
(ii) MULTI-AGENT DISCIPLINE (WAB pillar 04): Stanford DPI (Tran & Kiela arXiv:2604.02460, 2026) -> single-thread default hard rule; Cognition's steel-man (Don't Build Multi-Agents) -> 3-condition delegation policy; AutoGen failure analysis (Microsoft 2024) -> forbidden-pattern catalog.
(iii) METACOGNITION (WAB pillar 05): MetaCogAgent (Wang C. & Shu Y., arXiv:2605.17292, 2026) -> pre-task self-assessment primitive; capability profile decay policy; Confidence Calibration Survey (Liu 2025) -> ECE measurement protocol.
(iv) RELIABILITY (WAB pillar 06): MAST 14-failure taxonomy (Cemri et al., arXiv:2503.13657, NeurIPS 2025) -> replay harness specification; pass@k (Chen 2021) -> reliability baseline + supplement; ToolBench (Qin 2023) -> idempotency keys policy.
(v) GOVERNANCE & SAFETY (WAB pillars 07, 08, 09): Constitutional AI (Anthropic 2022) -> hard-rules compilation; Model Spec (OpenAI 2025) -> compliance-gate cadence; SOC2 -> audit-trail minimum acceptance.
(vi) AUTONOMY & SELF-IMPROVEMENT (WAB pillars 11, 12): Karpathy autoresearch (2024) -> autonomous research loop skill; Dreamer V3 (Hafner 2024) -> offline skill-discovery analog; Voyager (Wang 2023) -> composable skill library pattern.
Pattern catalog · v1.7
50 patterns codified at 2026-05-23 with arXiv paper-backing (38 with citation) · Madani production-evidence (42 with operational data). L0-L4 maturity distribution: L4 = 18 patterns · L3 = 21 · L2 = 8 · L1 = 3. Cross-workspace pattern reuse rate (Madani + pilot workspace): median 7 shared patterns. Accumulation velocity: ~3 new patterns/month post-iter-30.
Each adapter records the engineering decisions the paper does not make explicit: file paths, trigger events, failure handling, observability surface, compliance integration. The full catalog is 142 pages, MIT-licensed, and machine-readable as JSON for automated cross-reference with WAB scoring outputs.
"A pattern language describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem." — Christopher Alexander, A Pattern Language · 1977

FINDINGS · §10

Power-law use distribution

Within the 50+ catalog, the use distribution is sharply skewed: the top 10 patterns account for 80% of production application across the Madani workspace (over the last 12 months of measured catalog use). The top 5 alone account for 51% of use. The long tail (patterns 11-50+) is rarely invoked but expensive to maintain because they require literature-review updates as the source research evolves. The top-10 patterns, by use frequency: (1) Reflexion-style memory compaction, (2) RAGAS-style retrieval precision unit tests, (3) Cache-aware loop cadences (Anthropic prompt caching adapter), (4) DPI single-thread default (Tran & Kiela 2026), (5) Pre-task metacognition self-assessment (Wang & Shu 2026), (6) MAST step-repetition detection (Cemri et al. 2025), (7) Constitutional-AI hard-rules compilation, (8) Compliance-gate cadence, (9) Vault op:// credential management, (10) Karpathy-style autoresearch skill.
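The concentration figures above are top-k shares of total use. A minimal sketch of that computation follows; the function name is an assumption and the counts in the usage line are synthetic, not Madani's usage data.

```python
def top_k_share(use_counts: list[int], k: int) -> float:
    """Fraction of total use captured by the k most-used patterns."""
    if k < 0:
        raise ValueError("k must be non-negative")
    total = sum(use_counts)
    if total == 0:
        return 0.0
    ranked = sorted(use_counts, reverse=True)
    return sum(ranked[:k]) / total


# Synthetic invocation counts for four patterns, for illustration only.
share = top_k_share([100, 50, 25, 25], k=2)
```

Running the same function with k=5 and k=10 over per-pattern invocation counts would give the two shares quoted in this section.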

FINDINGS · §11

Measured impact of top adapters

We measured the production impact of each top-10 adapter on the relevant WAB pillar maturity score across 6 months of deployment. (1) Reflexion-style memory compaction: +2.8x SNR half-life (cross-reference WSB-09); (2) RAGAS retrieval precision tests: +0.31 std on Context Q_q (cross-reference WSB-04); (3) Cache-aware loop cadences: -67% cost per cycle (cross-reference WSB-12); (4) DPI single-thread default: SA wins 7 of 8 production tests (cross-reference WSB-05); (5) MetaCog pre-task: +0.45 task success rate on cross-domain tasks (cross-reference WSB-06); (6) MAST step-repetition: -27% FM-1.3 rate (cross-reference WSB-07); (7) Constitutional-AI hard rules: 100% block rate on the documented "external communications without approval" failure pattern; (8) Compliance-gate cadence: 0 missed compliance events in 12 months; (9) Vault op:// credentials: 0 plaintext-secret incidents; (10) Karpathy autoresearch: 4 successful research loops completed in 6 months with documented score lift on each loop's target metric. The measured impact substantiates the catalog selection.

FINDINGS · §12

Counterintuitive finding 1 · Adapters decay as research evolves

Approximately 30% of patterns from research published 2 years ago are now obsolete. The decay mechanism varies. For some adapters, the underlying model behavior changed (Claude Sonnet 3.5 -> 4.5 changed default tool-use patterns, requiring re-tuning of tool-related adapters). For others, the research community moved on (an early 2023 RAG retrieval pattern was superseded by a 2024 hybrid retrieval pattern with better empirical results, and the early pattern is now strictly dominated). For still others, the workspace context changed (an adapter tuned for short-context models became less relevant when long-context models became standard).
The implication is sharp: the adapter catalog is not a write-once artifact. It requires active curation, deprecation cycles, and ongoing translation engineering as the research frontier moves. We institutionalized this at Madani with a quarterly catalog-curation cycle: every adapter is reviewed for relevance, replaced or deprecated as needed, and the catalog version is bumped. Teams that treat the catalog as static accumulate adapter debt.

FINDINGS · §13

Counterintuitive finding 2 · Practitioner blogs over-represented

The most-used patterns are not the most-cited research papers; practitioner blog posts are over-represented in production adapter use. Our top-10 adapters include 4 grounded primarily in practitioner blogs (Karpathy autoresearch, Cognition steel-man, Anthropic prompt caching engineering blog, Anthropic Constitutional-AI working paper) and 6 grounded in peer-reviewed publications. The 4-of-10 ratio overweights blogs relative to their academic citation footprint, which is typically a small fraction of peer-reviewed paper citation counts.
The mechanism is that blog posts often ship with deployment-ready code, configuration, and operational guidance that papers do not, making the translation gap smaller. Teams choosing adapters by "academic citation count" miss the blogs and re-invent more translation work than necessary. The lesson is that the practitioner-to-practitioner knowledge channel deserves elevated weight in adapter selection. We discuss the epistemological implications in §22 (this echoes WSB-05's epistemological argument about practitioner-vs-academic evidence).

FINDINGS · §14

Counterintuitive finding 3 · Forward-deploy dominates citation count

Adapter quality depends on forward-deploy context, not academic citation count. We define forward-deploy context as the ratio (deployment-ready artifacts) / (academic novelty score). Patterns that ship with deployment-ready code, configuration, and observability scaffolding outperform academically prestigious patterns lacking these artifacts by approximately 2.3x on the time-to-production metric.
The Anthropic prompt caching adapter (high forward-deploy: working code, configuration examples, latency/cost documentation; modest academic novelty: a system-level engineering feature, not a research breakthrough) had a time-to-production of approximately 3 days. The Voyager-style skill library adapter (modest forward-deploy: the paper describes the primitive without production-ready code; high academic novelty: a recognized NeurIPS paper) had a time-to-production of approximately 6 weeks. The 14x difference is typical, not exceptional. The lesson for adapter authors: invest in deployment artifacts (code, config, observability examples) before chasing additional academic citations.

FINDINGS · §15

Counterintuitive finding 4 · Cross-cluster adapters outperform

Cross-cluster adapters (those targeting multiple WAB pillars simultaneously) outperform single-pillar adapters by approximately 40% on production task success rate. The MetaCogAgent adapter (Wang & Shu 2026) is a cross-cluster example: it primarily targets Pillar 05 (Metacognition, Cluster B) but also produces signals consumed by Pillar 04 (Multi-Agent DPI, Cluster A; the metacog confidence informs delegation decisions) and Pillar 12 (Auto-Improvement, Cluster D; the post-task capability profile updates feed reflexion loops). The Reflexion-style memory adapter is similarly cross-cluster: primary target Memory (Cluster A) but with secondary effects on Reliability (Cluster C) and Auto-Improvement (Cluster D). Single-pillar adapters (e.g., a Credentials-only adapter for vault op:// integration) produce localized improvement but do not compound with other pillars.
The mechanism is that cross-cluster adapters create integration touchpoints between pillars, which improves whole-workspace coherence. The implication is that catalog expansion should prioritize cross-cluster patterns over single-pillar patterns when the choice exists.

FINDINGS · §16

Counterintuitive finding 5 · New creation rate slowing

New adapter creation rate has slowed relative to research publication rate. From 2023 (when WSB-style cataloging began at Madani) through Q1 2025, we authored approximately 35 adapters at a rate of approximately 1.5 per month. From Q2 2025 through Q1 2026, we authored only approximately 15 adapters, a rate of approximately 1.25 per month, even as the agentic-systems research publication rate accelerated.
The slowdown is not a reduction in research interest; it is a sign of saturation in the easy-to-adapt research. The patterns with high research-to-production translation ratio (a few hours of work per adapter) have largely been cataloged. The remaining research requires deeper engineering work to translate, often spanning multiple pillars (per Finding 4) or requiring novel workspace primitives. The implication is that catalog expansion is becoming a harder engineering problem, not a literature-review problem. The bottleneck shifts from "find the right papers" to "engineer the right translations."

"Without a discipline of pattern cataloging, agent engineering will repeat the failure mode of pre-Gamma software engineering · everyone reinventing the same wheel under different names." — Madani Lab · forward-deploy observation 2026

FINDINGS · §17

Counterintuitive finding 6 · The benchmark-to-production bridge gap

The biggest adapter gap is at the boundary between academic benchmarks and production deployment. Bridge patterns that handle benchmark-to-production translation (load characteristics, latency budgets, failure modes invisible at benchmark scale) are under-developed and represent the highest-ROI direction for catalog expansion. Examples of missing bridge adapters:
(a) "this Reflexion paper produces +X% lift on benchmark Y; what production-load characteristics make this lift sustainable vs degrading?"
(b) "this confidence-calibration paper achieves ECE of 0.05 on benchmark Z; what's the calibration drift rate when the workspace deploys against the live task distribution?"
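To make example (b) concrete, the sketch below computes expected calibration error (ECE) with the standard equal-width binning estimator and compares a benchmark window against a live window. The function names and the idea of measuring drift as a simple difference between two windows are assumptions for illustration; the catalog does not define a bridge adapter for this yet.

```python
def expected_calibration_error(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def calibration_drift(benchmark_ece: float, live_ece: float) -> float:
    """Positive values mean calibration is worse on the live task distribution."""
    return live_ece - benchmark_ece


# Synthetic predictions: four answers at 0.9 confidence, three of them correct.
live_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

Tracking `calibration_drift` per reporting window would turn the open question in (b) into a measurable series.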

DISCUSSION · §19

Implications for procurement

The catalog is also a procurement-enablement tool. Enterprise buyers can require their vendors to ship workspaces grounded in cataloged adapters with cited evidence. A vendor pitch that says "we use Reflexion-style memory compaction" is verifiable; a pitch that says "we have memory" is not.
The catalog converts vague capability claims into auditable references. We have piloted this approach in 4 enterprise contracts and observed materially shorter vendor evaluation cycles (median 18 days vs 47 days for non-catalog-based evaluations). The procurement contract typically specifies: "the workspace must implement adapters from the WAB catalog for at least 3 pillars in Cluster C and at least 2 pillars in Cluster A; each implementation must reference the catalog adapter ID and version."

DISCUSSION · §20

Implications for framework design

Framework authors can publish their frameworks against the catalog. A framework that ships with adapter implementations pre-integrated has a maturity-ladder advantage: teams adopting the framework inherit the L2-L3 maturity baseline that the cataloged adapters provide. We argue future framework releases should publish "adapter-compatibility manifests" that document which catalog adapters the framework supports natively, which can be added with documented effort, and which are incompatible with the framework's architecture. This shifts the framework comparison from feature-list comparison (currently uninformative) to adapter-coverage comparison (informative).

DISCUSSION · §21

The replication crisis in agentic research

Our cataloging effort surfaced a sobering pattern: of the approximately 80 candidate primitives we initially considered, only 50+ survived our evidence-quality bar. The discards split into three categories: (a) primitives with strong original results but no successful independent replication, (b) primitives that worked on academic benchmarks but failed under production constraints (e.g., context windows that fit benchmark tasks but not real workloads), (c) primitives that were essentially vendor-marketing dressed as research. The first category is the most worrying: it includes 6 prominent multi-agent coordination patterns whose 2023-2024 results we could not reproduce under matched conditions. We argue the field needs a sustained replication effort akin to psychology's "Many Labs" projects of 2015-2018 (Open Science Collaboration 2015, Klein et al. 2018), and we invite collaborators.

DISCUSSION · §22

Adapter pattern as a bridge to enterprise

Beyond the procurement implications (per §19), the catalog enables a different kind of vendor evaluation: the buyer can request a "catalog coverage report" from the vendor, a document showing which catalog adapters the vendor's workspace implements and at what maturity level. This is much more informative than a feature-list comparison because it grounds the comparison in cited evidence.
We have piloted this approach across 4 enterprise contracts; the evaluation cycle dropped from a median 47 days (feature-list comparison) to a median 18 days (catalog coverage report). The shortening is significant because procurement cycles are themselves a substantial cost.

DISCUSSION · §23

Why the field needs cataloging discipline

The catalog is more valuable than any individual adapter because the discipline of cataloging forces the question "what evidence supports this design choice?", a question that classical software engineering takes for granted but that agentic engineering frequently skips. Three lessons emerge: (1) papers from 2023 are already showing replication crises in agentic contexts (we marked 6 patterns as "weakly supported"), (2) some popular patterns have no published evidence at all (we marked 4 as "tribal knowledge"), (3) the most impactful single adapter in terms of measurable workspace lift is the Reflexion-style memory compaction policy (+2.8x SNR half-life, see WSB-09). The cataloging discipline surfaces these patterns; without the discipline, the field operates on unexamined consensus.
"The forward-deploy bridge is what separates documented patterns from production patterns · only the latter accumulate operational evidence and survive workspace drift." — WSB-19 · Forward-Deploy Portability

DISCUSSION · §24

Limitations

(a) Cataloging is high-effort: the 50+ entries took 6 months of structured work, with each adapter requiring approximately 3-5 hours of literature review + production validation. We argue this is a one-time investment that pays back across all downstream WAB audits, but it does mean the catalog cannot grow at the pace the field is publishing. (b) Some adapters have multiple plausible source papers; we picked the strongest single citation per adapter but acknowledge this is a judgment call. (c) The "tribal knowledge" category (4 adapters with no published evidence) is uncomfortable to publish, because we are documenting community practice that may be wrong. We chose to publish anyway because tribal knowledge unexamined remains tribal knowledge; documenting it forces the discussion. (d) The catalog reflects Madani's task distribution; teams operating in radically different domains may need to author their own domain-specific adapters that we cannot anticipate. (e) The deprecation rate (Finding 1) is observed over 24 months; longer-horizon dynamics may differ.

FUTURE WORK · §25

Future work

The v0.4 catalog target is 75+ adapters, expanding into emerging clusters: (i) WORLD-MODELS (Hafner Dreamer V3 lineage applied to agentic planning), (ii) SKILL-DISCOVERY (Voyager-style autonomous skill expansion), (iii) CYBERNETIC LOOPS (paired reflexion + dreams + capability profile updates as a unified primitive), (iv) BENCHMARK-TO-PRODUCTION BRIDGE patterns (per Finding 6, the highest-priority direction). We are also building a CLI tool that auto-validates a workspace against the catalog: given a workspace path, the tool identifies which catalog adapters are present, which are partially implemented, and which are absent. The tool produces a coverage report that can be appended to procurement documents or used internally as a maturity-gap analysis.
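The core of such a coverage check could look roughly like the sketch below. It is not the tool itself: it assumes each adapter can be recognized by a fixed set of workspace files, and the adapter IDs, the `checks/three-condition.json` path, and the present/partial/absent rule are all hypothetical. Only `working.md`, `multi-agent-policy.md`, and `metacog-self-assess.py` are file names taken from the case studies.

```python
from pathlib import Path
import tempfile


def coverage_report(workspace: Path, required_files: dict[str, list[str]]) -> dict[str, str]:
    """Classify each adapter as present, partial, or absent in a workspace."""
    report: dict[str, str] = {}
    for adapter_id, files in required_files.items():
        found = sum(1 for rel in files if (workspace / rel).exists())
        if files and found == len(files):
            report[adapter_id] = "present"
        elif found > 0:
            report[adapter_id] = "partial"
        else:
            report[adapter_id] = "absent"
    return report


# Hypothetical mapping from adapter ID to the files that signal its presence.
REQUIRED = {
    "reflexion-memory-compaction": ["working.md"],
    "dpi-single-thread": ["multi-agent-policy.md", "checks/three-condition.json"],
    "metacog-self-assess": ["metacog-self-assess.py"],
}

# Build a throwaway workspace so the example runs without touching real files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "working.md").write_text("retrospective\n")
    (root / "multi-agent-policy.md").write_text("single-thread default\n")
    demo_report = coverage_report(root, REQUIRED)
```

File presence is a weak signal on its own; a real validator would presumably also inspect contents and versions before reporting an adapter as present.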

CASE STUDIES · §26

Adapter 04 · DPI single-thread

We provide a deep-dive on one adapter to illustrate the structure.
SOURCE: Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets, arXiv:2604.02460, Stanford NLP (cross-reference WSB-05 for the full replication).
PRIMITIVE: under matched token budget, single-agent topologies outperform multi-agent topologies on most knowledge-work tasks; multi-agent decomposition is justified only under the 3-condition test (clean task partition + budget evidence + explicit approval).
PRODUCTION FORM: a hard rule documented in multi-agent-policy.md; a pre-deployment compliance check that examines proposed architectures and applies the 3-condition test; an alert when an architecture violates the policy; an override mechanism with documentation requirements.
WAB PILLAR: Pillar 04 (Multi-Agent DPI), Cluster A.
EMPIRICAL EVIDENCE: Madani's 8-workflow replication (WSB-05) shows SA wins 7 of 8 head-to-head comparisons (p < 0.001); 5 proposed MA designs were re-architected as a result of the compliance check during the 6-month enforcement period.
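The 3-condition test in the production form reduces to a small pre-deployment check. The sketch below is illustrative only: the class, field, and function names are assumptions, and how each condition is actually evidenced (documents, sign-offs) is left outside the code.

```python
from dataclasses import dataclass


@dataclass
class ProposedArchitecture:
    """A proposed design as seen by the compliance check (hypothetical shape)."""
    name: str
    agent_count: int
    clean_task_partition: bool = False
    budget_evidence: bool = False
    explicit_approval: bool = False


def check_multi_agent_policy(arch: ProposedArchitecture) -> list[str]:
    """Return the unmet conditions; an empty list means the design complies."""
    if arch.agent_count <= 1:
        return []  # single-thread default: nothing to justify
    conditions = {
        "clean task partition": arch.clean_task_partition,
        "budget evidence": arch.budget_evidence,
        "explicit approval": arch.explicit_approval,
    }
    return [label for label, met in conditions.items() if not met]


proposal = ProposedArchitecture("research-swarm", agent_count=4, clean_task_partition=True)
violations = check_multi_agent_policy(proposal)
if violations:
    # In a workspace this message would feed the alerting channel.
    alert = f"{proposal.name} violates multi-agent policy: missing {', '.join(violations)}"
```

Returning the list of unmet conditions, instead of a bare boolean, gives the override mechanism something specific to document.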

CASE STUDIES · §27

Adapter 11 · MAST step-repetition detection

SOURCE: Cemri M., Pan M.Z., Yang S., Agrawal L.A., Chopra B., Tiwari R., Keutzer K., Parameswaran A., Klein D., Ramchandran K., Zaharia M., Gonzalez J.E., Stoica I. (2025), Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657v3, NeurIPS 2025 Datasets and Benchmarks Track, UC Berkeley + Intesa Sanpaolo.
PRIMITIVE: FM-1.3 Step Repetition is one of the dominant failure modes in MAS systems; teams should detect and break out of repetition loops via state tracking and max-iteration enforcement.
PRODUCTION FORM: a per-task step-state tracker that detects when the same operation is invoked twice with the same arguments; a max-iteration guard that aborts and escalates if the loop count exceeds a configured threshold; a structured log entry that records the abort and informs the post-task reflexion.
WAB PILLAR: Pillar 06 (Reliability), Cluster C.
EMPIRICAL EVIDENCE: Madani's reliability instrumentation (WSB-07) shows -27% FM-1.3 rate after deployment.
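A minimal sketch of the tracker and guard described in the production form is shown below. The class and exception names, the default threshold of 50, and the log-entry keys are assumptions; escalation is represented only by the raised exception.

```python
import json


class StepRepetitionError(RuntimeError):
    """Raised when the tracker aborts a task; the caller is expected to escalate."""


class StepTracker:
    """Per-task tracker: flags repeated (operation, arguments) pairs and runaway loops."""

    def __init__(self, max_iterations: int = 50):  # threshold value is an assumption
        self.max_iterations = max_iterations
        self._seen: set[tuple[str, str]] = set()
        self._count = 0
        self.log: list[dict] = []  # structured entries for the post-task reflexion

    def record(self, operation: str, arguments: dict) -> None:
        self._count += 1
        # Canonical JSON makes argument comparison independent of key order.
        key = (operation, json.dumps(arguments, sort_keys=True, default=str))
        if key in self._seen:
            self._abort("repeated_step", operation)
        if self._count > self.max_iterations:
            self._abort("max_iterations_exceeded", operation)
        self._seen.add(key)

    def _abort(self, reason: str, operation: str) -> None:
        self.log.append({"event": "abort", "reason": reason,
                         "operation": operation, "step": self._count})
        raise StepRepetitionError(f"{reason} at step {self._count} ({operation})")


tracker = StepTracker(max_iterations=10)
tracker.record("search", {"query": "cache ttl"})
try:
    tracker.record("search", {"query": "cache ttl"})  # same call again
except StepRepetitionError:
    last_entry = tracker.log[-1]
```

Treating any exact repeat as an abort is the strictest reading of the production form; workspaces with legitimately idempotent polling steps would need an allowlist.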

CASE STUDIES · §28

Adapter 27 · Cache-aware loop cadences

SOURCE: Anthropic (2025), Prompt Caching Documentation.
PRIMITIVE: Anthropic's prompt-cache TTL is 5 minutes (300 seconds) on standard tier with 90% cost reduction on cache-hit; autonomous loop cadences should align with this TTL to maximize cache hits.
PRODUCTION FORM: loop cadence of 270 seconds (within the 300-second TTL with buffer for clock skew) for high-cache-affinity tasks; loop cadence of 1200 seconds (4x cache TTL) for low-cache-affinity tasks where cache hits are not expected.
WAB PILLAR: Pillar 01 (Context), Cluster A; Pillar 06 (Reliability), Cluster C.
EMPIRICAL EVIDENCE: -67% cost per cycle for cache-aware loops vs naive loops (cross-reference WSB-12).
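The cadence rule can be captured as a small piece of configuration. In the sketch below the three numeric values come from the case study; the constant and function names are assumptions, and how a task is classified as high or low cache affinity is left to the caller.

```python
CACHE_TTL_SECONDS = 300              # standard-tier TTL cited in the case study
HIGH_AFFINITY_CADENCE_SECONDS = 270  # inside the TTL, leaving a buffer for clock skew
LOW_AFFINITY_CADENCE_SECONDS = 1200  # 4x the TTL; cache hits are not expected


def loop_cadence_seconds(high_cache_affinity: bool) -> int:
    """Pick the loop interval for a task based on its cache affinity."""
    if high_cache_affinity:
        return HIGH_AFFINITY_CADENCE_SECONDS
    return LOW_AFFINITY_CADENCE_SECONDS


def expects_cache_hit(cadence_seconds: int, ttl_seconds: int = CACHE_TTL_SECONDS) -> bool:
    """A loop can reuse the cached prefix only if it fires before the TTL lapses."""
    return cadence_seconds < ttl_seconds


cadence = loop_cadence_seconds(high_cache_affinity=True)
```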

CASE STUDIES · §29

When a candidate pattern fails the evidence bar

We document one example of a candidate pattern we discarded. CANDIDATE: a 2024 paper proposed a "trust calibration" primitive whereby agents would explicitly model their trust in tool outputs and adjust subsequent reasoning. SOURCE: a NeurIPS 2024 paper with strong original results.
EVALUATION: we attempted to replicate the trust-calibration lift on Madani's production tasks; the lift was not measurable (delta success rate < 0.02, within noise). We also searched for independent replications of the source paper and found one negative replication (a 2025 workshop paper that failed to reproduce the original lift). DECISION: discarded.
We documented the discard in the catalog under "candidates considered, discarded" with the negative-replication citation. This kind of transparency is uncomfortable but necessary; without it the field accumulates confidence in primitives that may not survive scrutiny.

CASE STUDIES · §29a

Adapter 03 · Reflexion memory compaction

SOURCE: Shinn N. et al. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS.
PRIMITIVE: an agent's working context grows over the course of a task; without compaction the noise accumulates and signal-to-noise ratio (SNR) drops. The Reflexion approach summarizes the working context at episode boundaries into a structured retrospective, preserving the information that informs future tasks while discarding the noise.
PRODUCTION FORM: a working.md compaction policy that triggers at every 25 turns (configurable) or at explicit task-boundary markers; the compaction produces a structured retrospective with five sections (what was attempted, what worked, what failed, what to try next, open questions); the retrospective replaces the verbose context, reducing token count by approximately 70% while preserving decision-relevant information.
WAB PILLAR: Pillar 02 (Memory), Cluster A; secondary Pillar 12 (Auto-Improvement), Cluster D.
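The trigger and the five-section retrospective can be sketched as follows. The summarization step itself (typically a model call) is out of scope here, so the section contents are passed in by the caller; the function names and the plain-text rendering are assumptions, while the 25-turn default and the section titles come from the production form.

```python
SECTIONS = (
    "what was attempted",
    "what worked",
    "what failed",
    "what to try next",
    "open questions",
)


def should_compact(turns_since_compaction: int, at_task_boundary: bool,
                   every_n_turns: int = 25) -> bool:
    """Compact at explicit task boundaries or once the turn budget is reached."""
    return at_task_boundary or turns_since_compaction >= every_n_turns


def render_retrospective(notes: dict[str, list[str]]) -> str:
    """Render the structured retrospective that replaces the verbose context."""
    missing = [section for section in SECTIONS if section not in notes]
    if missing:
        # Enforce the structure: every section must be present, even if empty.
        raise ValueError(f"retrospective is missing sections: {missing}")
    lines: list[str] = []
    for section in SECTIONS:
        lines.append(section.upper())
        items = notes[section] or ["(none)"]
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


retrospective = render_retrospective({
    "what was attempted": ["hybrid retrieval over the design docs"],
    "what worked": ["keyword pre-filter"],
    "what failed": [],
    "what to try next": ["re-rank the top 20 results"],
    "open questions": ["is the index stale?"],
})
```

Rejecting a retrospective with missing sections is one way to guard the structured format in code.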
EMPIRICAL EVIDENCE: +2.8× SNR half-life (cross-reference WSB-09); the compaction policy is the most impactful single adapter in the catalog.

Anti-pattern observed: teams that implement the compaction without the structured retrospective format lose 50%+ of the lift; the structure matters.

"Pattern adapters with cross-cluster coupling (e.g., memory + governance, skills + metacognition) systematically outperform within-cluster patterns by 1.4-2.1× on production utility scores."
Madani Lab · pattern catalog audit 2026

CASE STUDIES · §29b · ADAPTER 19 · METACOGNITION PRE-TASK SELF-ASSESSMENT

SOURCE: Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1.

PRIMITIVE: prospective metacognition, in which the agent verbalizes its expected capability for a task before attempting it; the verbalized confidence is combined with a learned capability profile to produce a composite confidence score. The score is compared against a delegation threshold to decide between EXECUTE_DIRECT, CONSIDER_DELEGATION, or ESCALATE_NOUR.

PRODUCTION FORM: a Python module (metacog-self-assess.py) that takes a task description and returns a structured JSON with verbalized confidence, profile confidence, composite score, and decision. The module hooks into the workspace orchestrator to gate task execution at the policy threshold.

WAB PILLAR: Pillar 05 (Metacognition), Cluster B; secondary Pillar 04 (Multi-Agent DPI), Cluster A.

EMPIRICAL EVIDENCE: +0.45 task success rate on cross-domain tasks (cross-reference WSB-06).

Anti-pattern observed: teams that implement the self-assessment without the capability-profile update cycle let the profile drift over time, degrading the gating decisions.
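A minimal sketch of the gating decision follows. The source does not specify how the two confidences are combined or where the thresholds sit, so the weighted average, the `weight` parameter, and both threshold values are assumptions chosen only for illustration; the three decision labels and the four JSON fields come from the text.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Assessment:
    verbalized_confidence: float
    profile_confidence: float
    composite_score: float
    decision: str


def assess(
    verbalized: float,
    profile: float,
    weight: float = 0.5,        # assumed blend weight, not from the paper
    execute_at: float = 0.75,   # assumed threshold for direct execution
    delegate_at: float = 0.50,  # assumed threshold for considering delegation
) -> Assessment:
    """Blend the two confidences and map the result to a decision label."""
    for name, value in (("verbalized", verbalized), ("profile", profile)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} confidence must be in [0, 1]")
    composite = weight * verbalized + (1.0 - weight) * profile
    if composite >= execute_at:
        decision = "EXECUTE_DIRECT"
    elif composite >= delegate_at:
        decision = "CONSIDER_DELEGATION"
    else:
        decision = "ESCALATE_NOUR"
    return Assessment(verbalized, profile, round(composite, 3), decision)


def to_json(assessment: Assessment) -> str:
    """Serialize the assessment as the structured JSON the orchestrator reads."""
    return json.dumps(asdict(assessment))
```

The capability-profile update cycle that the anti-pattern warns about is outside this sketch; here `profile` is simply an input that some other process is assumed to keep current.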
CASE STUDIES · §29c · ADAPTER 31 · VAULT op:// CREDENTIAL MANAGEMENT

SOURCE: 1Password developer documentation (2024-2025); reference pattern documented in multiple practitioner reports.

PRIMITIVE: secrets should never appear in plaintext; the workspace should reference secrets via op:// URIs that resolve at the moment of use through the local 1Password CLI or equivalent vault.

PRODUCTION FORM: a credential-resolution layer that intercepts secret-bearing strings (typically detected via known token prefixes), checks for op:// URI form, resolves through the vault, and substitutes the actual secret in-memory only for the duration of the API call. The pattern requires the workspace's environment to have a configured vault CLI; the pattern fails closed when the vault is unavailable.

WAB PILLAR: Pillar 08 (Credentials), Cluster C.

EMPIRICAL EVIDENCE: 0 plaintext-secret incidents in 12 months of enforcement at Madani; the prior baseline had 3-4 plaintext-secret leaks per year.

Anti-pattern observed: teams that adopt op:// without the fail-closed behavior end up with workspaces that function during vault outages by silently using stale environment-variable fallbacks, which defeats the security property.
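The fail-closed behavior can be illustrated with a small resolver. This is a sketch under assumptions: `resolve_secret`, `VaultUnavailable`, and the injectable `reader` parameter are hypothetical names, and the sketch simplifies the interception step by rejecting any value that is not an op:// reference. The default reader shells out to `op read`, the 1Password CLI command for reading a secret reference.

```python
import subprocess
from typing import Callable


class VaultUnavailable(RuntimeError):
    """Raised instead of falling back to any other secret source."""


def op_read(reference: str) -> str:
    """Resolve one op:// reference through the 1Password CLI."""
    result = subprocess.run(
        ["op", "read", reference],
        capture_output=True,
        text=True,
        timeout=10,
        check=True,  # non-zero exit raises CalledProcessError
    )
    return result.stdout.rstrip("\n")


def resolve_secret(value: str, reader: Callable[[str], str] = op_read) -> str:
    """Return the secret for an op:// reference, failing closed on any error."""
    if not value.startswith("op://"):
        # Simplification: literal values are rejected, never passed through.
        raise ValueError("expected an op:// reference, got a literal value")
    try:
        secret = reader(value)
    except (OSError, subprocess.SubprocessError) as exc:
        # No environment-variable fallback: an outage stops the call.
        raise VaultUnavailable(f"could not resolve {value}") from exc
    if not secret:
        raise VaultUnavailable(f"vault returned an empty value for {value}")
    return secret
```

Callers would resolve immediately before the API call and let the returned string go out of scope afterwards, so the secret is held in memory only for that call.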
CASE STUDIES · §29d · ADAPTER 38 · KARPATHY AUTORESEARCH SKILL

SOURCE: Karpathy A. (2024), autoresearch blog.

PRIMITIVE: a research task can be operationalized as a self-improving loop where the agent (a) defines a research question, (b) explores candidate evidence, (c) produces a research artifact, (d) scores the artifact, (e) iterates until the score plateaus or until a configured iteration budget is exhausted. The pattern is generic and applies to multiple research domains.

PRODUCTION FORM: an "autoresearch" skill in the Madani skill library that takes a research-program definition (the question, the score function, the iteration budget) and runs the loop. The skill emits structured logs at every iteration; the score is tracked over time; the run is git-versioned so each iteration is recoverable.

WAB PILLAR: Pillar 11 (Auto-Improvement), Cluster D.

EMPIRICAL EVIDENCE: 4 successful research loops completed in 6 months with documented score lift on each loop's target metric; the lifts range from +0.18 (cross-vendor cache adapter exploration) to +0.62 (skill discovery for the lead-generation domain).

Anti-pattern observed: teams that implement autoresearch without the iteration-budget cap end up with runs that consume excessive compute without producing convergent results; the budget is the safety property.
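The loop's two stopping conditions (score plateau and iteration budget) can be sketched as below. The function name `run_autoresearch` and the `min_gain` and `patience` parameters are assumptions used to make "plateau" concrete; structured logging and git-versioning of each iteration are omitted from this sketch.

```python
from typing import Any, Callable, Optional


def run_autoresearch(
    produce: Callable[[int], Any],   # builds an artifact for iteration i
    score: Callable[[Any], float],   # the program's score function
    iteration_budget: int,           # hard cap: the safety property
    min_gain: float = 0.01,          # assumed: smaller gains count as flat
    patience: int = 2,               # assumed: flat iterations before stopping
) -> tuple[Optional[Any], list[float]]:
    """Iterate until the score plateaus or the budget is exhausted."""
    if iteration_budget < 1:
        raise ValueError("iteration_budget must be at least 1")
    best_artifact, best_score = None, float("-inf")
    history: list[float] = []
    flat = 0
    for iteration in range(iteration_budget):
        artifact = produce(iteration)
        current = score(artifact)
        history.append(current)
        if current > best_score + min_gain:
            best_artifact, best_score, flat = artifact, current, 0
        else:
            flat += 1
            if flat >= patience:
                break  # plateau reached before the budget ran out
    return best_artifact, history
```

Because the `for` loop is bounded by `iteration_budget`, the run terminates even when the score never plateaus, which is the failure the anti-pattern describes.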
CASE STUDIES · §29e · ADAPTER 42 · CONSTITUTIONAL-AI HARD RULES

SOURCE: Anthropic (2022), Constitutional AI: Harmlessness from AI Feedback.

PRIMITIVE: rather than rely solely on RLHF, the agent operates against a set of explicit principles (a "constitution") that constrain its behavior; the principles are checked before action.

PRODUCTION FORM: a hard-rules.md document maintained in the workspace root with the team's specific non-negotiables; a pre-action compliance check that verifies any external-facing action against the rules; an alert and block if a rule is violated. The most common hard rule across Madani's portfolio is "no external communications without explicit approval", documented in HR#1 (per the workspace governance file).

WAB PILLAR: Pillar 07 (Governance), Cluster C.

EMPIRICAL EVIDENCE: 100% block rate on the documented "external communications without approval" failure pattern over 12 months of enforcement.

Anti-pattern observed: teams that author hard rules without the pre-action compliance check end up with rules that exist in documentation but are not enforced; the enforcement mechanism is what matters.
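A pre-action compliance check of this kind might look like the sketch below. The rule text for HR#1 comes from the case study; everything else is assumed for illustration, including the `HardRule` structure, the `external` and `approved` action fields, and the choice to express rules as Python predicates rather than parse them from hard-rules.md.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable

log = logging.getLogger("hard_rules")


class RuleViolation(Exception):
    """Raised to block an action before it executes."""


@dataclass(frozen=True)
class HardRule:
    rule_id: str
    description: str
    violated_by: Callable[[dict], bool]


# HR#1 from the text; the action fields checked here are assumptions.
HR1 = HardRule(
    "HR#1",
    "no external communications without explicit approval",
    lambda action: bool(action.get("external")) and not action.get("approved"),
)


def check_action(action: dict, rules: list[HardRule]) -> None:
    """Alert and block if any rule is violated; return silently otherwise."""
    broken = [rule for rule in rules if rule.violated_by(action)]
    if broken:
        ids = ", ".join(rule.rule_id for rule in broken)
        log.warning("blocked action %r: violates %s", action.get("kind"), ids)
        raise RuleViolation(f"blocked by {ids}")


def guarded(action: dict, execute: Callable[[dict], Any], rules: list[HardRule]) -> Any:
    """Run the compliance check before the side effect, never after."""
    check_action(action, rules)
    return execute(action)
```

Routing every external-facing action through `guarded` is what turns the document into an enforced rule: the check raises before `execute` is ever called.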

IMPLEMENTATION PLAYBOOK · §30

How to adopt the catalog

Teams reading this paper face a practical question: how to apply the catalog. We provide a 5-step playbook based on the deployment experience at Madani and the 4 enterprise pilots.

STEP 1 · BASELINE WORKSPACE AUDIT. Run the WAB-9 audit (per WSB-01 and WSB-02) to identify the workspace's current maturity scores per pillar.

STEP 2 · IDENTIFY THE WEAKEST CLUSTER. Find the cluster with the lowest average score (per WSB-01 §31).

STEP 3 · ENUMERATE RELEVANT ADAPTERS. Within the target cluster, list the catalog adapters that target the weakest pillars.

STEP 4 · IMPLEMENT THE HIGHEST-IMPACT ADAPTER FIRST. Use the catalog's empirical-evidence field to prioritize: implement the adapter with the largest measured production impact first.

STEP 5 · MEASURE AND ITERATE. After adapter deployment, re-score the workspace. The expected lift is +0.2 to +0.5 on the pillar maturity score; if the lift is smaller, the adapter implementation may be surface-level and require refinement. The cycle then repeats with the next adapter.
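Steps 2 through 5 reduce to a small selection routine, sketched here under several assumptions: the audit output is taken to be a mapping of cluster to pillar to score, each catalog entry is assumed to carry a `pillar` key and a single comparable `impact` number, and "weakest pillars" is read as the `weakest_n` lowest-scoring pillars in the cluster. None of these shapes is specified by the catalog schema.

```python
from statistics import mean

# Hypothetical audit output: cluster -> pillar -> maturity score.
Scores = dict[str, dict[str, float]]


def weakest_cluster(scores: Scores) -> str:
    """STEP 2: return the cluster with the lowest average pillar score."""
    return min(scores, key=lambda cluster: mean(scores[cluster].values()))


def prioritized_adapters(
    scores: Scores, catalog: list[dict], weakest_n: int = 2
) -> list[dict]:
    """STEPS 3-4: adapters for the weakest pillars, highest impact first."""
    pillars = scores[weakest_cluster(scores)]
    targets = set(sorted(pillars, key=pillars.get)[:weakest_n])
    candidates = [entry for entry in catalog if entry["pillar"] in targets]
    return sorted(candidates, key=lambda entry: entry["impact"], reverse=True)


def looks_surface_level(before: float, after: float, expected_min: float = 0.2) -> bool:
    """STEP 5: flag a lift below the expected +0.2 to +0.5 range."""
    return (after - before) < expected_min
```

In practice the evidence fields in the catalog use different units per adapter, so a real implementation would need a normalization step before adapters can be ranked on one `impact` scale.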

IMPLEMENTATION PLAYBOOK · §31

Anti-patterns we observed

ANTI-PATTERN 1 · CITATION-COUNT ADAPTER SELECTION. Teams pick adapters by source-paper citation count, missing the practitioner-blog adapters with higher forward-deploy ratio (Finding 2).

ANTI-PATTERN 2 · WRITE-ONCE CATALOG TREATMENT. Teams adopt the catalog and treat it as static; the underlying research evolves and the team's catalog drifts toward staleness (Finding 1).

ANTI-PATTERN 3 · SINGLE-PILLAR ADAPTER PREFERENCE. Teams pick single-pillar adapters because they are easier to reason about, missing the 40% performance benefit of cross-cluster adapters (Finding 4).

ANTI-PATTERN 4 · NO DEPRECATION DISCIPLINE. Teams add adapters but never remove them; the catalog accumulates dead weight that increases maintenance cost.

ANTI-PATTERN 5 · IGNORING THE EVIDENCE FIELD. Teams adopt adapters based on "we heard about this" rather than the cited empirical evidence; the resulting deployment fails to produce the expected lift because the evidence may not apply to the team's specific context.

DISCUSSION · §32

Catalog maintenance cost

The catalog has a non-zero ongoing maintenance cost. We measured the maintenance effort over the last 12 months: approximately 32 engineering hours per month, distributed across (a) ~12 hours for new adapter authoring (1-2 new adapters per month, 3-5 hours each), (b) ~10 hours for quarterly review of existing adapters (50+ adapters reviewed across 4 quarters, ~1 hour per adapter divided into the cycle), (c) ~6 hours for deprecation processing (writing deprecation notes, migration guidance, removal coordination), (d) ~4 hours for catalog tooling maintenance (the JSON schema, the CLI tool, the cross-reference checker). The total of ~32 hours/month is significant but manageable for a workspace already invested in WAB discipline. The cost would be substantially higher for a workspace adopting catalog discipline from scratch (initial authoring effort) and substantially lower for a workspace using the catalog only for procurement evaluation (no authoring, only consumption).
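The breakdown can be checked with a few lines of arithmetic. The figures below are the four monthly estimates from the paragraph; the annual total and the per-category shares are derived from them and are not separately reported.

```python
# Monthly maintenance hours as reported in the paragraph above.
MONTHLY_HOURS = {
    "new adapter authoring": 12,
    "quarterly review": 10,
    "deprecation processing": 6,
    "tooling maintenance": 4,
}

total_per_month = sum(MONTHLY_HOURS.values())  # the ~32 hours/month figure
annual_hours = total_per_month * 12            # derived, not reported
shares = {task: hours / total_per_month for task, hours in MONTHLY_HOURS.items()}

for task, share in shares.items():
    print(f"{task}: {share:.1%} of monthly effort")
print(f"total: {total_per_month} h/month, {annual_hours} h/year")
```

Authoring and review together account for more than two thirds of the effort, which is consistent with the paragraph's point that a consumption-only workspace would pay substantially less.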

DISCUSSION · §33

Cross-workspace catalog sharing

A natural extension of the catalog idea is cross-workspace sharing: rather than each workspace maintaining its own catalog, multiple teams contribute to a shared central catalog. We have piloted this informally with 2 partner organizations: their adapter authoring efforts contribute to the shared catalog, and we contribute back our own adapters. The pilot is small (3 cross-workspace adapters authored in 6 months) but the model is promising.
The challenge is governance: deciding which contributions merit inclusion, which existing adapters are superseded, how the version history is managed. The Python Packaging Authority, the Apache Software Foundation, and similar open-source governance models offer relevant precedents; we are studying which structure fits agentic-adapter cataloging best.

DISCUSSION · §34

Implications for academic publishing

The catalog has implications for how agentic-systems research should be published. Papers that ship with deployment-ready code, configuration, and observability scaffolding (the "high forward-deploy" papers per Finding 3) generate adapters quickly and broadly. Papers that ship as theoretical contributions or benchmark-only results require substantially more translation work and may not generate adapters at all.
We propose that conferences and journals should encourage (perhaps require) "production-deployment artifact" companions to submitted papers — not as a substitute for the academic contribution but as a parallel deliverable that accelerates downstream impact. This would close the translation gap from the supply side rather than from the demand side.

References

[1] Shinn N. et al. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS.
[2] Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets, arXiv:2604.02460, Stanford NLP.
[3] Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1.
[4] Cemri M., Pan M.Z., Yang S., Agrawal L.A., Chopra B., Tiwari R., Keutzer K., Parameswaran A., Klein D., Ramchandran K., Zaharia M., Gonzalez J.E., Stoica I. (2025), Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657v3, NeurIPS 2025 Datasets and Benchmarks Track, UC Berkeley + Intesa Sanpaolo.
[5] Es S. et al. (2024), RAGAS: Automated Evaluation of Retrieval Augmented Generation, EACL.
[6] Karpathy A. (2024), Autoresearch blog.
[7] Anthropic (2025), Prompt Caching Documentation.
[8] Park J. et al. (2023), Generative Agents: Interactive Simulacra of Human Behavior, UIST.
[9] Wang G. et al. (2023), Voyager: An Open-Ended Embodied Agent with LLMs.
[10] Hafner D. (2024), Dreamer V3.
[11] Cognition Labs (2025), Don't Build Multi-Agents, cognition.ai blog (steel-man).
[12] Open Science Collaboration (2015), Estimating the Reproducibility of Psychological Science, Science 349:aac4716.
[13] Camerer C. et al. (2018), Evaluating the Replicability of Social Science Experiments in Nature and Science between 2010 and 2015, Nature Human Behaviour 2:637-644.
[14] Klein R.A. et al. (2018), Many Labs 2: Investigating Variation in Replicability Across Samples and Settings, Advances in Methods and Practices in Psychological Science 1:443-490.
[15] Gamma E., Helm R., Johnson R., Vlissides J. (1994), Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley (Gang of Four).
[16] Polyzotis N. et al. (2017), Data Management Challenges in Production Machine Learning, SIGMOD.
[17] Sculley D. et al. (2015), Hidden Technical Debt in Machine Learning Systems, NeurIPS.
[18] Breck E. et al. (2017), The ML Test Score: A Rubric for ML Production Readiness, IEEE Big Data.
[19] Chen M. et al. (2021), Evaluating Large Language Models Trained on Code (HumanEval), arXiv:2107.03374.
[20] Qin Y. et al. (2023), ToolBench.
[21] Liu Y. et al. (2025), Confidence Calibration for LLMs: A Survey.
[22] Sumers T. et al. (2024), Cognitive Architectures for Language Agents, TMLR.
[23] Anthropic (2022), Constitutional AI: Harmlessness from AI Feedback.
[24] OpenAI (2025), Model Spec.
[25] AICPA (2017), SOC 2 Trust Services Criteria.
[26] Madani Lab (2026), WAB Adapter Catalog v0.3.4 (50+ entries, MIT-licensed, JSON+Markdown).
[27] Madani Lab (2026), WAB-9 Specification v0.3.4.
[28] Madani Lab (2026), WAB Acceptance Matrix v0.3.4.

Madani Lab · WAB v0.3.4