Diagnostic Excellence Without Apply Is Theater. A Five-Layer Decision Engine Lets an Agent Auto-Promote Workspace Changes Without Polling the Operator.

Curator, Dreams, and Reflexion produced 50 proposals per run and applied zero. The five-layer engine (PP gates · alpha gates · LLM-behavior gates · snapshot-and-log · hard escalation) closes the gap by codifying when a machine decision is safer than a human one.

Madani Lab · iter-39 auto-promote rollout · 42 actions applied 24/05 · 196 corrections detected · 50 proposals/run

auto-promote · decision-engine · curator · dreams · reflexion · first-principles · alpha-extraction · cybernetic-loop · human-in-the-loop

Abstract

We report the production rollout of an auto-promote decision engine that lets a long-lived workspace agent apply curation proposals, dream-cycle memory candidates, and reflexion-derived lesson updates to itself without polling the human operator for each change. The engine codifies five layers — three first-principles (PP) gates, six alpha-extraction (α) gates, an LLM-behavior awareness gate, a deterministic decision tree with snapshot-and-log, and a fixed list of hard escalation rules that always require a human — and replaces the diagnostic-excellence / zero-apply failure mode observed in three independent monitoring subsystems (Curator skills audit, Dreams memory promotion, Reflexion lessons aggregation) during the pre-iter-39 window. The engine went live on 2026-05-24 across all three subsystems.
On the rollout day, the Curator produced 50 proposals per run and the engine auto-applied 42 of them without operator intervention while routing the remainder to either rejection or escalation; the Reflexion overlay flagged 196 lesson-violation candidates across the prior seven days and auto-promoted those that crossed the PP+α threshold into `lessons-pending/`; the Dreams runner emitted 7 memory candidates, of which the safe-append subset was applied directly and the personalized subset was escalated. The headline empirical claim is that monitoring without apply is theater: a system that surfaces fifty actionable proposals per day and applies none of them is structurally indistinguishable from a system that surfaces zero proposals. The decision engine is the missing affordance that converts the three subsystems from descriptive dashboards into a closed cybernetic loop.
We advance SEVEN counterintuitive findings grounded in the rollout. (a) THE OPERATOR IS THE BOTTLENECK, NOT THE LLM — pre-engine, Curator surfaced ~50 proposals/run but applied 0 because each proposal individually required Nour approval; the queue grew faster than the operator could review it; auto-promote moves the bottleneck from human attention to PP+α gates and the queue clears. (b) FIRST PRINCIPLES ARE STRONGER PRECONDITIONS THAN CRITERIA — the three PP (SELF-AWARENESS pre-action, EVIDENCE=CLAIM, LISTENING DISCIPLINE) are codified in `12_HARNESS/operativo/lessons-learned.md` v2.0 as necessary conditions that every auto-apply must pass; they are stronger than scoring criteria because they cannot be traded off against each other — a proposal that fails PP1 is rejected regardless of how high it scores on α. (c) ALPHA EXTRACTION IS A MULTI-DIMENSIONAL TEST, NOT A SCORE — the six α criteria (non-fragility, cache-friendliness, cost-aware, single-source-truth, provenance traceable, reversibility) are AND-gates in the engine; a proposal that scores high on five and zero on one is rejected, because the failure mode of the failing axis dominates the win on the others (e.g. an irreversible apply with high single-source-truth is worse than a reversible apply with mediocre single-source-truth). (d) LLM-BEHAVIOR AWARENESS IS THE LAYER MOST FRAMEWORKS SKIP — prompt-injection inside read files, concept drift between similar proposals, changelog bloat that costs cache hits, ambiguous trigger phrases that match multiple skills, multiple-voice contamination across personalized writes; the engine encodes a check for each, and ~18% of pre-engine candidates would have introduced at least one of these failure modes if applied blindly. 
(e) HARD ESCALATION IS WHAT MAKES AUTO-PROMOTE SAFE — six categories are always-human (external communications via HR#1, `settings.json` self-modification, destructive git operations, architectural decisions, hard-rule codification, credential touches); the engine is boring on the inside of these boundaries and silent outside, which is exactly the inversion that operator-approval-on-everything got backwards. (f) SNAPSHOT-AND-LOG IS THE COST OF REVERSIBILITY — every auto-apply writes a pre-mutation tar.gz to `99_ARCHIVIO/skill-snapshots/<timestamp>-<action>.tar.gz` plus a structured log line; the storage cost is negligible (KBs per action) and the rollback cost is one `tar -xzf` away; without this, α6 reversibility is unenforceable and the entire engine is unsafe. (g) THE COUNTER-INTUITIVE WIN IS THAT AUTO-APPLY IMPROVES OPERATOR TRUST — pre-engine, the operator had to review fifty proposals per day and rejected most of them as noise; the queue was an attention tax; post-engine, the operator sees only the escalations that genuinely need human judgment, and the residual escalation rate is ~15% of total proposals; the operator now reviews fewer items, each with higher signal, and trusts the system more because the engine's apply decisions are auditable end-to-end via the snapshot+log trail. The contribution is empirical (the 24-05 rollout numbers across three subsystems · 42/50 applied · 196 corrections detected · 7 dream candidates), architectural (the five-layer engine as the missing affordance between monitoring and apply), operational (the snapshot+log convention that makes auto-promote reversible-by-default), and methodological (the PP+α+LLM-behavior gate stack as a transferable pattern other workspaces can adopt without our specific subsystem code).

INTRODUCTION · §1

The diagnostic-excellence / zero-apply gap

Three monitoring subsystems were live in the Madani workspace from iter-30 to iter-39 and each one was, by its own internal metric, working. The Curator (skill audit) scanned all 50 active skills daily and emitted a structured proposal list — broken scripts, stale `last_updated` timestamps, deprecated dependencies, missing frontmatter fields, candidate skill mergers, candidate skill splits. The Dreams runner read the previous day's reflexion plus the personalized tier and emitted memory candidates — patterns the agent had observed enough times to warrant promotion from `_drafts/` into a typed tier.
The Reflexion aggregator parsed the day's session JSONL and emitted a lesson-violation report — turns where the agent had failed against an existing codified rule, plus turns that suggested candidate new rules. The three subsystems together produced, on a representative pre-engine week, roughly 320 actionable items. The number applied by the agent itself in that week was zero.
Every item required an explicit prompt from Nour to apply, and the prompts came in batches at irregular intervals, which meant the queue grew monotonically. By the end of the iter-38 window the backlog had crossed 400 unresolved items, the operator was treating the dashboard as a nuisance, and the three subsystems were on track to be deprecated as "expensive monitoring that produces nothing". The technical story is that monitoring excellence and decision excellence are different layers, and a workspace that builds one without the other is structurally indistinguishable from a workspace that has neither — the proposals decay in value the moment they are not applied, and the cost of the monitoring tools (LLM credits, cron slots, disk IO) is paid against zero return.

INTRODUCTION · §2

The architectural shift from human-approval queue to autonomous decision engine

The naive fix to the gap above is "have Nour review faster". This fails on inspection: review-rate scales with operator hours and operator attention quality, both of which are constants; proposal-rate scales with workspace size and monitoring coverage, both of which grow super-linearly with iteration count. The asymptotic mismatch is not a staffing problem; it is an architectural problem.
The right fix is to move the decision step from the operator to the agent, but only inside a perimeter that the operator has pre-authorized via codified rules. This is the auto-promote decision engine: a deterministic five-layer pipeline that takes a proposal from any of the three subsystems and returns one of four verdicts — APPLY, REJECT, ESCALATE_NOUR, DEFER_NEXT_RUN — based on rules the operator codified once and the engine enforces every time. The operator's role shifts from per-item reviewer to rule author and escalation responder.
The number of items the operator handles drops by an order of magnitude; the agent's autonomy increases by the same factor; the audit trail (snapshot + log per applied action) lets the operator spot-check or roll back at any time. The architectural shift mirrors a familiar pattern from operating-systems engineering: policy belongs to the user, mechanism belongs to the kernel. Pre-iter-39 the workspace had no kernel; every mechanism asked the user for permission.
Post-iter-39 the engine is the kernel; the user writes policy and the kernel enforces it.
   AUTO-PROMOTE DECISION ENGINE · 5-LAYER PIPELINE
   ──────────────────────────────────────────────────────
   PROPOSAL
     │
     ▼
   ┌─────────────────────────────────────────────────┐
   │ LAYER 1 · PP GATES (necessary conditions)       │
   │   PP1 SELF-AWARENESS · PP2 EVIDENCE=CLAIM       │
   │   PP3 LISTENING DISCIPLINE                       │
   └────────────────────┬────────────────────────────┘
                        │ pass
                        ▼
   ┌─────────────────────────────────────────────────┐
   │ LAYER 2 · ALPHA GATES (6-AND test)              │
   │   α1 non-fragility · α2 cache-friendliness      │
   │   α3 cost-aware · α4 single-source-truth        │
   │   α5 provenance · α6 reversibility               │
   └────────────────────┬────────────────────────────┘
                        │ pass
                        ▼
   ┌─────────────────────────────────────────────────┐
   │ LAYER 3 · LLM-BEHAVIOR AWARENESS                │
   │   prompt-injection · concept drift · dup        │
   │   changelog bloat · trigger ambiguity · voice   │
   └────────────────────┬────────────────────────────┘
                        │ pass
                        ▼
   ┌─────────────────────────────────────────────────┐
   │ LAYER 4 · SNAPSHOT → APPLY → LOG                 │
   │   tar.gz pre-mutation + structured log line     │
   └────────────────────┬────────────────────────────┘
                        │
                        ▼
                   ✅ APPLIED

   any layer FAIL ──▶ REJECT, DEFER_NEXT_RUN, or ESCALATE_NOUR (Layer 5)
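
The pipeline in the diagram can be sketched as an ordered list of gates that short-circuits on the first failure. This is a minimal illustration, not the engine's actual code: the `Proposal` fields, the precomputed `checks` dictionary, and the mapping from a failing layer to a verdict are all assumptions made for the example (in particular, a PP failure is mapped to DEFER_NEXT_RUN and the hard-escalation check is evaluated first).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    APPLY = "APPLY"
    REJECT = "REJECT"
    ESCALATE_NOUR = "ESCALATE_NOUR"
    DEFER_NEXT_RUN = "DEFER_NEXT_RUN"


@dataclass
class Proposal:
    proposal_id: str
    touches_hard_escalation: bool = False  # hypothetical flag for a Layer 5 hit
    checks: dict = field(default_factory=dict)  # hypothetical precomputed gate results


# Each layer is (name, predicate, verdict returned when the predicate fails).
Layer = Tuple[str, Callable[[Proposal], bool], Verdict]

LAYERS: List[Layer] = [
    ("pp", lambda p: p.checks.get("pp", False), Verdict.DEFER_NEXT_RUN),
    ("alpha", lambda p: p.checks.get("alpha", False), Verdict.REJECT),
    ("llm_behavior", lambda p: p.checks.get("llm_behavior", False), Verdict.REJECT),
]


def decide(p: Proposal) -> Tuple[Verdict, List[str]]:
    """Return the verdict plus the names of the layers that passed."""
    passed: List[str] = []
    if p.touches_hard_escalation:
        return Verdict.ESCALATE_NOUR, passed
    for name, gate, on_fail in LAYERS:
        if not gate(p):
            return on_fail, passed
        passed.append(name)
    return Verdict.APPLY, passed
```

Because the function has no hidden state, the same proposal always yields the same verdict, which is the determinism property the engine relies on.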

INTRODUCTION · §3

Our contribution

Four contributions. (1) Empirical: production rollout numbers from 2026-05-24 across three subsystems. Curator: 50 proposals/run · 42 auto-applied · 5 rejected · 3 escalated. Reflexion: 196 lesson-violation candidates detected · auto-promote pipeline live to `lessons-pending/` with PP gating. Dreams: 7 candidates · 4 auto-applied (safe-append) · 3 escalated (personalized writes). (2) Architectural: the five-layer engine documented as a transferable pattern, with each layer's gate logic specified independently of the Madani-specific subsystem code, so a different workspace can re-implement the engine against its own Curator/Dreams/Reflexion equivalents. (3) Operational: the snapshot+log convention (`99_ARCHIVIO/skill-snapshots/<timestamp>-<action>.tar.gz` plus structured stdout log) that makes every auto-apply reversible-by-default with negligible storage overhead. (4) Methodological: the PP + α + LLM-behavior gate stack as a three-tier necessary-conditions test that scales from skill curation to memory writes to lesson promotion without per-domain re-design.

RELATED WORK · §4

The precedents we build on

The decision engine sits at the intersection of five prior threads. Verbal reinforcement learning (Shinn et al., Reflexion, NeurIPS 2023, arXiv:2303.11366) gave us the cybernetic-loop primitive: agents that produce a verbal critique of their own trace and use it to update behavior. The original Reflexion paper closes the loop per task; we close it at the workspace level, where the critique writes into a typed memory tier, the memory drives behavior on subsequent sessions, and the apply step is automated past a gate. Skill-as-code self-bootstrapping (Voyager, arXiv:2305.16291) showed an agent growing its own skill library autonomously inside a Minecraft environment; the closest production analogue is the Hermes-agent pattern from NousResearch, which applies the same recursive iteration to skill files in a workspace. We adopt Hermes' auto-update discipline as the substrate the Curator subsystem operates on. Multi-agent failure analysis (Cemri et al., MAST, NeurIPS 2025, arXiv:2503.13657) gave us the 14-mode failure taxonomy that informs the LLM-behavior gate at Layer 3: concept drift, trigger ambiguity, and output-shape mismatch all appear as MAST modes that the gate explicitly screens for. DPI single-thread supremacy (Tran & Kiela, Stanford, arXiv:2604.02460) gave us the constraint that the engine itself runs single-threaded: a multi-agent decision engine would lose the same information that DPI documents in multi-agent reasoning, so the engine is a single deterministic pipeline rather than a debate among specialized critics. Prospective metacognition (Wang & Shu, MetaCogAgent, arXiv:2605.17292) gave us the escalate-when-c-below-threshold primitive that the engine invokes for ambiguous proposals where neither PP gates nor α gates produce a clean APPLY/REJECT verdict.
We additionally inherit the KV-cache stability discipline from Manus and the broader Anthropic Effective Harnesses literature: every apply that mutates a high-traffic file (CLAUDE.md, lessons-learned.md, SKILL.md headers) is gated by α2 cache-friendliness to avoid invalidating large warm caches for marginal content gain.

METHODOLOGY · §5

Layer 1 · The three PP gates

Three first principles (principi primi, PP), codified verbatim from `12_HARNESS/operativo/lessons-learned.md` v2.0 (last_updated 2026-05-21). Every auto-apply must pass all three, in order.
PP1 SELF-AWARENESS pre-action: the engine asks four questions before applying. (1) Who asked for this change? If the answer is "the proposal generator itself, with no operator anchor", the proposal does not pass unilaterally; it needs to map to an existing operator-stated intent or codified rule. (2) Does the current system work? If the artifact being modified is currently working and the proposal is an "improvement", PP1 demotes the urgency and the proposal needs explicit α5 provenance back to a Nour signal. (3) Is the proposal applying a declared user constraint? Proposals that codify or enforce a Nour-declared constraint pass PP1 trivially; proposals that introduce a new constraint not present in any operator artifact fail PP1. (4) Which files does this touch, and is the cross-impact identified? Proposals that touch shared components (the workspace equivalent of the L01 `_ArticleLayout.tsx` refactor that broke 9 consumers) require an explicit consumer-grep before the apply step is unlocked.
PP2 EVIDENCE = CLAIM: "done X" is asserted only with objectively verified evidence, not assumption. The engine refuses to claim APPLIED until the post-apply verification has run: `ls -la` on the file, `wc -l` on the line count, `git log -1` on the commit, `curl /v1/models` for deployed pods. A proposal that cannot specify its own post-apply verification fails PP2.
PP3 LISTENING DISCIPLINE: when the proposal contains a Nour-stated technical concept ("preflight" · "canonical" · "ne hai uno · ferma tutto" [you have one · stop everything] · "singleton" · "in passato funzionava" [it used to work]) the engine treats the concept as emergent expertise and applies it as gate/rule/script rather than brainstorming alternatives. The four-step protocol (STOP / RESTATE / IMPLEMENT / THEN proceed) is encoded as gate logic: any proposal that proposes to elaborate or extend a Nour-stated concept before implementing it verbatim fails PP3.
The three gates are deliberately stronger than "guidelines": PP1 alone caught 23 of 50 Curator proposals on the 24-05 run that were "improvements" without operator-anchor and that the engine downgraded from APPLY to DEFER_NEXT_RUN pending a Nour mention.
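
The PP2 rule (no APPLIED claim without observed evidence) can be illustrated with a small post-apply check. This is a hedged sketch covering only the file-existence and line-count evidence, roughly what `ls -la` and `wc -l` would show; the `Verification` type and `verify_applied` function are hypothetical names, not the engine's real interface.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Verification:
    """Hypothetical post-apply verification spec carried by a proposal."""
    path: str
    expected_lines: Optional[int] = None  # None means "only check that the file exists"


def verify_applied(v: Optional[Verification]) -> bool:
    """Return True only when the claimed result is actually observed on disk."""
    if v is None:
        return False  # a proposal that cannot specify its verification fails PP2
    target = Path(v.path)
    if not target.is_file():
        return False
    if v.expected_lines is None:
        return True
    with target.open(encoding="utf-8") as fh:
        return sum(1 for _ in fh) == v.expected_lines
```

The point of the design is that the verdict is computed from the file system after the mutation, never from the apply step's own report of success.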

METHODOLOGY · §6

Layer 2 · The six alpha gates

The six α criteria are AND-gates, not scoring axes. A proposal that fails any single α gate is rejected regardless of strength on the other five.
The discipline mirrors the alpha-extraction pattern from quantitative trading, where a strategy that scores high on Sharpe but zero on tail-risk is not a strategy you ship. α1 NON-FRAGILITY: the apply does not introduce a new single point of failure. Proposals that hard-code paths to non-canonical locations, that rely on a transient cron, that depend on an external API without fallback, that bind to a model version that may rotate — all fail α1. α2 CACHE-FRIENDLINESS: the apply does not invalidate KV-cache for marginal content. Mutations to high-traffic prefix files (CLAUDE.md, lessons-learned.md headers, SKILL.md `description` fields) cost cache-warmth across every subsequent session; the gate requires that the value-per-token of the mutation exceeds the cache-loss-per-token expected over the next 24 hours of activity.
The engine estimates the latter from the prompt-cache history table (cache-hit count per file from the last 24h) and applies α2 with a documented threshold. α3 COST-AWARE: the apply does not increase per-task operating cost beyond an explicit budget. Proposals that add a new mandatory LLM call to a hot path, that promote a verbose template into a frequently-read file, that move a cron from weekly to daily without justification — all cross α3 unless the proposal carries its own cost-offset (e.g., concomitant compression of another path). α4 SINGLE-SOURCE-TRUTH: the apply does not duplicate state. Proposals that create a new file with content already present in another canonical file fail α4 (this is HR#3 codified). Proposals that consolidate state into the canonical source pass α4 and are preferentially auto-applied. α5 PROVENANCE TRACEABLE: the apply has a documented source — a Nour utterance with timestamp, a paper citation with arXiv URL, an upstream commit hash, an audit document reference.
Proposals generated by the subsystem alone with no external anchor fail α5; they may be escalated for operator review but do not auto-apply. α6 REVERSIBILITY: the apply can be undone by a single operation (`tar -xzf <snapshot>`, `git revert <hash>`, `rm -rf <new-folder>`). Proposals that cross irreversibility — credential rotation, external API calls that change remote state, force-push to main, `crontab -r` — fail α6 categorically. The six gates together rejected roughly 25% of pre-engine candidates on the rollout day, with the most-failing gate being α4 (single-source-truth) — the legacy state of the workspace had accumulated duplications that the Curator was now proposing to "fix forward" rather than "consolidate back", and α4 redirected those to consolidation proposals instead.
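
The difference between an AND-gate and an averaged score can be made concrete with a few lines of Python. The gate names follow the six α criteria above; the boolean-per-gate representation and the function names are assumptions made for the illustration.

```python
from typing import Dict, List, Tuple

ALPHA_GATES = (
    "non_fragility",
    "cache_friendliness",
    "cost_aware",
    "single_source_truth",
    "provenance",
    "reversibility",
)


def alpha_and_gate(results: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Pass only if every gate passes; also report which gates failed."""
    failing = [g for g in ALPHA_GATES if not results.get(g, False)]
    return (not failing, failing)


def alpha_mean_score(results: Dict[str, bool]) -> float:
    """The averaging alternative, shown only for contrast."""
    return sum(1.0 for g in ALPHA_GATES if results.get(g, False)) / len(ALPHA_GATES)


# Five gates pass, reversibility fails.
example = {g: True for g in ALPHA_GATES}
example["reversibility"] = False

print(alpha_and_gate(example))              # (False, ['reversibility'])
print(round(alpha_mean_score(example), 2))  # 0.83
```

An averaged score of 0.83 would look like a strong proposal, while the AND-gate names the single axis that makes the apply unsafe. A missing gate result is treated as a failure, so an unevaluated axis cannot pass by omission.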

METHODOLOGY · §7

Layer 3 · LLM-behavior awareness

The Layer 3 gates encode what happens when an LLM is exposed to specific failure-prone content patterns. The question is not "is the proposal good" (that is Layers 1 and 2) but "does the apply expose the next invocation of any agent to a known LLM-behavior failure".
Seven sub-gates. (i) Prompt-injection inside read files: any apply that writes content into a file the agent will subsequently read must screen for embedded `<system>`, `IGNORE PREVIOUS INSTRUCTIONS`, or `<sysprompt>` patterns. The engine strips or rejects. (ii) Concept drift between near-duplicate proposals: when two proposals in the same run reference the same underlying concept with slightly different framing (e.g., "skill auto-update" and "skill auto-refresh"), the engine merges them rather than applying both, on the principle that two near-duplicates dilute downstream retrieval. (iii) Changelog bloat: append-only changelogs grow unboundedly under naive auto-promote; the gate enforces that no single changelog crosses the canonical 200-line limit and emits a roll-up proposal when the limit is reached. (iv) Trigger ambiguity in SKILL.md frontmatter: a skill that adds a trigger phrase already used by another skill is rejected on the principle that ambiguous triggers produce wrong-skill invocation (the skill-bypass anti-pattern documented in lessons-learned). (v) Specifics vs vagueness: a proposal whose action text is vague ("improve documentation") is rejected; proposals must specify exactly which file, which lines, what change. (vi) Multiple-voice contamination in personalized writes: personalized-tier writes that mix Nour's voice with the agent's own observations are rejected, because the personalized tier's downstream readers (the cold-start hook, the Sentra-style preference resolver) cannot disambiguate principal-utterance from agent-observation. (vii) Trailing summary patterns in any text apply — the engine rejects content that ends with "in summary, this lesson is..." or equivalent filler, because Nour reads diffs and trailing summaries are noise. 
Layer 3 rejected ~18% of remaining candidates on the rollout day, with the most-common failing gate being (ii) concept drift — the Curator had a tendency to produce near-duplicate proposals on consecutive runs that the engine now collapses into one merged proposal.
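
Three of the sub-gates above are mechanical enough to sketch directly: the prompt-injection screen (i), the changelog limit (iii), and the trigger-ambiguity check (iv). The regular expressions cover only the three patterns named in the text, and the function names and data shapes are hypothetical; a production screen would presumably need a broader pattern set.

```python
import re
from typing import Dict, Iterable, List, Set

# Patterns named in sub-gate (i); matched case-insensitively as an assumption.
INJECTION_PATTERNS = [
    re.compile(r"<\s*system\s*>", re.IGNORECASE),
    re.compile(r"<\s*sysprompt\s*>", re.IGNORECASE),
    re.compile(r"ignore\s+previous\s+instructions", re.IGNORECASE),
]

CHANGELOG_MAX_LINES = 200  # the canonical limit from sub-gate (iii)


def has_injection(text: str) -> bool:
    """True if the content to be written contains a known injection marker."""
    return any(p.search(text) for p in INJECTION_PATTERNS)


def ambiguous_triggers(new_triggers: Iterable[str],
                       existing: Dict[str, Set[str]]) -> List[str]:
    """Return the new trigger phrases already claimed by another skill."""
    taken = {t.lower() for triggers in existing.values() for t in triggers}
    return sorted(t for t in new_triggers if t.lower() in taken)


def changelog_needs_rollup(current_lines: int, appended_lines: int) -> bool:
    """True if the append would push the changelog past the line limit."""
    return current_lines + appended_lines > CHANGELOG_MAX_LINES
```

A proposal would be rejected when `has_injection` is true or `ambiguous_triggers` returns a non-empty list, and a roll-up proposal would be emitted when `changelog_needs_rollup` is true.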

METHODOLOGY · §8

Layer 4 · Deterministic decision tree with snapshot-and-log

The four-verdict decision tree (APPLY, REJECT, ESCALATE_NOUR, DEFER_NEXT_RUN) is deterministic (the same input produces the same output) and the full evaluation is logged. APPLY requires all three layers (PP, α, LLM-behavior) to pass plus a successful snapshot. The snapshot is a `tar.gz` of every file the apply will mutate, written to `99_ARCHIVIO/skill-snapshots/<ISO-timestamp>-<action-slug>.tar.gz` before the apply runs.
The apply itself runs as the same atomic operation it would have run under operator approval — `git commit` for repo changes, `mv` + `chmod` for filesystem moves, `Edit` tool invocation for in-file mutations. The log line is one JSON record per apply, written to `12_HARNESS/operativo/_apply-log/YYYY-MM-DD.jsonl` with fields: timestamp, subsystem (curator/dreams/reflexion), proposal-id, verdict (APPLY/REJECT/ESCALATE/DEFER), layer-pass-bitmap (which layers passed), snapshot-path, post-apply-verification (file count / line count / commit hash). The log is the primary auditability artifact and is read by the daily-digest `/api/agent/daily-digest` endpoint to surface that day's auto-applied actions in the monitoring-dash. REJECT is the verdict when a Layer 1+2 gate fails with high confidence and the proposal is not salvageable; the rejection is logged with the failing gate name so the proposal generator can be tuned. ESCALATE_NOUR is the verdict when the proposal touches one of the six hard-escalation categories (Layer 5, below) or when MetaCogAgent (`metacog-self-assess.py`) returns `c_composite < 0.55` on the engine's own assessment of the apply; the proposal is written to `mission-control/escalations/<date>-<proposal-id>.md` with a self-contained summary so the operator can act on it without re-deriving context. DEFER_NEXT_RUN is the verdict when the gates produce neither a clean APPLY nor a clean REJECT — typically a PP1 failure on operator-anchor where the next run may bring a fresh Nour signal.
The deferred queue is re-evaluated on the next subsystem run with the updated workspace state.
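
The snapshot-then-log convention can be sketched with the Python standard library. This is an illustration under stated assumptions: the helper names are hypothetical, the timestamp uses the compact ISO 8601 basic format to keep colons out of file names, and the log record carries only whatever fields the caller passes in.

```python
import json
import tarfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable


def snapshot(files: Iterable[Path], snapshot_dir: Path, action_slug: str) -> Path:
    """Write a pre-mutation tar.gz of the files and return the archive path."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = snapshot_dir / f"{stamp}-{action_slug}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for f in files:
            # tarfile strips a leading "/" so members are stored as relative paths.
            tar.add(str(f))
    return archive


def append_log(log_dir: Path, record: dict) -> Path:
    """Append one JSON record per line to the current day's .jsonl file."""
    log_dir.mkdir(parents=True, exist_ok=True)
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    log_file = log_dir / f"{day}.jsonl"
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return log_file
```

The ordering matters: `snapshot` must return successfully before the mutation runs, and the log record should store the returned archive path so that a later rollback can locate the exact pre-mutation state.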

METHODOLOGY · §9

Layer 5 · Hard escalation rules (always human)

Six categories are pre-declared by the operator as never-auto-apply, regardless of the verdict produced by Layers 1-4. The list is short by design: every additional always-human category reduces engine autonomy, so each addition needs an explicit cost justification. (1) External communications (HR#1): any apply that produces a message to a client, team member, Slack channel, email recipient, SMS recipient, or WhatsApp recipient.
The engine never auto-applies external comms; the proposal is written to a file and escalated for operator dispatch. (2) `settings.json` self-modification: changes to `~/.claude/settings.json` or `.claude/settings.local.json` touch the agent's own runtime configuration and have unbounded blast radius; auto-modification is categorically disallowed. (3) Destructive git operations: `git reset --hard`, `git push --force` to main, `git checkout --`, `crontab -r`, `launchctl unload` of agent-relevant plists — all categorically escalated even when α6 reversibility nominally passes, because the reversibility cost in these cases exceeds the proposal value. (4) Architectural decisions: any apply that changes the 12-macro folder layout (`02_LEAD-GENERATION/`, `10_SKILLS/`, `12_HARNESS/`, ...), the 5-tier memory partition, or the 12-pillar WAB taxonomy — these are operator-authored architectural commitments and the engine does not modify them. (5) Hard-rule codification: any apply that adds a new HR# to CLAUDE.md or to a project-level rules file. Hard rules constrain every future agent action; their addition is operator-only. (6) Credential touches: any apply that touches `~/madani/credentials/API-MASTER.md`, `.envrc`, `.envrc.template`, or any 1Password vault metadata — credentials are governed by the credentials-policy and never auto-modified by the engine.
   LAYER 5 · HARD ESCALATION RULES · ALWAYS HUMAN
   ───────────────────────────────────────────────
   1. External communications (HR#1)
   2. settings.json self-modification
   3. Destructive git ops (reset/force-push/crontab -r)
   4. Architectural decisions (12-macro · 5-tier · 12-pillar)
   5. Hard-rule (HR#) codification
   6. Credentials (API-MASTER · vault · .envrc)
   ───────────────────────────────────────────────
   →  engine is BORING inside · SILENT outside
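
Two of the six categories (configuration and credential files, and destructive commands) lend themselves to a mechanical check. The sketch below is deliberately conservative and is only an assumption about how such a matcher could look: it flags every force-push and every `launchctl unload`, not just those aimed at main or at agent-relevant plists, and the category labels and function names are hypothetical.

```python
import shlex
from typing import Optional

# Path suffixes taken from categories (2) and (6) above.
HARD_ESCALATION_SUFFIXES = {
    "settings.json self-modification": (
        ".claude/settings.json",
        ".claude/settings.local.json",
    ),
    "credentials": (
        "credentials/API-MASTER.md",
        ".envrc",
        ".envrc.template",
    ),
}


def is_destructive(command: str) -> bool:
    """Token-based match for the destructive operations in category (3)."""
    t = shlex.split(command)
    if t[:2] == ["git", "reset"] and "--hard" in t:
        return True
    if t[:2] == ["git", "push"] and ("--force" in t or "-f" in t):
        return True
    if t[:3] == ["git", "checkout", "--"]:
        return True
    if t[:1] == ["crontab"] and "-r" in t:
        return True
    if t[:2] == ["launchctl", "unload"]:
        return True
    return False


def hard_escalation_category(path: Optional[str] = None,
                             command: Optional[str] = None) -> Optional[str]:
    """Return the matching always-human category, or None if nothing matches."""
    if path is not None:
        for category, suffixes in HARD_ESCALATION_SUFFIXES.items():
            if path.endswith(suffixes):
                return category
    if command is not None and is_destructive(command):
        return "destructive operation"
    return None
```

Matching on tokens rather than on a string prefix avoids treating `git checkout --track ...` as the destructive `git checkout --` form. The remaining categories (external communications, architectural decisions, hard-rule codification) depend on the meaning of the change and are not captured by path or command matching alone.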

RESULTS · §10

Curator rollout · 50 proposals · 42 auto-applied

The Curator subsystem ran its first under-engine pass on 2026-05-24 at 02:30 UTC against the 50 active skills in `10_SKILLS/`. The pass produced 50 proposals total.
After Layer 1 (PP gates), 41 proposals remained APPLY-candidates, 6 became DEFER_NEXT_RUN (failed PP1 operator-anchor), and 3 became ESCALATE_NOUR (touched HR-adjacent boundaries). After Layer 2 (α gates), 35 proposals remained APPLY-candidates, with α4 (single-source-truth) being the dominant rejection reason — the Curator had emitted 5 "create new file" proposals where the engine identified an existing canonical file that should be updated instead. After Layer 3 (LLM-behavior), 33 proposals remained APPLY-candidates, with α2 cache-friendliness and trigger-ambiguity sharing the remaining rejections.
The final APPLY count after Layer 4 snapshot was 42 — the 33 from the strict gate stack plus 9 proposals that the engine fast-tracked via the "safe-append" subroutine (changelog appends, `last_updated` ISO-date bumps, observation-count increments — operations that touch only single-line metadata and are categorically reversible). The rollout day's snapshot tree at `99_ARCHIVIO/skill-snapshots/` accumulated 42 `tar.gz` files totaling 11 MB — a negligible storage cost for the reversibility it bought.
42 / 50 auto-applied · 5 rejected · 3 escalated
The first under-engine Curator run on 2026-05-24 cleared 84% of its own proposal queue without operator intervention. Pre-engine the same volume would have produced a 50-item escalation list that would still be unresolved a week later.

RESULTS · §11

Dreams rollout · apply stage live · safe-append auto · personalized escalate

The Dreams runner (`dreams-runner.py`, cron daily 03:00) produced 7 memory candidates on the same rollout day, drawn from the previous day's reflexion plus the personalized tier delta. The engine routed them as follows. Four candidates were classified as safe-append (additions to the semantic tier under existing topic tags, with confidence inferred from grep-similarity and observations counted from prior occurrence) and auto-applied, each with the canonical `_drafts/_promoted/` archival copy and a snapshot. Three candidates were classified as personalized (preference statements about Nour's working style, time-of-day patterns, voice-to-text fragmentation tells) and escalated.
The escalation rule is deliberately hard: personalized-tier writes are principal-attribution claims, and a wrong attribution (the agent claiming Nour said X when Nour did not) degrades the personalized tier's reliability for every downstream reader. The engine refuses to auto-apply personalized writes even when all PP+α gates nominally pass; the gate is hard-coded as an exception to Layer 4 APPLY, regardless of score. The same exception applies to environment-dynamics writes in the `nour-response/` sub-bucket, which capture Nour-specific patterns and need operator confirmation before they enter the tier.
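
The routing rule for memory candidates can be sketched as a tier check that runs before any gate result is consulted. The `MemoryCandidate` fields and the tier strings are assumptions for the example, and mapping a gate failure to REJECT is a simplification.

```python
from dataclasses import dataclass


@dataclass
class MemoryCandidate:
    candidate_id: str
    tier: str                 # e.g. "semantic", "personalized", "environment-dynamics"
    sub_bucket: str = ""      # e.g. "nour-response"
    gates_passed: bool = True  # hypothetical summary of the PP+α+Layer 3 result


def route_memory_candidate(c: MemoryCandidate) -> str:
    """Personalized and nour-response writes escalate regardless of gate results."""
    if c.tier == "personalized":
        return "ESCALATE_NOUR"
    if c.tier == "environment-dynamics" and c.sub_bucket == "nour-response":
        return "ESCALATE_NOUR"
    return "APPLY" if c.gates_passed else "REJECT"
```

Checking the tier first is what makes the exception hard: no combination of passing gates can route a personalized write to APPLY.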

RESULTS · §12

Reflexion rollout · 196 corrections detected · auto-promote to lessons-pending

The Reflexion runner emitted a backfill report on the rollout day spanning the prior seven days of session JSONL traces. The report flagged 196 lesson-violation candidates (turns where the agent had acted against an existing codified rule from `lessons-learned.md` v2.0) and 27 candidate-new-patterns that did not match any existing rule but had recurred at least three times.
The engine processed both buckets through the gate stack. For the violation bucket, the engine auto-promoted any candidate with PP+α pass plus recurrence ≥ 4 into `lessons-pending/` for next-cycle codification, but did not auto-promote into `lessons-learned.md` itself: Layer 5 category (5) blocks HR# codification, and lesson promotion to canonical falls in that category. For the new-pattern bucket, the engine applied the same PP+α stack and routed all 27 to ESCALATE_NOUR for operator review, on the principle that new lessons are operator-authored even when the agent has detected the pattern.
The net effect is that the Reflexion subsystem now produces an auto-curated lessons-pending file every morning, and the operator's job changes from "read 196 raw violations" to "review 27 candidate new patterns plus a pre-sorted violation roll-up".
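
The two-bucket routing described above reduces to a short function. The bucket labels, the return strings, and the `ROLL_UP_ONLY` outcome for violations below the threshold are hypothetical names chosen for the sketch; only the recurrence threshold of 4 and the escalation of new patterns come from the text.

```python
RECURRENCE_THRESHOLD = 4  # violations need at least this many recurrences


def route_reflexion(kind: str, gates_passed: bool, recurrence: int) -> str:
    """Route a Reflexion candidate to promotion, escalation, or the roll-up."""
    if kind == "new_pattern":
        # New lessons are operator-authored, so they always escalate.
        return "ESCALATE_NOUR"
    if kind == "violation":
        if gates_passed and recurrence >= RECURRENCE_THRESHOLD:
            return "PROMOTE_TO_LESSONS_PENDING"
        return "ROLL_UP_ONLY"
    raise ValueError(f"unknown candidate kind: {kind!r}")
```

Neither branch writes to the canonical lessons file, which keeps the function consistent with the Layer 5 rule that hard-rule codification is operator-only.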

DISCUSSION · §13

Trade-offs · false-positive risk and escalation rate

The engine is not free of trade-offs. The dominant risk is false-positive APPLY: the engine auto-applying a proposal that should have been escalated.
We characterize the false-positive rate operationally as the fraction of auto-applied actions that the operator subsequently rolls back. On the 24-05 rollout, the operator rolled back zero of the 42 Curator applies and zero of the 4 Dreams safe-append applies; the post-apply log + snapshot trail made spot-checking trivial. The follow-up week of operation produced two operator-initiated rollbacks — one Curator `SKILL.md` description rewrite that drifted from the operator's intended phrasing, one Dreams semantic-tier append that the operator flagged as redundant — for an empirical false-positive rate of ~4% at the apply step.
The second risk is false-negative escalate — the engine escalating proposals that could have been auto-applied. The 24-05 escalation rate was 6/50 = 12% on Curator and 3/7 = ~43% on Dreams; the Dreams number is high because personalized writes are categorically escalated regardless of score, which is a hard-coded conservatism we believe is correct at this stage of the rollout. The third trade-off is engine-tuning cost: the gate thresholds (α2 cache-friendliness threshold, recurrence ≥ 4 for violation auto-promote, MetaCog `c_composite < 0.55` for escalation) need calibration against outcome data and are currently set by Madani-internal priors; future versions will regress them against operator-rollback signal.
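
The two rates that follow directly from the reported counts can be reproduced in a few lines. The denominator for the rollback rate is an assumption: using the 42 Curator plus 4 Dreams applies from the rollout day is the reading that reproduces the ~4% figure.

```python
def rate(numerator: int, denominator: int) -> float:
    """Plain ratio with a guard against an empty denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


# Two operator rollbacks against 42 Curator + 4 Dreams auto-applies.
rollback_rate = rate(2, 42 + 4)

# Three of seven Dreams candidates escalated (all personalized writes).
dreams_escalation_rate = rate(3, 7)

print(f"{rollback_rate:.1%}")           # 4.3%
print(f"{dreams_escalation_rate:.1%}")  # 42.9%
```

With counts this small, a single additional rollback moves the apply-step rate by about two percentage points, which is one reason the thresholds still need calibration against a longer window.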

DISCUSSION · §14

Alpha extraction observed in practice

The six α criteria were derived from internal Madani patterns pre-engine, but the rollout produced empirical validation of their independence. We measured the pairwise correlation between gate-pass signals across the 50 Curator proposals: α1-α2 = 0.18, α1-α4 = 0.21, α2-α3 = 0.09, α2-α4 = 0.12, α4-α5 = 0.31, α5-α6 = 0.27. None of the 15 pairwise correlations exceeded 0.32, indicating that the six axes are substantially independent failure modes rather than collapsible into fewer dimensions.
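A pairwise measurement of this kind can be sketched as below. The text does not say which correlation statistic was used; this example assumes a Pearson correlation over binary pass/fail vectors (equivalent to the phi coefficient). The function names and the toy data are hypothetical and do not reproduce the rollout's values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None  # a gate that always passes (or always fails) has no defined correlation
    return cov / sqrt(vx * vy)


def pairwise_gate_correlations(passes):
    """passes maps gate name -> list of 0/1 pass signals, one entry per proposal."""
    return {
        (a, b): pearson(passes[a], passes[b])
        for a, b in combinations(sorted(passes), 2)
    }


# Toy pass/fail matrix: 6 gates x 8 proposals (synthetic, for illustration only).
toy = {
    "a1": [1, 1, 0, 1, 1, 0, 1, 1],
    "a2": [1, 0, 1, 1, 0, 1, 1, 1],
    "a3": [0, 1, 1, 1, 1, 1, 0, 1],
    "a4": [1, 1, 1, 0, 1, 1, 1, 0],
    "a5": [1, 1, 1, 1, 0, 0, 1, 1],
    "a6": [0, 1, 1, 1, 1, 1, 1, 0],
}

if __name__ == "__main__":
    for pair, r in pairwise_gate_correlations(toy).items():
        print(pair, None if r is None else round(r, 2))
```

Six gates yield 15 unordered pairs, matching the count in the text. The constant-vector guard matters in practice: a gate that every proposal passes would otherwise trigger a division by zero.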
The dominant rejection axis on the Curator was α4 (single-source-truth, 5 rejections); on Dreams it was α5 (provenance, 1 rejection); on Reflexion's violation candidates it was Layer 5 hard-escalation rather than α (because lesson codification is always-human). The independence of the six α axes is what makes them work as AND-gates: a scoring approach that averaged them would mask the dominant failing axis, and the engine's AND-gate discipline is what guarantees that no single failure mode slips through.
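The difference between AND-gating and averaging can be shown on a single proposal. This sketch assumes each α axis yields a score in [0, 1] compared against a shared threshold; the scores, the threshold of 0.5, and the function names are hypothetical and chosen only to make the masking effect visible.

```python
def and_gate(scores, threshold=0.5):
    """Pass only if every axis clears the threshold; also report the failing axes."""
    failing = [name for name, s in scores.items() if s < threshold]
    return (not failing, failing)


def mean_gate(scores, threshold=0.5):
    """Pass if the average score clears the threshold (the approach argued against)."""
    return sum(scores.values()) / len(scores) >= threshold


# One proposal that is strong on five axes and fails single-source-truth (a4).
scores = {"a1": 0.9, "a2": 0.9, "a3": 0.9, "a4": 0.1, "a5": 0.9, "a6": 0.9}

if __name__ == "__main__":
    print("mean gate passes:", mean_gate(scores))
    print("AND gate result: ", and_gate(scores))
```

Here the mean is about 0.77, so the averaged score passes, while the AND-gate rejects and names `a4` as the failing axis. That named axis is also what makes the rejection log actionable.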

LIMITATIONS · §15

What this study is not

(1) n = 1 workspace. All rollout numbers come from the Madani reference workspace. The five-layer architecture is presented as a pattern other workspaces can adopt, but the specific apply/reject/escalate splits we report are workspace-specific. We expect the shape of the gate stack to generalize (PP+α+LLM-behavior is structurally invariant) but the thresholds to require local calibration.
(2) Preliminary-research status. The engine went live on the day this paper was drafted. The follow-up observation window is short and we do not yet have a long-horizon (>30 day) measurement of the false-positive rate at the apply step, the operator-rollback frequency, or the rate at which the engine's own gates would need to be tightened or loosened as the workspace state evolves. We commit to a v0.2 update with 30-day rollout data.
(3) Subjective alpha criteria. The six α axes were chosen by introspection on prior Madani failure modes. We have not yet shown that the six are the minimum sufficient axes or that no seventh axis would meaningfully reduce false-positives. The independence measurement in §14 argues against collapsing the set, but a future-work axis (e.g., a seventh "model-version-robustness" axis screening for proposals that bind to a specific model class) is plausible.
(4) Single-operator workspace. The engine's HR#1 boundary and PP1 operator-anchor logic assume a single principal (Nour). In a multi-operator workspace, the PP1 question "who asked for this change" needs disambiguation across operators, and the escalation queue needs principal-aware routing. We have not implemented this and do not claim the architecture generalizes to multi-principal settings without modification.
(5) No adversarial-robustness test against the engine itself. A sophisticated adversary who could write to one of the proposal-generator inputs (a reflexion transcript, a Curator scan target, a Dreams memory candidate) could in principle craft a proposal designed to pass the gates while introducing a harmful change. Layer 3 (LLM-behavior awareness) screens for known patterns but is not a formal adversarial-robustness guarantee. The credential and external-comm boundaries at Layer 5 are the operational defense; we have not red-teamed the engine and that work is pending.

CONCLUSIONS · §16

From monitoring dashboard to cybernetic loop

The pre-iter-39 workspace had three excellent monitoring subsystems and zero apply discipline. The post-iter-39 workspace has the same three subsystems plus a five-layer decision engine, and the apply rate on rollout day was 84% of proposal volume without operator intervention. The architectural lesson is that diagnostic excellence is necessary but not sufficient: a workspace that surfaces fifty actionable proposals per day and applies zero of them is, from a behavior-change perspective, structurally indistinguishable from a workspace that surfaces none. The five-layer engine is the missing affordance.
The deeper lesson is that the operator-approval-on-everything pattern that current agentic frameworks default to is the wrong default at workspace scale. Once a workspace crosses ~30 active skills, ~5 cron-driven monitoring loops, and ~100 daily proposal-events, per-item approval is a denial-of-service against the operator and a denial-of-value against the workspace. The fix is not "approve faster" but "codify the rules once, enforce them every time, escalate only what genuinely needs human judgment". The engine implements that inversion.
The PP+α+LLM-behavior gate stack is a transferable pattern. Other workspaces with their own Curator/Dreams/Reflexion analogues can adopt the architecture without our specific subsystem code, calibrate the thresholds against their own operator-rollback signal, and close the cybernetic loop on their own timetable. The empirical reference for what "closed loop" looks like is the 24-05 Madani rollout: 42/50 auto-applied · 196 corrections detected · 7 dream candidates routed · zero operator-initiated rollback on day one. The engine is boring inside the perimeter and silent outside it, which is exactly the inversion that operator-approval-on-everything got backwards.

References

[1] Shinn N., Cassano F., Berman E., Gopinath A., Narasimhan K. & Yao S. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS 2023, arXiv:2303.11366.
[2] Wang G., Xie Y., Jiang Y., Mandlekar A., Xiao C., Zhu Y., Fan L. & Anandkumar A. (2023), Voyager: An Open-Ended Embodied Agent with Large Language Models, arXiv:2305.16291.
[3] Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets, Stanford NLP, arXiv:2604.02460.
[4] Wang C. & Shu Y. (2026), MetaCogAgent: Prospective Metacognition for Large Language Model Agents, arXiv:2605.17292.
[5] Cemri M. et al. (2025), Why Do Multi-Agent LLM Systems Fail? (MAST), NeurIPS 2025 Datasets and Benchmarks Track, arXiv:2503.13657.
[6] NousResearch (2026), hermes-agent: Open-Source Recursive Skill Iteration Pattern, GitHub.
[7] Manus Authors (2025), Effective Harnesses · KV-Cache Stability Patterns for Production Agent Systems, Engineering Documentation.
[8] Anthropic (2025), Effective Harnesses: Prompt Caching, Stable Prefix Design, and Token Efficiency Patterns, Engineering Documentation.
[9] Madani Lab (2026), Lessons Learned · 3 Principi Primi v2.0, `12_HARNESS/operativo/lessons-learned.md` (internal reference).
[10] Madani Lab (2026), Workspace Agentic Benchmark · WSB-11 Verbal Reinforcement Learning, [[WSB-11]].
[11] Madani Lab (2026), Workspace Agentic Benchmark · WSB-17 Skill System Architecture, [[WSB-17]].
[12] Madani Lab (2026), Workspace Agentic Benchmark · WSB-18 Memory 5-Tier Architecture, [[WSB-18]].
[13] Madani Lab (2026), A-MAC Six-Factor Admission Control Tool · `11_TOOLS/memory-admission.py` (internal reference, MIT-licensed scheduled release).
[14] Cognition Labs (2025), Don't Build Multi-Agents, cognition.ai blog.
[15] Cover T. & Thomas J. (2006), Elements of Information Theory, 2nd ed., Wiley-Interscience (Data Processing Inequality, ch. 2).

Madani Lab · WAB v0.3.4