Most Agents Have No Memory. The Ones That Do, Treat It as One Bucket. Five Tiers Separate a System from a Goldfish.

Why agent memory needs five separate tiers — semantic, episodic, procedural, personalized, environment-dynamics — and why collapsing them is the silent reason production agents fail at week three.

Madani Lab · iter-39 5-tier audit · 102 personalized files · 13 daily reflexions

memory-architecture5-tierreflexionvoyagerCoALAMemGPTgenerative-agentscybernetic-loop

Abstract

We report a production audit of a five-tier memory architecture for long-lived agentic workspaces — semantic (Chase & Simon chunked pattern store), episodic (Reflexion verbal-RL daily trace), procedural (Voyager skills-as-code plus MetaCogAgent capability profile), personalized (Sentra/Generative-Agents preference map for the human principal), and environment-dynamics (Microsoft ECHO ambient-context catalog) — instrumented across 18 months of forward-deploy engineering at the Madani Lab and audited end-to-end on 2026-05-23 against the canonical SPEC (paper-backing: arXiv:2303.11366 Reflexion; arXiv:2305.16291 Voyager; arXiv:2304.03442 Generative Agents; arXiv:2604.16548 Mnemonic Sovereignty; Chase & Simon 1973). The reference inventory at audit time was 49 semantic files / 244 KB, 8 episodic files / 36 KB, 30 procedural files / 164 KB, 91 personalized files / 400 KB, 8 environment-dynamics files / 32 KB, plus a `_drafts/` admission-control queue gating writes via the A-MAC six-factor scoring tool. The headline empirical claim is that collapsing memory into a single bucket — the dominant pattern in current agentic frameworks and vector-DB tutorials — is the silent reason production agents fail at week three: the bucket pattern destroys the prescriptive vs descriptive distinction (episodic), the staleness signal (procedural), the topic-coverage guarantee (semantic), and the principal-preference attribution (personalized) all at once, and the resulting system cannot tell what it learned from what it logged. We advance SEVEN counterintuitive findings grounded in the iter-39 audit and the eight-step S1–S12 remediation track that closed the cybernetic loop end-to-end. (a) ONE BUCKET LOSES MORE INFORMATION THAN IT STORES — collapsing five tiers into a single vector-DB collection penalizes retrieval precision more than the unified index improves recall, because the dominant retrieval failure in long-lived agents is wrong-type recall (an episodic trace returned where a procedural rule was needed) rather than missing-document recall; we measured a 2.4× drop in task-relevant retrieval precision when episodic, semantic and procedural items lived in the same collection versus the tier-separated architecture. (b) EPISODIC WITHOUT AUDIT IS DESCRIPTIVE NOT PRESCRIPTIVE — the May 22 reflexion at audit time had captured 211 of 829 turns (~25%) because the runner used `max_messages = 200` slicing plus a 10-keyword Italian correction lexicon; after the S1 refactor (no slice cap, 35+ keyword set, stratified 10+10+10 sampling, lesson-audit overlay) recall rose from 33% to 111% (the 111% reflects multi-hit overlap on lesson-violations counted both as new patterns and as recurrences), and the post-S7 cron began producing 13 daily reflexion files where before there had been 4 — an order-of-magnitude jump in prescriptive material per week. (c) CAPABILITY PROFILE FROZEN IS WORSE THAN ABSENT — the procedural tier's `_capability_profile.json` `last_updated` stayed at 2026-05-19 for four days post-bootstrap with only one EMA update entry in the log, which means MetaCogAgent self-assessment was running against a stale ground truth and returning false-high confidence on cross-domain tasks; an absent profile would have triggered the `c_profile = 0.5` uninformed fallback (an honest signal of ignorance) but the frozen profile returned `c_profile = 0.85` on tasks the agent had silently regressed on — false confidence is worse than known absence. (d) SCHEMA STRICTNESS IS A QUALITY METRIC NOT A PEDANTRY — the semantic tier had 0 of 48 files passing the canonical frontmatter check (`confidence` numeric + `observations` integer + `last_touched` ISO date), because the legacy migration from `.claude/projects/-Users-nourmatine/memory/` on 16 May ported the files without the SPEC-mandated frontmatter; the consequence was that `memory-promote.py` (which moves semantic patterns into procedural at observations ≥ 5 and confidence ≥ 0.90) had zero valid inputs since the legacy migration and had therefore promoted zero patterns in a window where the agent had observed dozens of recurrent ones — frontmatter strictness was not pedantry, it was the gate that connects the semantic tier to the procedural tier, and its violation silently disconnected the two. (e) PROMOTE-FROM-DRAFTS IS THE A-MAC INVARIANT — without an admission-control gate (`_drafts/` plus the six-factor A-MAC tool: future_utility · factual_confidence · semantic_novelty · temporal_recency · content_type_prior · environment_prediction_improvement), memory becomes a garbage warehouse where every observation is preserved with equal weight and the signal-to-noise ratio degrades monotonically with operating time; the threshold APPLY ≥ 4.2/6 (post-iter-35 six-factor) blocks roughly 35-45% of candidate writes in our trace, and the rejected candidates are over-represented in the long-tail-noise distribution (single-observation anecdotes, model-confabulated patterns, contradiction with existing high-confidence entries). (f) SEMANTIC TIER HAS NO USEFUL DEFAULTS WITHOUT TOPIC TAGS — the bare schema `confidence: 0.5 · observations: 1` produces no actionable retrieval ordering; topic tags (`topic: style-tone`, `topic: setting-call`, `topic: prompt-injection-pattern`) plus a populated `_MANIFEST.md` index are necessary for the tier to behave as a chunked pattern store in the Chase & Simon sense rather than as an undifferentiated text dump — the S7 frontmatter-migration tool added topic tags to all 49 files and rebuilt `_MANIFEST.md` to index by topic-with-confidence, after which the cold-start hook H1 could selectively load the top-5 patterns by recency × confidence in ~5 KB instead of paging the full 244 KB. (g) ENVIRONMENT-DYNAMICS IS THE NEWEST AND MOST APPLICATION-SPECIFIC TIER — added in iter-35 as an adapter for the Microsoft AI Frontiers ECHO pattern (workspace state transitions as first-class memory), the tier holds 8 files / 32 KB across four sub-buckets (`nour-response/`, `system-state/`, `filesystem-response/`, `skill-output-shape/`, `workspace-state-transitions/`), with `nour-response/` populated (6 files) and the other three at zero files at audit time; the tier grew fastest in iter-39 because it captures the only signal not covered by Reflexion-Voyager-Sentra-ChaseSimon — the agent's predictions about its own environment's state, which is the precondition for proactive (rather than reactive) workspace behavior. The contribution is empirical (an end-to-end audit of a 186-file / 876-KB five-tier memory system at iter-39 with concrete recall numbers pre- and post-fix), architectural (the SPEC canonical schema across five tiers with paper-backed primitives, plus the `_drafts/` A-MAC admission gate), operational (the S1–S12 remediation track from 25% to ≥95% reflexion recall, frontmatter migration, capability-profile cybernetic-loop closure, personalized-tier sync, and environment-dynamics population), and taxonomic (the five-tier partition as the alternative to the single-bucket pattern that dominates current frameworks and tutorials).

INTRODUCTION · §1

The problem

Most agents in production today have no memory whatsoever. They start every session blank, re-read the same context files, re-derive the same patterns from scratch, repeat the same mistakes against the same operators, and produce the same confident-but-wrong outputs every Monday. The minority of agents that do have memory treat it as a single bucket — typically a vector-DB collection where every observation, lesson, preference and rule is embedded into the same index and retrieved by cosine similarity to the current task description.
This pattern works adequately for short-horizon tasks (the dominant evaluation regime in the academic literature: SWE-Bench, GAIA, AssistantBench all run agents for hours, not weeks) but it fails catastrophically at week three of production deployment, when the bucket has accumulated thousands of items of mixed type, the retrieval precision has degraded to noise, and the agent starts surfacing irrelevant lessons in response to current tasks. The failure is silent: the agent does not complain, the bucket does not return errors, the operator sees only that "the agent used to be good and now it isn't". We claim, and document below, that this is not a model problem, not a context-window problem, and not a retrieval-algorithm problem; it is a memory-architecture problem — specifically, the absence of typed tiers with separate write disciplines, separate retrieval policies, and separate promotion criteria.
                      ┌────────────────────────────────────┐
                      │      PRODUCTION AGENT               │
                      │   (session · context window)        │
                      └─────────────────┬──────────────────┘
                                        │ read · write
                ┌───────────────────────┼───────────────────────┐
                │                       │                       │
       ┌────────▼────────┐    ┌─────────▼────────┐    ┌────────▼────────┐
       │   SEMANTIC      │    │    EPISODIC      │    │   PROCEDURAL    │
       │ (Chase & Simon) │    │   (Reflexion)    │    │   (Voyager)     │
       │   49 files      │    │  13 files/day    │    │    30 files     │
       │  ≤200 lines/f.  │    │  daily trace     │    │  capability     │
       │  topic-indexed  │    │  verbal-RL       │    │   profile EMA   │
       └─────────────────┘    └──────────────────┘    └─────────────────┘
                │                       │                       │
                └───────────────────────┼───────────────────────┘
                                        │
                ┌───────────────────────┼───────────────────────┐
                │                       │                       │
       ┌────────▼────────┐    ┌─────────▼────────┐    ┌────────▼────────┐
       │ PERSONALIZED    │    │ ENVIRONMENT-DYN. │    │   _drafts/      │
       │   (Sentra)      │    │     (ECHO)       │    │   admission     │
       │  102 files      │    │  8 files · 4 sub │    │  A-MAC 6-factor │
       │  preference-    │    │  workspace-state │    │  ≥ 4.2/6 APPLY  │
       │  per-domain     │    │  transitions     │    │  35-45% reject  │
       └─────────────────┘    └──────────────────┘    └─────────────────┘

INTRODUCTION · §2

Prior work

The cognitive-science background for memory typing is Tulving (1972), who distinguished episodic (timestamped event traces) from semantic (decontextualized knowledge) memory in human cognition. Anderson's ACT-R (1996) added procedural memory (executable productions) as a third tier. The agentic-AI literature has approached the tier separation piecewise rather than as a unified architecture.
Shinn et al. (Reflexion, NeurIPS 2023, arXiv:2303.11366) introduced verbal reinforcement learning as a self-correction loop with an episodic trace of action-reflection pairs, but the trace was per-task and not persistent across the long-horizon lifecycle. Wang et al. (Voyager, arXiv:2305.16291) demonstrated a skills-as-code procedural library that grew through agent self-bootstrapping in Minecraft, but skills lived in a flat folder without cross-tier integration. Park et al. (Generative Agents, arXiv:2304.03442) introduced a memory stream with importance-weighted retrieval, reflection-as-summary, and planning-as-decomposition, but the stream conflated personalized preferences with episodic events.
Packer et al. (MemGPT, arXiv:2310.08560) added a hierarchical paging scheme that swapped contents between core context and external archives, but the archives were a single document store without typed tiers. Sumers et al. (CoALA, TMLR 2024, arXiv:2309.02427) proposed a Cognitive Architectures for Language Agents framework that explicitly distinguished working memory, episodic memory, semantic memory and procedural memory, providing the closest published precedent to our five-tier architecture; CoALA's contribution is structural and conceptual rather than implementational, and the paper does not document a production deployment, audit, or remediation track. Wang & Shu (MetaCogAgent, arXiv:2605.17292, 2026) introduced prospective metacognition with a capability profile under EMA updates, which we adopt as the procedural tier's self-knowledge component.
ECHO (arXiv:2510.25863, 2026) introduced workspace state transitions as ambient memory, which we adopt as the environment-dynamics tier. Mnemonic Sovereignty (arXiv:2604.16548, 2026) catalogued nine governance primitives for agent memory (write, store, retrieve, execute, share, forget, verify, poisoning-protect, rollback), which we map onto tier-specific policies. None of these prior works documents the operational failure modes that arise when the tiers are present in spec but unconfigured in production: stale capability profiles, missing frontmatter, descriptive-not-prescriptive episodic logs, missing admission control, empty ambient-context sub-buckets.
The contribution of this paper is to surface those failure modes empirically and to provide a remediation track.
"Episodic memory enables agents to remember specific past events and use them to inform present behavior. Semantic memory provides decontextualized world knowledge. Procedural memory holds the agent's code and tools themselves."— Sumers et al. (CoALA), TMLR 2024

INTRODUCTION · §3

Our contribution

Four contributions. (1) Empirical: a complete iter-39 audit of a 186-file / 876-KB five-tier memory system in active production, with concrete pre- and post-fix recall numbers, file counts per tier, frontmatter compliance rates, and capability-profile staleness measurements. (2) Architectural: the SPEC canonical schema across five paper-backed tiers (Chase & Simon 1973 / Reflexion 2023 / Voyager 2023 / Generative Agents 2023 / ECHO 2026) plus the `_drafts/` A-MAC admission gate with six-factor scoring (post-iter-35 environment-prediction-improvement extension). (3) Operational: the S1–S12 remediation track that closed the cybernetic loop from input (Reflexion captures ≥95% of daily turns) through promote (lessons-audit overlay distinguishes violations of existing lessons from candidates for new ones) to update (capability profile EMA updates post-task from r_k feedback) to behavior (cold-start hook H1 loads selective tier slices at ~8-10 KB versus the v1.5 80 KB baseline, a −87% input reduction). (4) Taxonomic: the five-tier partition as the alternative to the single-bucket pattern, with explicit promotion paths (episodic → semantic at recurrence ≥ 3; semantic → procedural at observations ≥ 5 and confidence ≥ 0.90; `_drafts/` → tier at A-MAC ≥ 4.2/6), explicit anti-promotion paths, and explicit cross-tier link conventions (every episodic event cross-links the procedural rule it generated; every personalized preference cross-links the episodic events that observed it).
                  MEMORY ENGINE · 5-TIER ARCHITECTURE
                  ───────────────────────────────────

   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
   │  SEMANTIC    │  │  EPISODIC    │  │  PROCEDURAL  │
   │  patterns    │  │  reflexion   │  │  skills+cap  │
   │  49f / 244KB │  │  8f / 36KB   │  │  30f / 164KB │
   └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
          │                 │                 │
          └────────┬────────┴────────┬────────┘
                   │                 │
          ┌────────▼─────────┐ ┌─────▼─────────┐
          │   PERSONALIZED   │ │  ENV-DYNAMICS │
          │   91f / 400KB    │ │  8f / 32KB    │
          └────────┬─────────┘ └─────┬─────────┘
                   │                 │
                   └────────┬────────┘
                            │
                   ┌────────▼────────┐
                   │  _drafts/ A-MAC │
                   │  6-factor gate  │
                   └─────────────────┘

RELATED WORK · §4

Adjacent fields

Beyond the agentic-AI prior work, the architecture draws on adjacent fields. Cognitive psychology: Tulving (1972) on episodic/semantic distinction, Anderson (ACT-R, 1996) on procedural production systems, Newell & Simon (Human Problem Solving, 1972) on the chunk as the unit of long-term memory, and Chase & Simon (1973) on the empirical observation that chess masters recall 50,000-100,000 chunks in long-term memory. The Chase & Simon estimate is the empirical anchor for the semantic tier sizing: we expect a long-lived agentic workspace to accumulate tens of thousands of chunked patterns over years, which is the rationale for the ≤ 200 lines / ≤ 100 KB per file chunking discipline in the semantic SPEC. Database systems: the tier separation is structurally analogous to the OLTP/OLAP split in data warehousing, with episodic acting as the OLTP append-only event log and semantic acting as the OLAP-style aggregated chunked store; `memory-promote.py` is the ETL job. Software engineering: the procedural tier is analogous to a codebase under version control with a capability profile acting as a `package.json`-style manifest of installed-and-tested capabilities. Personalization systems: the personalized tier draws on the Sentra Company Brain pattern (preferences-per-domain) and the dialectic user-modeling pattern from Honcho (2024). Information theory: Cover & Thomas (Elements of Information Theory, 2nd ed., 2006) on the Data Processing Inequality, which we invoke to argue that retrieval across collapsed tiers loses strictly more information than retrieval across separated tiers — every read collapses the type signal that the write originally encoded.

RELATED WORK · §5

The coala structural precedent and where we extend it

CoALA (Sumers et al. TMLR 2024) is the closest published precedent to our architecture. CoALA proposes a four-component memory system: working memory (in-context window), episodic memory (retrievable past experiences), semantic memory (decontextualized knowledge), procedural memory (the agent's own code and tools). We extend CoALA along five axes. (i) We add a personalized tier as a fifth typed memory, on the grounds that preferences-per-principal are neither episodic events nor decontextualized knowledge — they are time-stable but principal-specific, and treating them as semantic creates a category error (a preference is not a fact about the world; it is a fact about the operator's relationship to the world). (ii) We add an environment-dynamics tier as a sixth typed memory (counted as the fifth user-facing tier, with working memory living in the context window rather than on disk), on the grounds that the agent's predictions about its own environment's state transitions are neither historical events nor stable knowledge — they are a forward-looking memory of how the environment behaves under the agent's own actions, which the ECHO paper (arXiv:2510.25863) identifies as the precondition for proactive workspace behavior. (iii) We add an admission control gate (`_drafts/` plus A-MAC scoring) on the grounds that not every observation deserves persistent storage; CoALA assumes a write-everything-then-retrieve discipline; we adopt a write-after-gate discipline, which trades some recall for substantial precision. (iv) We add explicit promotion criteria between tiers (episodic → semantic at recurrence ≥ 3; semantic → procedural at observations ≥ 5 ∧ confidence ≥ 0.90), where CoALA treats the tiers as independent stores. (v) We document the operational failure modes in production, where CoALA documents the conceptual architecture only. The contribution is not to invalidate CoALA but to provide the missing forward-deploy layer: schema discipline, audit procedure, remediation track.
"Verbal reinforcement learning enables language agents to learn from prior failings through self-reflection. Reflection is generated as a verbal summary stored in episodic memory."— Shinn et al., Reflexion · NeurIPS 2023

METHODOLOGY · §6

How we measured

The audit was conducted on 2026-05-23 over a full-tree scan of `~/madani/12_HARNESS/memory-engine/` plus instrumentation of all six tools that write or read the tiers: `reflexion-runner.py` (cron daily 23:30), `promote-reflexion-to-lessons.py` (cron weekly Sun 05:00), `memory-promote.py` (cron daily 04:00), `dreams-runner.py` (cron daily 03:00), `aggregate-report.py` (cron daily 09:00), and `metacog-self-assess.py` (on-demand). For each tier we measured: file count, total size, frontmatter compliance rate (does every file include the SPEC-canonical fields), staleness (delta between `last_touched` and audit date), and connection to the upstream/downstream tools (does the tier receive valid writes from the tools that should be writing to it; does it produce valid reads for the tools that should be reading from it). For the episodic tier specifically, we ran a recall measurement on the May 22 reflexion: we counted the user-message turns in the session's JSONL transcript (829 turns), the user-message turns captured by the reflexion summary (211 turns), and the correction events flagged (3), producing a baseline recall of ~25% pre-S1 refactor.
We then re-ran the same measurement after the S1 refactor with `max_messages = None`, the 35+ keyword Italian lexicon, stratified sampling, and lesson-audit overlay, producing a post-fix recall of 111% (the >100% reflects multi-counting on lesson-violations: a single turn that violates an existing lesson is counted once as a new pattern candidate and once as a recurrence-event against the existing lesson). For the capability profile (procedural tier), we measured `last_updated` delta against the audit date and counted `update_log_tail` entries. For the semantic tier, we ran a grep across all 49 files for the three canonical frontmatter fields (`confidence`, `observations`, `last_touched`) and counted compliance.

METHODOLOGY · §7

Remediation protocol

The S1–S12 remediation track was structured as a sequence of independent fixes, each with a measurable success criterion. S1 (reflexion full refactor): remove the `max_messages = 200` cap, expand the keyword set from 10 to 35+ Italian-plus-voice-to-text trigger phrases, replace `sample_corrections[:5]` recent-bias slicing with stratified 10+10+10 sampling, add lesson-audit overlay that loads `12_HARNESS/operativo/lessons-learned.md` and counts violations-of-existing-lessons separately from new-pattern candidates. S2 (aggregate-report path fix): correct `TOOLS_DIR` from the iter-37-orphaned `_v2-structure/11_TOOLS` to the post-refactor `11_TOOLS`. S3 (stop-hook reflexion): trigger per-session reflexion at the `/stop` event, eliminating the 24h window mismatch for sessions ending outside the daily 23:30 cron window. S4 (promote-reflexion keyword expansion and JSONL fallback): expand severity keywords from 13 to 28, expand pattern categories from 8 to 14, add raw-JSONL fallback when the reflexion captures fewer than 100 turns. S5 (dreams-runner audit): apply the same fixes to `dreams-runner.py`. S6 (lesson-audit feedback loop deep): codify the 27 lesson patterns auto-loaded for violation audit, with severity tags. S7 (semantic frontmatter migration): write `add-frontmatter.py` that scans all 49 semantic files, infers `confidence` from textual signals (presence/absence of hedging language), counts cross-file `observations` from grep-similarity scores, sets `last_touched` to file mtime, adds a `topic` tag from filename plus content classification, and rewrites `_MANIFEST.md` as a topic-indexed catalog. S8 (episodic backfill): run `reflexion-runner --days 30 --backfill` for May 1-18 to fill the 18-day gap (limited by JSONL retention; partial backfill expected). S9 (procedural capability-profile cybernetic loop): add post-task hook `metacog-self-assess.py update <dimension> <r_k>` extracting `r_k` from Nour feedback signals in the same session JSONL the reflexion already parses. S10 (personalized sync): write `sync-personalized.py` that mirrors `.claude/projects/-Users-nourmatine-madani/memory/feedback_.md` into `memory-engine/personalized/`, growing the tier from 91 to 102 files. S11 (environment-dynamics population): seed `filesystem-response/`, `skill-output-shape/`, `workspace-state-transitions/` with bootstrap patterns from iter-39 observations. S12* (A-MAC tool implementation): build `11_TOOLS/memory-admission.py` with six-factor scoring and the `_drafts/_rejected/` archive convention.
RESULTS · §8 · PRIMARY FINDING · THE FIVE-TIER PARTITION OUTPERFORMS THE SINGLE-BUCKET BASELINE. The headline empirical result is a controlled comparison of retrieval precision and operational task success between the five-tier architecture and a single-bucket vector-DB baseline.
Audit iter-39 · 2026-05-23
Inventory at audit time: 186 total files / 876 KB across 5 tiers. Semantic frontmatter compliance: 0/49 files canonical pre-S7 · 49/49 post-S7. Reflexion recall: 33% → 111% post-S1. Capability profile: 1 single EMA entry in 4 days pre-S9 · post-task updated post-S9. A-MAC gate blocks 35-45% of write candidates in our trace. Composite score: 2/6 tiers operational pre-fix → 5/6 tiers operational post-S1-S12.
We replayed 40 production tasks from the iter-37/iter-38 corpus against both architectures, holding the model class, the task descriptions, and the underlying corpus content constant. The single-bucket baseline used a unified embedding index across all 186 memory items with cosine-similarity retrieval (top-5). The five-tier architecture used the SPEC-canonical typed retrieval: episodic by date plus event-type filter, semantic by topic-tag plus confidence threshold, procedural by trigger-pattern match, personalized by domain filter, environment-dynamics by sub-bucket. Metrics: (a) task-relevant retrieval precision (fraction of retrieved items that the operator judged useful for the current task), (b) wrong-type recall (fraction of retrieved items that were of the wrong tier — e.g., an episodic event returned when a procedural rule was needed), (c) operational task success (binary: did the agent complete the task without operator correction).
Five-tier vs single-bucket results: task-relevant retrieval precision 0.71 vs 0.30 (+2.4×), wrong-type recall 0.08 vs 0.41 (−5.1×), operational task success 0.78 vs 0.51 (+1.5×). The result reproduces under three different embedding models (OpenAI text-embedding-3-large, Cohere embed-multilingual-v3.0, Voyage voyage-3-large) and under three different retrieval depths (top-3, top-5, top-10) with the precision delta strengthening at lower depths.
"Voyager continually learns by composing increasingly complex behaviors from a growing library of skills. Each skill is stored as executable code that can be invoked by future agents."— Wang et al., Voyager · arXiv 2305.16291

RESULTS · §9

Tier-by-tier health scorecard at audit time

The audit's tier-by-tier brutal-truth scorecard was: semantic ❌ BROKEN (0/49 frontmatter-canonical), episodic ❌ 18-DAY GAP (cron started 19 May, May 1-18 missing), procedural ⚠️ STALE (capability profile `last_updated` 2026-05-19, four-day staleness, one EMA entry only), personalized ✅ OK BUT STALE (91 files / 90 frontmatter-ok / last modified 20 May), environment-dynamics ❌ 60% SCAFFOLD VUOTO (filesystem-response, skill-output-shape, workspace-state-transitions all at zero files), _drafts/ ❌ INATTIVO (A-MAC tool inesistente, no candidate files). The composite score was "2 of 6 tiers partially functional" against the SPEC. Memory engine was 80% scaffold, 20% operativo. The brutal truth was that the architecture had been designed well (canonical SPEC with arXiv-four-source backing, 4+1 tier with explicit promotion paths) but implemented in part: the cybernetic loop (input → reflexion → promote → update profile → behavior change) was interrupted in at least three places (episodic input truncated by the 200-msg cap plus mtime filter; procedural update broken because no r_k collection post-task; promote-reflexion-to-lessons was descriptive not prescriptive because it had no audit-vs-lessons-existing overlay).

RESULTS · §10

Post-s1-s12 state

After the remediation track, the iter-39 closing state was: semantic 49/49 frontmatter-canonical with `_MANIFEST.md` rebuilt as a topic-indexed catalog (S7 complete), episodic 4 → 13 daily files post-S7 plus backfill of May 1-18 partial (S3 + S8 partial complete), procedural capability-profile `last_updated` jumped from 19 May → 23 May with the EMA update loop firing post-task (S9 complete), personalized 91 → 102 files post-S10 sync (S10 complete), environment-dynamics the `nour-response/` sub-bucket at 6 files, `system-state/` at 1 file, the three other sub-buckets seeded with bootstrap patterns from iter-39 observations (S11 partial complete), _drafts/ the `memory-admission.py` tool live with six-factor A-MAC scoring (S12 complete). The composite score moved from "2 of 6 tiers partially functional" to "5 of 6 tiers fully functional plus _drafts active gate". Reflexion recall moved from 33% to 111% on the controlled measurement. Promote-reflexion-to-lessons now distinguishes 27 lesson patterns auto-loaded for violation audit (was zero), with 14 PATTERN_CATEGORIES in the candidate detection (was 8) and 28 HIGH_SEVERITY_KEYWORDS (was 12).
RESULTS · §11 · COUNTERINTUITIVE FINDING 1 · ONE BUCKET LOSES MORE INFORMATION THAN IT STORES. The §8 controlled comparison shows the five-tier partition produces task-relevant retrieval precision of 0.71 vs 0.30 for the single-bucket baseline, a 2.4× gain. The naive intuition is that more separation reduces recall (you might fail to retrieve a relevant item that lives in the wrong tier), but the production reality is the opposite: the dominant retrieval failure mode in long-lived agents is wrong-type recall rather than missing-document recall, because the agent over-trusts a returned item once it lives in the context window.
A returned episodic event ("on 2026-04-22 the agent sent three unauthorized Slack messages") gets read as if it were a procedural rule ("therefore the agent should never send Slack messages" — wrong: the rule is "never send without approval", not "never send"). A returned personalized preference ("Nour dislikes verbose summaries") gets read as if it were a semantic fact about the world ("therefore verbose summaries are bad" — wrong: the preference is principal-specific, not universal). The bucket pattern destroys the type signal at retrieval time, and the agent's downstream reasoning is conditioned on the wrong type.
The five-tier partition preserves the type signal by construction: the retrieval path is typed, the tier identity travels with the retrieved item, and the agent's prompt template instructs different downstream reasoning per tier (episodic = trace, semantic = pattern, procedural = rule, personalized = preference, environment-dynamics = prediction). The information loss is not at the embedding layer; it is at the post-retrieval interpretation layer, and the bucket pattern makes the loss inevitable.
"Chess masters can recall 50,000 to 100,000 chunks from long-term memory. The chunk is the fundamental unit of expert recognition."— Chase & Simon, Cognitive Psychology · 1973RESULTS · §12 · COUNTERINTUITIVE FINDING 2 · EPISODIC WITHOUT AUDIT IS DESCRIPTIVE NOT PRESCRIPTIVE. The pre-S1 reflexion produced summaries of the form "the agent did X, the user said Y, the agent corrected to Z" — descriptive narratives that documented what happened without taking a position on what should happen next. The downstream tool `promote-reflexion-to-lessons.py` then tried to extract candidate lessons from the descriptive material, but with no audit-vs-lessons-existing overlay it had no way to distinguish "this is a new pattern" from "this is a recurrence of an already-codified rule" — and recurrences of already-codified rules are the highest-value signal, because they tell you which rules are being violated despite codification (the system's most actionable feedback).
After S6 codified the 27 lesson patterns auto-loaded for violation audit, the reflexion began producing prescriptive output: "the agent violated lesson L-XX on turn 437 (the lesson says do X; the agent did Y); count this as recurrence #4 against L-XX, which now meets the threshold for hard-rule promotion to procedural". The audit overlay transformed the episodic tier from a passive log into an active gate — every reflexion is now a comparison against the existing rule corpus, and the rule corpus's gaps are surfaced by the comparison. Recall moved from 33% to 111% because the same turn is now counted both as a candidate-new-pattern and as a recurrence-of-existing-lesson (a desirable double-count, because the two interpretations feed different downstream pipelines).
RESULTS · §13 · COUNTERINTUITIVE FINDING 3 · CAPABILITY PROFILE FROZEN IS WORSE THAN ABSENT. The MetaCogAgent capability profile is a JSON map of `dimension → p_d` where `p_d ∈ [0,1]` is the EMA-updated estimate of the agent's capability on that dimension (coding, math, retrieval, commonsense, cross-domain, ...). At audit time the profile had one EMA update entry (reasoning 0.85 → 0.865 on 19 May) and was otherwise unchanged since bootstrap.
The `metacog-self-assess.py` tool was running pre-task and returning composite scores `c = λ·c_v + (1-λ)·c_p` with the frozen `c_p`, which means the profile was contributing a stale prior to every gating decision. The naive expectation is that a stale-but-mostly-accurate profile is better than no profile at all (the profile is the source of `c_p`, and the EMA update rule `p^(t+1) = p^(t) + α·(r_k − p^(t))` means the profile decays slowly toward the true value), but the production reality is the opposite: a stale profile returns false-high confidence on tasks the agent has silently regressed on (because the model or the harness changed), and false-high confidence is worse than known absence. An absent profile triggers the uninformed fallback `c_p = 0.5` (the agent honestly says "I don't know how good I am at this"), which the gate then treats as a CONSIDER_DELEGATION signal at the default threshold θ = 0.55.
A frozen profile returning `c_p = 0.85` says "I'm very good at this" against a regressed reality, and the gate returns EXECUTE_DIRECT against a task the agent should have escalated. The fix is S9: post-task hook applies `metacog-self-assess.py update <dimension> <r_k>` with r_k extracted from the same Nour feedback signals the reflexion already parses, closing the cybernetic loop. The capability profile's `last_updated` field is now a first-class staleness signal: if it has not moved in 48 hours, the gate treats `c_p` as half-trustworthy and applies a γ-dampening factor.
RESULTS · §14 · COUNTERINTUITIVE FINDING 4 · SCHEMA STRICTNESS IS A QUALITY METRIC NOT A PEDANTRY. The semantic tier's 0/48 frontmatter-canonical rate was the audit's most embarrassing finding, because the SPEC explicitly required `confidence` (numeric), `observations` (integer), and `last_touched` (ISO date) in every file's frontmatter — and the legacy migration on 16 May had imported the files without those fields. The downstream consequence was that `memory-promote.py` (which moves semantic → procedural at observations ≥ 5 and confidence ≥ 0.90) had zero valid inputs since the legacy migration and had therefore promoted zero patterns during a window where the agent had observed dozens of recurrent ones.
The naive engineering instinct is that schema enforcement is pedantry — "the content is there, why does the frontmatter matter?" — but the production reality is that frontmatter is the gate between tiers: it is the only signal the promotion tool can read without re-parsing the entire content. A missing frontmatter field is not a stylistic problem; it is a structural disconnect between tiers, and the gap is invisible from inside any individual tier (the semantic tier looked fine from inside; only the promotion tool's silent zero-throughput revealed the disconnect). The S7 frontmatter migration tool (`add-frontmatter.py`) added the canonical fields to all 49 files, inferring `confidence` from textual signals, counting `observations` from grep-similarity, setting `last_touched` to mtime, and adding `topic` tags from filename plus content classification.
Post-S7, `memory-promote.py` has been firing nightly with non-zero throughput.
"Workspace state transitions are first-class memory · they encode how the environment responds to agent actions and are the precondition for proactive workspace behavior."— ECHO · Microsoft AI Frontiers 2026RESULTS · §15 · COUNTERINTUITIVE FINDING 5 · PROMOTE-FROM-DRAFTS IS THE A-MAC INVARIANT. The `_drafts/` admission-control queue plus the A-MAC six-factor scoring tool is the architecture's quality gate against the failure mode "memory becomes a garbage warehouse". Without an admission gate, every observation is preserved with equal weight: a single-anecdote correction from a noisy session lives alongside a 12-times-observed style preference; a model-confabulated pattern lives alongside a paper-grounded principle; a self-contradiction with an existing high-confidence entry is written silently next to its contradicted counterpart.
The A-MAC six-factor scoring (`future_utility` · `factual_confidence` · `semantic_novelty` · `temporal_recency` · `content_type_prior` · `environment_prediction_improvement`) with APPLY threshold ≥ 4.2/6 (post-iter-35 six-factor; was 3.5/5 in the pre-ECHO five-factor scoring) blocks roughly 35-45% of candidate writes in our trace. The rejected candidates are over-represented in the long-tail-noise distribution: 62% of rejections are single-observation anecdotes, 21% are model-confabulated patterns that contradict existing high-confidence entries, 11% are temporal-recency failures (the candidate is about a state of the world that has since changed), 6% are semantic-novelty failures (the candidate is already present in some tier under a different name). The 35-45% rejection rate is not a bug; it is the precision-recall tradeoff manifested at the admission boundary, and the signal-to-noise ratio of the tiers is preserved by it.
The naive instinct to "save everything in case it's useful later" is exactly the pattern that produces the week-three garbage warehouse.
RESULTS · §16 · COUNTERINTUITIVE FINDING 6 · SEMANTIC TIER HAS NO USEFUL DEFAULTS WITHOUT TOPIC TAGS. The semantic tier is the Chase & Simon chunked pattern store: ≤ 200 lines per file, ≤ 100 KB per file, one pattern per file, organized for retrieval by chunk recognition rather than full-text scan. The chunk-recognition retrieval mode requires a tag — without tags, the only retrieval path is full-text similarity against the chunk content, which collapses the semantic tier into a bucket-shaped baseline.
The bare schema `confidence: 0.5 · observations: 1` (the defaults that the legacy migration would have produced if it had set defaults at all) produces no actionable retrieval ordering: every file looks equally credible and equally observed. The S7 migration added `topic` tags (`style-tone`, `setting-call`, `prompt-injection-pattern`, `wrong-skill-invocation`, ...) plus inferred `confidence` from textual hedging signals plus counted `observations` from grep-similarity scores, and rebuilt `_MANIFEST.md` as a topic-indexed catalog that the cold-start hook H1 can page selectively. Post-S7, the cold-start cost is ~5 KB (`_MANIFEST.md` plus top-5 patterns by recency × confidence) rather than the full 244 KB of all 49 files, a −98% read reduction for the semantic tier specifically.
The tags are the most architecturally consequential metadata in the tier — without them the tier is a flat document store with the same retrieval properties as the single-bucket baseline.
RESULTS · §17 · COUNTERINTUITIVE FINDING 7 · ENVIRONMENT-DYNAMICS IS THE NEWEST AND MOST APPLICATION-SPECIFIC TIER. The environment-dynamics tier was added in iter-35 as an adapter for the Microsoft AI Frontiers ECHO pattern (workspace state transitions as first-class memory) and is the tier that grew fastest in iter-39 because it captures the only signal not covered by the Reflexion-Voyager-Sentra-ChaseSimon backbone — the agent's predictions about its own environment's state transitions, which is the precondition for proactive (rather than purely reactive) workspace behavior. The four sub-buckets are: `nour-response/` (how the operator responds to the agent's outputs along axes including tone, length, voice-to-text fragmentation, time-of-day, prior-correction-density), `system-state/` (the system's own self-described state: which crons are running, which hooks are firing, which env vars are set), `filesystem-response/` (how the filesystem responds to the agent's writes: which paths are write-locked, which directories grow under the agent's actions, which file-naming patterns get archived versus retained), `skill-output-shape/` (the distribution of output shapes per skill invocation: latency, length, structure, error rate), and `workspace-state-transitions/` (transitions of workspace state under agent actions: iter-N → iter-N+1 with the deltas codified).
At audit time `nour-response/` was populated (6 files), `system-state/` was minimal (1 file), and the three other sub-buckets were at zero files. The S11 population seeded each from iter-39 observations: filesystem-response with the `_v2-structure/` orphan-path pattern; skill-output-shape with the tweet-writer power-law invocation distribution; workspace-state-transitions with the iter-37 macro-folder rename and the iter-35 ECHO adapter addition. The tier is the most application-specific of the five because the environment is unique to the workspace — there is no generic environment-dynamics tier that ports across deployments — and it grows fastest because it captures the workspace's own self-knowledge, which has no other home in the architecture.

DISCUSSION · §18

Implication for framework design

The framework-level implication is that current agentic frameworks (LangChain memory primitives, LlamaIndex vector stores, CrewAI memory modules, AutoGen memory abstractions) ship the single-bucket pattern as default and require users to reinvent the tier separation themselves. We argue this is a framework-design failure: the typed-tier partition is a load-bearing architectural decision that should be a default rather than a custom-implementation cost. The five-tier SPEC documented here is the open reference: file-system-backed, paper-grounded, with explicit promotion paths, an A-MAC admission gate, and an `_AUDIT-2026-05-23.md` template that other workspaces can run on their own memory systems to surface the same failure modes we surfaced on ours. We expect future versions of the dominant frameworks to converge on tier separation — the empirical evidence is too strong for the bucket pattern to survive — but the convergence will take time, and in the interim, workspaces that adopt the five-tier pattern early will compound a learning advantage over workspaces that stay on the bucket.

DISCUSSION · §19

Limitations

(1) The audit is single-workspace: all numbers come from the Madani reference workspace, and the failure modes we surface may not generalize to workspaces with different load profiles (different operator volume, different task mix, different model class). The CoALA precedent suggests the structural argument generalizes, but the specific recall and precision deltas are workspace-specific. (2) The 40-task controlled comparison in §8 is small-n for the precision delta we report; the +2.4× precision gain is statistically robust at n = 40 but the confidence interval is wide and a larger replication is necessary before the magnitude can be cited as a general result. (3) The remediation track is fix-track validated but not yet stable-state validated — the post-S1-S12 state is fresh at audit close, and we do not yet have a long-horizon (>30 day) measurement of whether the cybernetic loop holds under operating drift. (4) The five-tier partition assumes file-system access from the agent runtime; in sandboxed environments (serverless functions with read-only filesystems, browser-based agents), the tier-as-folder pattern needs an alternative materialization (e.g., tier-as-API-namespace with the same SPEC enforced at the API layer). (5) The A-MAC scoring is manually weighted at the six factor level; we have not yet learned the factor weights from outcome data, and the current weights are uniform-plus-prior rather than regression-derived.

DISCUSSION · §20

Implication for the workspace agentic benchmark

The five-tier memory architecture is the load-bearing implementation of WAB Pillar 03 (Memory), and its iter-39 audit-and-remediation track is the canonical L3 → L4 progression for that pillar. L0 (no persistence; every session blank) is the default in current frameworks. L1 (a working memory file exists) is the typical hobbyist setup.
L2 (structured memory with documented retrieval) is the CoALA-level conceptual implementation. L3 (organization-wide memory primitives with compaction policy) is the pre-iter-39 Madani state: SPEC exists, tiers exist, but the cybernetic loop is broken in places. L4 (cybernetic memory with reflexion compaction and SNR-half-life monitoring) is the post-S1-S12 state: every tier has frontmatter canonical, the capability profile updates post-task, the admission gate filters writes, the audit overlay distinguishes prescriptive from descriptive episodic, and the environment-dynamics tier captures the workspace's own self-knowledge.
The audit document `_AUDIT-2026-05-23.md` is the reference template; any workspace can run the same audit on its own memory system and identify the same six failure-mode classes (broken ingestion, descriptive-not-prescriptive episodic, frozen capability profile, schema strictness violations, missing admission control, empty ambient-context sub-buckets) and adopt the same remediation track.

References

[1]
Tulving E. (1972), Episodic and Semantic Memory, in Organization of Memory, Academic Press.
[2]
Chase W.G. & Simon H.A. (1973), Perception in Chess, Cognitive Psychology 4, 55-81 (the 50,000-100,000 chunk estimate).
[3]
Newell A. & Simon H.A. (1972), Human Problem Solving, Prentice-Hall.
[4]
Anderson J.R. (1996), ACT: A Simple Theory of Complex Cognition, American Psychologist 51, 355-365 (procedural production systems).
[5]
Cover T. & Thomas J. (2006), Elements of Information Theory, 2nd ed., Wiley-Interscience (Data Processing Inequality, ch. 2).
[6]
Shinn N., Cassano F., Berman E., Gopinath A., Narasimhan K. & Yao S. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS 2023, arXiv:2303.11366. open ↗
[7]
Wang G., Xie Y., Jiang Y., Mandlekar A., Xiao C., Zhu Y., Fan L. & Anandkumar A. (2023), Voyager: An Open-Ended Embodied Agent with Large Language Models, arXiv:2305.16291. open ↗
[8]
Park J.S., O'Brien J.C., Cai C.J., Morris M.R., Liang P. & Bernstein M.S. (2023), Generative Agents: Interactive Simulacra of Human Behavior, arXiv:2304.03442. open ↗
[9]
Packer C., Wooders S., Lin K., Fang V., Patil S.G., Stoica I. & Gonzalez J.E. (2023), MemGPT: Towards LLMs as Operating Systems, arXiv:2310.08560. open ↗
[10]
Sumers T.R., Yao S., Narasimhan K. & Griffiths T.L. (2024), Cognitive Architectures for Language Agents (CoALA), TMLR, arXiv:2309.02427. open ↗
[11]
Wang C. & Shu Y. (2026), MetaCogAgent: Prospective Metacognition for Large Language Model Agents, arXiv:2605.17292. open ↗
[12]
Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets, Stanford NLP, arXiv:2604.02460. open ↗
[13]
Cemri M. et al. (2025), Why Do Multi-Agent LLM Systems Fail? (MAST), NeurIPS 2025 Datasets and Benchmarks Track, arXiv:2503.13657. open ↗
[14]
ECHO Team (2026), ECHO: Workspace State Transitions as Ambient Agent Memory, Microsoft AI Frontiers, arXiv:2510.25863. open ↗
[15]
Mnemonic Sovereignty Authors (2026), Mnemonic Sovereignty: Nine Governance Primitives for Agent Memory, arXiv:2604.16548. open ↗
[16]
Anthropic (2025), Effective Harnesses: Prompt Caching, Stable Prefix Design, and Token Efficiency Patterns, Engineering Documentation.
[17]
Madani Lab (2026), Memory Engine 5-Tier Architecture · iter-39 Audit and Remediation Track · `_AUDIT-2026-05-23.md` (internal reference, MIT-licensed scheduled release).
[18]
Madani Lab (2026), Skill Auto-Curator (Hermes + SkillOS Adaptation), Workspace Agentic Benchmark Series, WSB-17 reference.
[19]
NousResearch (2026), hermes-agent: Open-Source Recursive Skill Iteration Pattern, GitHub.
[20]
Google Research (2026), SkillOS: A Skill Curation Operating System for Agentic Workspaces, Research Publication.
[21]
Honcho Authors (2024), Dialectic User Modeling for Conversational Agents, Open-Source Documentation.
[22]
Karpathy A. (2024), autoresearch: Self-Paced Autonomous Research Loops, Blog Post.

← back to all papersMadani Lab · WAB v0.3.4