WSB-09 · 2026-05-20

Signal-to-Noise in Long-Lived Agents: A 6-Month Empirical Study of Context Decay

1.2M agent turns · 340M tokens · SNR half-life 340 turns at baseline · three interventions compound multiplicatively to 950-turn half-life.

Madani Lab

signal-to-noise · context-decay · long-lived · memory · reflexion · longitudinal-study

Abstract

We report a 6-month longitudinal instrumented study of signal-to-noise dynamics in long-lived production agents at the Madani workspace, covering 1.2 million agent turns and approximately 340 million tokens across 8 production departments. Long-lived agents (those that persist across sessions, accumulate experience over weeks and months, and operate against a slowly-evolving knowledge base) are increasingly the dominant deployment pattern in enterprise agentic workloads, yet most published research is conducted on short-horizon tasks: a single benchmark, a single session, a single context window. The empirical question of what happens to an agent's effective context quality over thousands of turns remains under-studied. This paper closes the gap with what is, to our knowledge, the first published longitudinal SNR-decay measurement at production scale, derives an exponential-decay model parameterized by turn count and memory store size, and identifies three architectural interventions whose combined effect extends the SNR half-life by 2.8× over the baseline. We report seven counterintuitive findings, including:

  1. (d)
    SNR decay is approximately log-linear in memory store size below 50K records and saturates above: the 50K-record threshold marks a regime change requiring chunked-by-topic memory partitioning beyond compaction
  2. (f)
    Production agents get worse after a month but engineers blame the model: across 5 enterprise audits we conducted in 2025-2026 where teams reported "the model got worse", root-cause analysis identified SNR collapse rather than model regression in 5/5 cases

INTRODUCTION · §1

The long-lived agent deployment pattern

The dominant production deployment pattern at Madani Lab and at peer enterprises has shifted between 2023 and 2026 from short-horizon single-session agents (a 30-minute chat session, a one-shot task completion) toward long-lived persistent agents (a multi-month customer-service agent operating against an evolving knowledge base, a multi-week research agent accumulating reflections, a multi-quarter delivery-tracking agent persisting across team handoffs). The shift reflects a maturity gradient: short-horizon deployments demonstrate capability, but long-lived deployments capture business value. Yet the literature reporting agent behavior has not shifted to match.

The dominant benchmark suites (AgentBench, MAS-Bench, ToolBench) are short-horizon; the dominant academic papers report single-session results; the dominant practitioner blogs emphasize hello-world examples. The empirical gap — what does an agent do at turn 10,000 that it doesn't do at turn 100 — is large and under-instrumented.

INTRODUCTION · §2

The signal-to-noise question

The specific question we focus on is signal-to-noise ratio (SNR) of the working context. As an agent accumulates context across turns (retrieved memories, prior tool outputs, prior reasoning traces, summaries), some of the accumulated content is information the agent uses to make its next decision (signal), and some is content the agent passes through without using (noise). The ratio is the operational health metric for long-lived agents: high SNR means each turn's context is decision-useful; low SNR means the agent is wading through irrelevant content to find relevant content, which degrades both quality and latency.

The question is empirical: how does this ratio evolve over thousands of turns? Does it decay? At what rate? Are there interventions that arrest the decay? This paper answers these questions with a longitudinal production-scale dataset.

"Workspace entropy increases monotonically without curation · stale files accumulate · the agent's reading list dilutes · the signal-to-noise ratio decays on a measurable half-life."
Madani Lab · SNR audit 2026

INTRODUCTION · §3

The WSB-04 connection

WSB-04 introduced α as a per-turn information-theoretic master variable (Q_context × Q_decision, the product of context quality and decision quality). SNR is the long-run temporal dynamic of α: optimize α per turn (salience-weighted retrieval, prompt pruning, structured tool outputs), and apply the three temporal interventions we describe in this paper (compaction, salience retrieval, re-grounding) to keep α from decaying as the agent ages. The workspace configuration that does both sustains agent quality indefinitely; the workspace that does neither produces an agent that mysteriously gets worse after a month. The integrated picture frames α as the per-turn variable and SNR as the temporal dynamic; the present paper instruments and measures the dynamic at scale.

INTRODUCTION · §4

What this paper adds

We make four contributions. (1) EMPIRICAL: a 1.2M-turn instrumented dataset measuring SNR over a 6-month window at production scale, the first publication of such a dataset to our knowledge. (2) FORMAL: an exponential-decay model SNR(t, M) = S0 · exp(-ln 2 · t/τ(M)), where τ(M) is the half-life in turns as a function of memory store size M, with empirical fits below 50K records and a saturation regime above. (3) ARCHITECTURAL: three interventions (memory compaction, salience-weighted retrieval, periodic re-grounding) characterized individually and in combination, with quantified compound effect. (4) DIAGNOSTIC: a 5-enterprise case-study of "the model got worse" complaints whose actual root cause was SNR collapse, a generalizable finding for engineering teams who think they have a model problem when they have a context problem.

       WORKSPACE SNR · half-life decay model
       ─────────────────────────────────────

   SNR(t) = SNR_0 · exp(-ln 2 · t / τ)     (τ = half-life)

   ┌─────────────────────────────────────────┐
   │  ▲ SNR                                  │
   │  │●                                     │
   │  │ ●●                                   │
   │  │   ●●●                                │
   │  │      ●●●●  ← without curation        │
   │  │          ●●●●●●●                     │
   │  │                ●●●●●●●●●●●●●●        │
   │  ├──────────────────────────────────▶ t │
   │  0     τ      2τ      3τ     4τ         │
   │                                         │
   │  τ ≈ 21 days for typical workspace      │
   └─────────────────────────────────────────┘

   curation interventions reset SNR_0
   → Hermes-style auto-stale detector

RELATED WORK · §5

Context management in large language models

The classical literature on context management for transformer-based models focuses on intra-context attention patterns: Liu et al. (2024) demonstrated lost-in-the-middle effects in long contexts (model attention is non-uniform across position); Beltagy et al. (2020, Longformer) introduced sparse attention patterns to handle long contexts; Tay et al. (2021, Long Range Arena) benchmarked architectures for long-context capacity. These works treat context as a single-pass phenomenon: the model receives a context, processes it, produces output. Our work sits at the workspace layer above this: the agent accumulates context across thousands of single-pass invocations, and the question is what happens to the multi-pass accumulated context. The literature on this multi-pass perspective is much thinner.

RELATED WORK · §6

Reflexion and verbal feedback

Shinn et al. (NeurIPS 2023) introduced Reflexion: an LLM agent that reflects on its own past performance in natural language, stores the reflections as memory, and consults them in future tasks. The Reflexion paper validated the approach on academic benchmarks (HotpotQA, HumanEval, AlfWorld) with task horizons of hours to days. The implicit Reflexion architecture is memory compaction (the reflections are compacted summaries of prior trial outcomes), which is one of the three interventions we measure here.

The Reflexion paper did not measure SNR explicitly; it measured task success rate. Our work measures the SNR signal that underpins the Reflexion task-success improvement, providing a mechanistic explanation for why Reflexion works.

RELATED WORK · §7

Memory-augmented language agents

Generative Agents (Park et al., UIST 2023) introduced reflection and importance-weighted memory for interactive simulacra agents in social simulation; the architecture inspires our salience-weighted retrieval intervention. Cognitive Architectures for Language Agents (Sumers et al., TMLR 2024) surveys the memory-architecture design space without prescribing specific salience or compaction strategies. The Madani Operating Policy for Memory (WAB Pillar 03) integrates these into a production-grade reference architecture documented elsewhere; this paper provides the empirical justification.

RELATED WORK · §8

Long-context benchmarks

RULER (Hsieh et al., 2024), LOFT (Lee et al., 2024), and BABILong (Kuratov et al., 2024) probe long-context capacity at the single-pass level (1M+ token contexts). These benchmarks measure model behavior on isolated long contexts; they do not measure multi-turn long-horizon accumulation. Our 1.2M-turn longitudinal study is complementary to these single-pass benchmarks: the benchmarks measure what the model can do with a long context; we measure what happens to context quality over many turns.

METHOD · §9

Instrumentation

We modified the Madani agent runtime to log structured telemetry for every agent turn: turn ID, parent session ID, context tokens IN (the agent's input prompt), context tokens OUT (the agent's output), retrieval calls (which memories were retrieved), memory writes (which new memories were written), tool calls (which tools were invoked), tool outputs, and a final task-outcome label (success, partial-success, failure, abandoned). Context tokens were further classified by source: (a) immediate task instruction, (b) immediate tool outputs from this turn's tool calls, (c) retrieved memory (past), (d) compacted summaries (Reflexion-style), (e) system prompt and tool definitions (mostly stable). The classification is critical for SNR analysis because we need to identify the FRACTION of context tokens that contributed to the agent's output, separated by source.
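A minimal sketch of what one per-turn telemetry record could look like, assuming the fields listed above. The class and field names (`TurnRecord`, `Source`, `Outcome`, `tokens_in_by_source`) are hypothetical and are not taken from the Madani runtime; raw tool outputs are omitted for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Source(Enum):
    """Context-token sources (a)-(e) as listed in the text."""
    TASK_INSTRUCTION = "a"
    TOOL_OUTPUT = "b"
    RETRIEVED_MEMORY = "c"
    COMPACTED_SUMMARY = "d"
    SYSTEM_PROMPT = "e"


class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial-success"
    FAILURE = "failure"
    ABANDONED = "abandoned"


@dataclass
class TurnRecord:
    """One row of per-turn telemetry (hypothetical schema)."""
    turn_id: int
    session_id: str
    tokens_in_by_source: dict[Source, int]
    tokens_out: int
    retrieved_memory_ids: list[str] = field(default_factory=list)
    written_memory_ids: list[str] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)
    outcome: Optional[Outcome] = None

    def tokens_in(self) -> int:
        """Total input tokens across all sources."""
        return sum(self.tokens_in_by_source.values())

    def source_share(self, source: Source) -> float:
        """Fraction of input tokens that came from one source."""
        total = self.tokens_in()
        return self.tokens_in_by_source.get(source, 0) / total if total else 0.0
```

Keeping the token counts keyed by source is what later allows the used fraction of context to be broken down per source.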

METHOD · §10

SNR proxy

Per turn, we computed an SNR proxy as the ratio of the mean salience score of context tokens used by the agent in its response to the mean salience score of context tokens passed to the agent but not used. Salience was scored offline by an independent Claude Sonnet instance prompted to rate each token's contribution to the final response on a 0-10 scale. The proxy is computationally expensive (we ran it on a 5% sample of turns, approximately 60,000 turns total) but produces a clean continuous SNR signal. We validated the proxy against a smaller human-annotation set (1000 turns annotated by 3 expert raters) and found inter-method agreement of 0.79 (Pearson correlation between LLM-annotator SNR and human-rater SNR).
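The ratio can be sketched as follows, assuming each context token already carries a 0-10 salience score and a boolean flag saying whether the agent used it. How the used/unused flag is assigned is not specified in the text, so the `used` input here is an assumption, as is the function name `snr_proxy`.

```python
from statistics import mean


def snr_proxy(salience: list[float], used: list[bool]) -> float:
    """Mean salience of used tokens divided by mean salience of unused tokens."""
    if len(salience) != len(used):
        raise ValueError("salience and used must have the same length")
    used_scores = [s for s, u in zip(salience, used) if u]
    unused_scores = [s for s, u in zip(salience, used) if not u]
    if not used_scores or not unused_scores:
        raise ValueError("need at least one used and one unused token")
    noise = mean(unused_scores)
    if noise == 0:
        # All unused tokens scored 0: the ratio is unbounded.
        return float("inf")
    return mean(used_scores) / noise


# Two used tokens (scores 8 and 6) against two unused tokens (scores 2 and 2).
print(snr_proxy([8, 6, 2, 2], [True, True, False, False]))  # 3.5
```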

METHOD · §11

Time-series modeling

We fit exponential-decay models to the SNR-vs-turn relationship per session and per pillar configuration. The functional form is SNR(t) = S0 · exp(-ln 2 · t / τ), where S0 is the initial SNR (mean over turns 1-10) and τ is the half-life in turns (the SNR halves every τ turns). We estimated τ per session via non-linear least squares. We then ran a quasi-experimental design (interrupted time-series) for each of the three interventions: we identified sessions that began without the intervention, identified the turn at which the intervention was introduced (deployment of compaction logic, deployment of salience retrieval, deployment of re-grounding cadence), and measured τ before and after the introduction.
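As a dependency-free stand-in for the non-linear least-squares fit, the half-life can be estimated by ordinary least squares on ln(SNR) against turn number. This sketch treats τ as a true half-life (SNR halves every τ turns), which matches the half, quarter, and 13% figures reported in the results; the function name `fit_half_life` is hypothetical, and a log-linear fit weights noisy low-SNR points differently from a non-linear fit.

```python
import math


def fit_half_life(turns: list[float], snr: list[float]) -> tuple[float, float]:
    """Fit ln(snr) = ln(s0) - ln(2) * t / half_life; return (s0, half_life)."""
    if len(turns) != len(snr) or len(turns) < 2:
        raise ValueError("need at least two (turn, snr) pairs")
    ys = [math.log(v) for v in snr]  # requires strictly positive SNR values
    n = len(turns)
    mean_x = sum(turns) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in turns)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(turns, ys))
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("SNR is not decaying; half-life is undefined")
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -math.log(2) / slope


# Synthetic session: S0 = 4.0, half-life 340 turns, sampled every 50 turns.
sample_turns = list(range(0, 1000, 50))
sample_snr = [4.0 * 0.5 ** (t / 340) for t in sample_turns]
s0, half_life = fit_half_life(sample_turns, sample_snr)
print(round(s0, 3), round(half_life, 1))
```

Running the same fit on the turns before and after an intervention's deployment gives the before/after comparison used in the interrupted time-series design.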

METHOD · §12

Scaling-law regression

We fit τ as a function of memory store size M (total persistent memory records at the time of the turn). The empirical regression yielded τ(M) = 340 - 18·log10(M/1000) for M < 50,000 records; beyond 50K the relationship saturates and τ(M) ≈ 230 regardless of further memory growth. The fit R^2 = 0.84 for the log-linear regime and R^2 = 0.31 for the saturation regime (the saturation regime has higher residual variance because additional factors beyond memory size dominate behavior at very large memory stores).
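The fitted scaling law can be written as a small piecewise function. This is a sketch of the stated regression only; the function and constant names are hypothetical, and the behavior exactly at 50,000 records is an assumption (treated here as saturated).

```python
import math

SATURATION_RECORDS = 50_000   # regime-change threshold from the text
SATURATED_HALF_LIFE = 230.0   # approximate half-life above the threshold


def half_life_turns(records: int) -> float:
    """Half-life in turns as a function of memory store size M (records)."""
    if records <= 0:
        raise ValueError("records must be positive")
    if records >= SATURATION_RECORDS:
        return SATURATED_HALF_LIFE
    return 340.0 - 18.0 * math.log10(records / 1000.0)


for m in (1_000, 10_000, 49_999, 100_000):
    print(m, round(half_life_turns(m), 1))
```

As stated, the two regimes do not join at the threshold: the log-linear branch gives roughly 309 turns just below 50,000 records, while the saturated value is about 230.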

RESULTS · §13

Baseline SNR half-life is 340 turns

The baseline finding is dramatic.

SNR audit · Madani workspace 18 months

Measured SNR half-life without curation: τ ≈ 21 days pre-Hermes auto-stale detector · τ ≈ 84 days post-Hermes (4× reduced decay rate). Files flagged as stale by detector: median 12 files/week. Skill → archive promotion rate post-flag: 23% within 2 weeks. Steady-state stale ratio pre-Hermes: 41% · post-Hermes: 8%.

SNR decays exponentially without intervention, with a half-life of 340 turns (95% CI: 310-375). After 340 turns of accumulated context, the working context carries half the signal-to-noise ratio it started with. After 1,000 turns (a typical 5-day usage pattern at the Madani workspace), the baseline SNR has dropped to 13% of its day-1 value, an 87% loss that we summarize as 87% of the working context being noise.

This is the silent killer of long-lived agents: nothing breaks dramatically; the agent just gets steadily worse at its job, and engineers attribute the degradation to "the model" rather than to context decay. The 340-turn half-life is consistent across sessions (within-session SD 38 turns) and across 8 production departments (between-department SD 24 turns), suggesting it is a property of the workspace-runtime configuration rather than of task-specific variance.

RESULTS · §14 · COUNTERINTUITIVE FINDING 1 · 87% NOISE AT 1000 TURNS. The 87% noise figure at turn 1,000 is the headline diagnostic. Engineers reflexively assume context degradation is a slow gradual process visible mainly through accumulated micro-issues.

The exponential-decay model says otherwise: SNR drops to half by turn 340, to a quarter by turn 680, to 13% by turn 1000. The decay is constant in relative terms (each successive 340-turn window strips out half of the remaining signal), so the largest absolute losses come first and the absolute loss per window shrinks later. The operational consequence: by the time an engineer notices "the agent got worse", the SNR has often dropped below 0.15, at which point virtually any model would underperform regardless of capability.

The agent is not failing; the workspace is feeding it noise.
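The half, quarter, and 13% figures follow directly from a 340-turn half-life. A short sketch of that arithmetic, with hypothetical function names and the SNR expressed as a fraction of its initial value:

```python
import math


def snr_fraction(turn: float, half_life: float = 340.0) -> float:
    """Fraction of the initial SNR remaining after `turn` turns."""
    return 0.5 ** (turn / half_life)


def turns_until(fraction: float, half_life: float = 340.0) -> float:
    """Number of turns until the SNR falls to `fraction` of its initial value."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return half_life * math.log2(1.0 / fraction)


for t in (340, 680, 1000):
    print(t, round(snr_fraction(t), 2))   # 0.5, 0.25, 0.13

# Turns needed to fall to 15% of the initial SNR at the baseline half-life.
print(round(turns_until(0.15)))
```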

RESULTS · §15 · INTERVENTION 1 · MEMORY COMPACTION EXTENDS HALF-LIFE 1.5×. Memory compaction (Reflexion-style summarization, applied every 50 turns) extends half-life from 340 to 510 turns (1.5×). Mechanism: replaces granular turn-by-turn context with structured summaries that preserve task-relevant signal at higher density.

Cost: ~2,500 tokens per compaction cycle; recovered within 8 turns through reduced context size on subsequent turns. The compaction prompt asks the agent to summarize the last 50 turns into a structured artifact preserving: (a) the high-level task state, (b) decisions made and rationale, (c) information learned, (d) open questions. The summary replaces the granular turn-history in subsequent retrieval; the granular history remains in the long-term store but is down-ranked.
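A minimal sketch of the compaction trigger and prompt, assuming a 50-turn cadence and the four summary sections listed above. The prompt wording and the function names are hypothetical and are not the Madani reference implementation; the call to the model itself is left out.

```python
COMPACTION_INTERVAL = 50  # turns between compactions, per the text

SUMMARY_SECTIONS = (
    "High-level task state",
    "Decisions made and rationale",
    "Information learned",
    "Open questions",
)


def should_compact(turn: int, interval: int = COMPACTION_INTERVAL) -> bool:
    """True on turns 50, 100, 150, ... and never on turn 0."""
    return turn > 0 and turn % interval == 0


def build_compaction_prompt(turn_history: list[str],
                            interval: int = COMPACTION_INTERVAL) -> str:
    """Build a summarization prompt over the most recent `interval` turns."""
    window = turn_history[-interval:]
    sections = "\n".join(
        f"{i}. {name}" for i, name in enumerate(SUMMARY_SECTIONS, start=1)
    )
    transcript = "\n".join(window)
    return (
        f"Summarize the last {len(window)} turns into a structured artifact "
        f"with exactly these sections:\n{sections}\n\nTurns:\n{transcript}"
    )


def break_even_savings(cost_tokens: float = 2500, recovery_turns: int = 8) -> float:
    """Per-turn token savings implied by recovering the cost in N turns."""
    return cost_tokens / recovery_turns


print(break_even_savings())  # 312.5 tokens saved per turn
```

The last function only restates the cost claim: recovering ~2,500 tokens within 8 turns implies a context reduction of roughly 310 tokens per subsequent turn.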

RESULTS · §16 · INTERVENTION 2 · SALIENCE-WEIGHTED RETRIEVAL EXTENDS HALF-LIFE 1.7×. Salience-weighted retrieval (top-K reranking with K=8, replacing fixed-window retrieval of the last 20 turns) extends half-life from 340 to 580 turns (1.7×). Mechanism: pulls only the most relevant past memories per turn instead of the most recent.

Cost: one additional embedding computation per turn (~50 ms p95). We implemented the salience scoring via a hybrid: lexical BM25 over memory text + dense embedding cosine similarity + a final cross-encoder rerank step. The hybrid is necessary because pure embedding retrieval has known recall failures on lexical-heavy queries (e.g., specific names, error codes, customer IDs).
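The hybrid can be sketched as score fusion followed by a rerank, assuming the lexical (BM25) and dense (cosine) scores have already been computed per candidate memory. The text does not say how the two scores are combined, so the min-max normalization, the 0.5 weight, and the shortlist size of 32 are assumptions; `rerank` is a hypothetical callable standing in for the cross-encoder.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    memory_id: str
    lexical: float  # e.g. a BM25 score against the current query
    dense: float    # e.g. embedding cosine similarity to the query


def _minmax(values: Sequence[float]) -> list[float]:
    """Scale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_top_k(candidates: Sequence[Candidate],
                 rerank: Callable[[Candidate], float],
                 k: int = 8,
                 shortlist: int = 32,
                 lexical_weight: float = 0.5) -> list[Candidate]:
    """Fuse lexical and dense scores, shortlist, then rerank and keep top-k."""
    if not candidates:
        return []
    lex = _minmax([c.lexical for c in candidates])
    den = _minmax([c.dense for c in candidates])
    fused = sorted(
        zip(candidates, lex, den),
        key=lambda row: lexical_weight * row[1] + (1 - lexical_weight) * row[2],
        reverse=True,
    )
    short = [c for c, _, _ in fused[:shortlist]]
    # The rerank step sees only the shortlist, which bounds its cost per turn.
    return sorted(short, key=rerank, reverse=True)[:k]


pool = [Candidate(f"m{i}", lexical=float(i), dense=float(10 - i)) for i in range(10)]
top = hybrid_top_k(pool, rerank=lambda c: c.dense)
print([c.memory_id for c in top])
```

Keeping the lexical signal in the fusion is what covers exact-match queries such as names, error codes, and customer IDs.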

RESULTS · §17 · INTERVENTION 3 · PERIODIC RE-GROUNDING EXTENDS HALF-LIFE 1.4×. Periodic re-grounding (an explicit "what is the current task" recap every 25 turns, written into the context as a structured block) extends half-life from 340 to 470 turns (1.4×). Mechanism: defeats specification drift (the agent's tendency to re-interpret the task across turns).

The re-grounding block is short (~200 tokens) and includes: (a) the original task statement, (b) the current sub-task within that task, (c) any constraints or success criteria that should remain invariant across turns. Re-grounding is cheaper than compaction (no LLM call required, just a structured insertion) and complementary in mechanism (compaction summarizes the past; re-grounding anchors the present).
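Because re-grounding is a structured insertion with no model call, it reduces to a small helper. The block layout, the `[RE-GROUNDING]` marker, and the function names below are hypothetical; only the 25-turn cadence and the three content items come from the text.

```python
REGROUND_INTERVAL = 25  # turns between re-grounding blocks, per the text


def regrounding_block(task: str, sub_task: str, constraints: list[str]) -> str:
    """Render the recap: original task, current sub-task, invariant constraints."""
    lines = [
        "[RE-GROUNDING]",
        f"Original task: {task}",
        f"Current sub-task: {sub_task}",
        "Invariant constraints:",
    ]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)


def maybe_reground(turn: int, context: list[str], task: str, sub_task: str,
                   constraints: list[str],
                   interval: int = REGROUND_INTERVAL) -> list[str]:
    """Return the context with a recap appended on every `interval`-th turn."""
    if turn > 0 and turn % interval == 0:
        return context + [regrounding_block(task, sub_task, constraints)]
    return context


ctx = maybe_reground(
    25, ["...prior context..."],
    task="Track delivery status for account X",
    sub_task="Reconcile the latest carrier update",
    constraints=["Never contact the customer directly"],
)
print(ctx[-1])
```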

RESULTS · §18

Interventions compound multiplicatively

The three interventions are independent (statistical interaction tests p > 0.4 across all three pairwise interactions) and compound multiplicatively in their joint effect on half-life. Combined, they extend half-life from 340 to 950 turns (2.8×), pushing the working context degradation threshold well past typical long-lived-agent operational horizons. Multiplicativity is the surprising structural fact: additive composition would predict a compound effect of 1.5 + 1.7 + 1.4 - 2 = 2.6× (subtracting the baseline that would otherwise be counted three times), whereas the unnormalized product is 1.5 × 1.7 × 1.4 ≈ 3.57× and the observed compound, after approximate normalization, is 2.8×, above the additive prediction. The multiplicativity arises because each intervention addresses a different mechanism (granularity, relevance, drift) and the mechanisms operate independently.
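For reference, the two composition rules can be evaluated directly from the three single-intervention factors and the 340 and 950 turn half-lives reported here. The variable names are illustrative.

```python
from math import prod

# Single-intervention half-life multipliers reported in the text.
factors = {"compaction": 1.5, "salience_retrieval": 1.7, "regrounding": 1.4}

# Additive rule: baseline (1.0) plus each intervention's gain over baseline.
additive = 1.0 + sum(f - 1.0 for f in factors.values())

# Multiplicative rule: plain product of the three multipliers.
multiplicative = prod(factors.values())

# Observed compound: combined half-life over baseline half-life.
observed = 950 / 340

print(f"additive={additive:.2f} multiplicative={multiplicative:.2f} "
      f"observed={observed:.2f}")
```

The observed ratio of about 2.79 sits between the additive value of 2.6 and the plain product of 3.57.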

RESULTS · §19 · COUNTERINTUITIVE FINDING 2 · MULTIPLICATIVE COMPOUND. The multiplicative compound is operationally significant because it changes the cost-benefit calculation. If interventions were additive, the third intervention would add only marginal value beyond the first two, and teams would reasonably stop at one or two interventions.

Because the compound is multiplicative, the third intervention extends the half-life by a factor (1.4×) on top of the already-extended half-life, producing a much larger absolute gain than the additive accounting would suggest. The recommendation is concrete: implement all three interventions, not just one or two. The combined overhead is ~6% of compute spend (memory-compaction tokens + embedding computations + re-grounding insertions), which is dwarfed by the 2.8× quality extension.

RESULTS · §20 · COUNTERINTUITIVE FINDING 3 · COMPACTION IS UNDER-USED. The Reflexion paper (Shinn et al., NeurIPS 2023) introduced the concept of memory compaction via structured summarization three years ago. The production adoption rate is shockingly low.

In our 47-pilot field study (WSB-08), only 3 of 47 pilots had any form of memory compaction. The mechanism we hypothesize for low adoption is a translation gap: the Reflexion paper presented the idea in an academic frame (verbal reinforcement learning, reflection-on-trial), and the operational adapter pattern (when to compact, what to keep, how to validate) was left to readers to derive. Teams that have not made the operational translation effort default to "no compaction" because no default exists.

The compaction skill in the Madani autoresearch system is open-source as a reference implementation precisely to close this translation gap.

RESULTS · §21 · COUNTERINTUITIVE FINDING 4 · LOG-LINEAR IN MEMORY SIZE. The scaling law τ(M) = 340 - 18·log10(M/1000) for M < 50,000 records implies SNR half-life is approximately log-linear in memory store size, not linear and not constant. Each 10× growth in memory store reduces the half-life by 18 turns.

For workspaces growing from 1,000 to 10,000 records (a typical 3-month growth at Madani), the half-life shrinks from 340 turns to 322 turns. For workspaces growing from 10K to 100K records (a 12-month growth), the same law extrapolates to 304 turns, although 100K records lies above the 50K threshold where τ(M) saturates around 230 turns (§24). The scaling law is gentle but consistent: the workspace's SNR-management problem gets steadily harder over time even with no other changes.

The implication: SNR interventions are not "set once, forget" — they need periodic re-calibration as memory grows.

RESULTS · §22 · COUNTERINTUITIVE FINDING 5 · SALIENCE BEATS FULL CONTEXT. The salience-weighted retrieval intervention produces a 0.34 std deviation SNR improvement at approximately 1/4 the cost of a full 200K-window pass. The intuition for many engineers is "more context = better quality", which would predict the full 200K-window pass should win.

The empirical result inverts the intuition. Mechanism: the full-context pass forces the model to attend to 200K tokens including substantial noise, paying the lost-in-the-middle attention penalty (Liu et al. 2024) and producing diffuse attention; the salience-weighted top-K pass forces the model to attend to ~5-10K tokens of high-relevance content, producing focused attention. Quality is higher, cost is lower, latency is lower.

The lesson: "more context" is a counterproductive optimization above some threshold; "more relevant context" is the right optimization.

RESULTS · §23 · COUNTERINTUITIVE FINDING 6 · ENGINEERS BLAME THE MODEL. We conducted 5 enterprise audits in 2025-2026 where teams reported their production agent "got worse" over a month or two and asked us to diagnose. In all 5 cases, the team's initial hypothesis was model regression: the model vendor must have changed the underlying model in a way that hurt quality.

In all 5 cases, root-cause analysis identified SNR collapse rather than model regression. The agent had not changed; the workspace had drifted. The diagnostic signature is clear: model regression would produce uniformly worse outputs across all tasks; SNR collapse produces worse outputs that correlate with task complexity (high-context tasks suffer more than low-context tasks).

The audits each took ~2 days; the remediation (deploy the three interventions) took ~5 days per workspace; in all 5 cases the perceived "model regression" disappeared after the workspace fix.
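The diagnostic signature can be checked with a plain correlation between how much context each task needs and how much its quality dropped. This is a sketch under assumptions: the 0.5 cutoff and the function names are hypothetical and are not thresholds used in the audits.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def looks_like_snr_collapse(context_tokens: list[float],
                            quality_drop: list[float],
                            threshold: float = 0.5) -> bool:
    """True when degradation tracks context size rather than being uniform."""
    return pearson(context_tokens, quality_drop) >= threshold


# Per-task context size and drop in success rate since deployment.
tokens = [1_000, 5_000, 20_000, 60_000]
print(looks_like_snr_collapse(tokens, [0.01, 0.05, 0.20, 0.50]))  # True
print(looks_like_snr_collapse(tokens, [0.20, 0.21, 0.19, 0.20]))  # False
```

A roughly uniform drop across task sizes, as in the second call, is the pattern the text associates with model regression instead.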

RESULTS · §24 · COUNTERINTUITIVE FINDING 7 · THE 50K-RECORD THRESHOLD. The scaling-law analysis identifies a regime change at approximately 50,000 memory records. Below 50K, τ(M) follows the log-linear scaling law described in §21. Above 50K, τ(M) saturates around 230 turns regardless of further memory growth.

The saturation reflects an architectural limit: the three interventions described here become insufficient at very large memory stores because the salience-weighted retrieval cannot effectively rank across 50K+ records without itself becoming noisy. Workspaces above 50K need additional architectural interventions: chunked-by-topic memory partitioning (memory split into topic-coherent chunks with separate retrieval per chunk), hierarchical retrieval (first-level retrieval over chunk descriptions, second-level retrieval within selected chunks), or topic-aware compaction. We have not yet rigorously evaluated these higher-tier interventions; preliminary deployments suggest they recover another 1.5-2× extension on top of the 2.8× from the three core interventions.

RESULTS · §25

Scaling laws

We fit SNR decay as a function of two variables: (a) turns since last compaction, (b) total memory store size. The empirical regression yields SNR(t, M) ≈ S0 · exp(-ln 2 · t/τ(M)), where τ(M) is the half-life in turns at memory store size M. Best-fit gives τ(M) = 340 - 18·log10(M/1000) for M < 50,000 records; beyond 50K the relationship saturates and τ(M) ≈ 230.

The practical implication: SNR decay is approximately log-linear in memory store size, and the 50K-record threshold marks a regime change. Workspaces operating below 50K records (most production deployments we audit) experience predictable SNR dynamics; workspaces above need additional architectural interventions beyond compaction.

DISCUSSION · §26 · SNR DECAY IS THE DEFAULT, NOT THE EXCEPTION. Any long-lived agent that doesn't actively defend against decay will degrade. This makes the three interventions essentially mandatory for production long-lived agents — they are not "optimizations", they are baseline hygiene.

The Madani workspace operational policy (WAB Pillar 03, Memory) requires all three by default at L3 maturity. We have piloted the L3 requirement as a procurement gate for new agentic deployments: vendors must demonstrate the three interventions are present at go-live or the deployment is flagged for architectural review. The pilot has caught 4 deployments that would have shipped without compaction and 2 that would have shipped without salience-weighted retrieval.

DISCUSSION · §27

Memory compaction is the most under-used intervention

The 3-of-47 adoption rate from our WSB-08 field study suggests memory compaction is the single largest improvement opportunity for the long-lived agent ecosystem. The cost is modest (~2,500 tokens per cycle), the benefit is significant (1.5× half-life extension as a single-intervention contribution), and the implementation pattern is well-documented in the Reflexion paper. The barrier is operational translation: when to compact, what to keep, how to validate the compaction did not lose critical information. Our reference implementation in the Madani autoresearch skill provides a default that teams can adopt directly.

DISCUSSION · §28 · SNR AS THE TEMPORAL DYNAMIC OF α. WSB-04 introduced α as a per-turn information-theoretic master variable. SNR is the long-run dynamic of α over time.

The combined picture: optimize α per turn (salience-weighted retrieval, prompt pruning, structured tool outputs), and apply the three temporal interventions (compaction, salience retrieval, re-grounding) to keep α from decaying as the agent ages. The workspace configuration that does both sustains agent quality indefinitely; the workspace that does neither produces an agent that "mysteriously" gets worse after a month. The integrated theory frames per-turn quality and longitudinal quality as two faces of the same underlying information variable.

DISCUSSION · §29

Comparison with short-horizon agent studies

Most agentic research is conducted on short-horizon tasks (single benchmark, single session). Our 6-month longitudinal study shows that conclusions from short-horizon studies do not automatically transfer to long-lived deployments. For example, the salience-weighted retrieval gain we measured (+0.42 std SNR improvement) is significantly larger than typical short-horizon-paper measurements (which tend to report +0.05 to +0.15) because the SNR benefit compounds over hundreds of turns. The implication for the field: agentic research that does not include longitudinal evaluation is missing important phenomena that only become visible at long horizons.

DISCUSSION · §30

Diagnostic implications

The 5 enterprise audits we ran where "model regression" was actually SNR collapse generalize beyond those specific cases. The diagnostic signature (quality degradation correlated with task complexity, not uniform across tasks) is a fast first-line diagnostic for any team reporting a long-lived agent has gotten worse. The recommended diagnostic protocol:

  1. (b)
    instrument SNR proxy and measure half-life
  2. (c)
    deploy the three interventions
  3. (d)
    re-measure at 30 days

The protocol typically resolves the perceived model-regression problem within a week of engineering work.

LIMITATIONS · §31

Limitations

(a) Our SNR proxy is correlational with task outcome but not causally validated; we observe r = 0.71 between SNR and task success, but the causal direction (SNR → success) is not formally proven. (b) The half-life numbers are workspace-specific and may not generalize to workspaces with different turn rates, different task distributions, or different memory strategies. (c) The 1.2M-turn dataset is from a single workspace (Madani); cross-workspace replication is needed to confirm the scaling-law parameters generalize. (d) The salience-weighted retrieval intervention requires an embedding model; this introduces an external dependency that the SNR result is conditioned on. (e) The 50K-record saturation threshold is workspace-specific and the higher-tier interventions for very large memory stores are not yet rigorously validated.

LIMITATIONS · §32

On causal direction

We observe correlation between SNR and task outcome but cannot formally prove SNR causes outcome rather than the reverse. A confound: tasks that are intrinsically hard may produce both lower SNR (more context required) and lower outcome (harder to succeed). We mitigated by partitioning tasks by intrinsic difficulty (using the metacognition primitive from WSB-06) and verified the SNR-outcome correlation holds within each difficulty stratum.

This is suggestive but not conclusive evidence for causal direction. A randomized intervention trial (intervention assigned at random across matched task pairs) would be more rigorous; we have submitted a protocol for such a trial to 5 collaborator workspaces.

FUTURE WORK · §33

Future work

(1) Multi-workspace replication of the SNR scaling-law parameters (5 willing collaborators committed). (2) Causal validation via randomized controlled trials of the three interventions on matched task pairs. (3) Online SNR monitoring tool that alerts when SNR drops below operational thresholds (default 0.4). (4) Higher-tier interventions for workspaces above 50K memory records (chunked-by-topic, hierarchical retrieval). (5) Cross-model comparison of the SNR scaling-law parameters (does τ depend on which model is the agent backbone?). (6) The decay-rate-vs-model-capability question: does τ increase as model capability improves? Or is τ a workspace-architecture property independent of model?

IMPLEMENTATION PLAYBOOK · §34

Adopting the three interventions

STEP 1 · INSTRUMENT SNR PROXY. Modify your agent runtime to log per-turn telemetry: context tokens IN by source, context tokens used in output, task-outcome label. Run the SNR proxy (LLM annotator scoring each token's contribution) on a 5% sample. Compute baseline half-life via exponential-decay fit.

STEP 2 · DEPLOY MEMORY COMPACTION. Implement a compaction prompt that runs every 50 turns. The prompt should preserve task state, decisions, learnings, and open questions in a structured format ~2,500 tokens. Replace granular turn-history in subsequent retrieval.

STEP 3 · DEPLOY SALIENCE-WEIGHTED RETRIEVAL. Implement hybrid retrieval (BM25 + dense embeddings + cross-encoder rerank) with K=8 top-K. Replace fixed-window retrieval of the last 20 turns.

STEP 4 · DEPLOY PERIODIC RE-GROUNDING. Implement a structured re-grounding block (~200 tokens) inserted every 25 turns containing original task, current sub-task, invariant constraints.

STEP 5 · MEASURE AT 30 DAYS. Re-run SNR proxy on a 5% sample after 30 days of intervention. Verify half-life has extended to ~950 turns from baseline 340 turns.

IMPLEMENTATION PLAYBOOK · §35

Anti-patterns we observed

ANTI-PATTERN 1 · "WE DON'T NEED COMPACTION, WE HAVE LONG CONTEXTS NOW". Long context windows are not a substitute for compaction. The lost-in-the-middle attention penalty plus the salience-vs-volume tradeoff mean that beyond ~20-30K tokens of context, additional context degrades quality even when the model can technically attend to it. Compaction is necessary at any context length above the salience threshold.

ANTI-PATTERN 2 · "WE'LL DEPLOY ONE INTERVENTION AT A TIME". The three interventions compound multiplicatively; the joint effect (2.8×) is significantly greater than any individual effect (1.4-1.7×). Deploying all three simultaneously is cheap once the engineering investment is made. Single-intervention deployments leave value on the table.

ANTI-PATTERN 3 · "OUR EVALUATION SUITE WOULD CATCH SNR DECAY". Most evaluation suites are point-in-time benchmarks run against fresh workspaces. They do not capture longitudinal decay. SNR proxy instrumentation is the correct measurement; do not rely on evaluation suites alone.

ANTI-PATTERN 4 · "RE-GROUNDING IS JUST A PROMPTING TRICK, NOT WORTH IT". Re-grounding looks cheap and trivial, which makes engineers skeptical of its impact. The data show a clean 1.4× half-life extension at near-zero cost. Skepticism does not survive measurement.

DISCUSSION · §36

Comparison with KV-cache management

The inference-serving literature on KV-cache management (PagedAttention, vLLM, SGLang) optimizes inference throughput within a single deployment. Our work is at the workspace layer above the inference stack. The two layers are complementary: KV-cache management makes a given context cheap to process; SNR management makes the context itself decision-useful.

A team that has optimized KV-cache but not SNR has cheap noise; a team that has optimized SNR but not KV-cache has expensive signal. Production deployments need both.

DISCUSSION · §37

Implications for agent evaluation

The agent-evaluation literature is overwhelmingly short-horizon: benchmarks measure single-session task success. Our findings suggest this is an under-measurement: real production deployments are long-lived, and quality dynamics on long horizons differ qualitatively from quality on short horizons. We propose a longitudinal evaluation extension: each benchmark should be run not just on a fresh workspace but on a workspace pre-populated with N turns of synthetic prior context.

The benchmark score becomes a function of N, and the slope of that function characterizes the agent's long-lived robustness. This evaluation extension is straightforward to implement and would surface the SNR-collapse problem at benchmark scale.

DISCUSSION · §38

Integration with metacognition and Reflexion

The three interventions in this paper integrate with the metacognition primitive (WSB-06) and the Reflexion primitive (WSB-11). Metacognition provides the pre-task confidence gate; Reflexion provides the post-task learning loop; the SNR interventions provide the context-quality maintenance. The four primitives (α optimization per turn, metacognition pre-task, Reflexion post-task, SNR interventions throughout) form a coherent long-lived agent architecture. Each is independently valuable; the combination is the production-grade integration.

EXTENDED METHODS · §34

Salience-scoring procedure details

The salience scoring procedure operates as follows. For each sampled turn (5% of all turns), we run an independent Claude Sonnet 4.5 instance with a structured prompt: given the agent's response and the full context that produced it, score each context segment's contribution to the response on a 0-10 scale. The prompt is designed to be model-agnostic (no Claude-specific syntax) and uses few-shot examples calibrated against human-rater agreement.

The scoring takes ~3 seconds per turn and ~12 seconds per session on the 5% sample. We chose 5% rather than 100% to balance measurement granularity against compute cost. Sensitivity analysis at 10% and 20% sampling shows the SNR estimates are stable (mean drift <0.02 across resampling rates).
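The sampling and aggregation steps can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: hashing the turn id to pick the 5% sample and collapsing the 0-10 segment scores into a token-weighted mean on [0, 1] are both choices made here for the example, and `in_sample` and `snr_proxy` are hypothetical names.

```python
import hashlib
from typing import List, Tuple


def in_sample(turn_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of turns by hashing the turn id."""
    digest = hashlib.sha256(turn_id.encode("utf-8")).digest()
    # First 4 bytes as an integer, mapped to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def snr_proxy(segments: List[Tuple[int, float]]) -> float:
    """Token-weighted mean salience, rescaled from the 0-10 scale to [0, 1].

    `segments` holds (token_count, annotator_score) pairs for one turn.
    ASSUMPTION: this aggregation is illustrative; the source does not
    specify how segment scores are combined into a per-turn SNR value.
    """
    total_tokens = sum(tokens for tokens, _ in segments)
    if total_tokens == 0:
        return 0.0
    weighted = sum(tokens * score for tokens, score in segments)
    return weighted / (10.0 * total_tokens)
```

Hash-based selection keeps the sample stable across re-runs, which makes resampling comparisons at other rates reproducible: a turn selected at 5% is also selected at 10% and 20%.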

EXTENDED METHODS · §35

Session-level SNR fitting

We fit the exponential decay model SNR(t) = S0 · exp(-t/τ) per session via non-linear least squares (scipy.optimize.curve_fit). Sessions with fewer than 50 turns were excluded to avoid fitting noise. The remaining ~4,200 sessions all converged to fitted τ values; the distribution of fitted τ has median 340 turns, mean 348 turns, IQR 305-385, consistent with the population estimate. The skewness toward higher τ values reflects a tail of sessions in workflows with intrinsically high baseline SNR (technical debugging, where the agent's context is dominated by directly task-relevant tool outputs).
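For readers who want to reproduce the shape of this fit without a numerical library, the sketch below estimates S0 and τ by ordinary least squares on ln(SNR). This is a deliberate substitution: it is a log-linear fit, not the non-linear least-squares fit (scipy.optimize.curve_fit) that the study uses, and the two weight noisy observations differently. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_exponential_decay(turns: Sequence[float], snr: Sequence[float]) -> Tuple[float, float]:
    """Fit SNR(t) = S0 * exp(-t / tau) via least squares on ln(SNR).

    Returns (S0, tau). A log-linear stand-in for a non-linear fit.
    """
    if len(turns) != len(snr) or len(turns) < 2:
        raise ValueError("need at least two paired observations")
    if any(s <= 0 for s in snr):
        raise ValueError("SNR values must be positive to take logarithms")
    ys = [math.log(s) for s in snr]
    n = len(turns)
    mean_t = sum(turns) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in turns)
    if sxx == 0:
        raise ValueError("turn values must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(turns, ys))
    slope = sxy / sxx  # equals -1 / tau when the data decay
    if slope >= 0:
        raise ValueError("no decay detected; tau is undefined")
    intercept = mean_y - slope * mean_t  # equals ln(S0)
    return math.exp(intercept), -1.0 / slope


def half_life(tau: float) -> float:
    """Turns for SNR to halve under SNR(t) = S0 * exp(-t / tau)."""
    return tau * math.log(2)
```

Under this parameterization the e-folding constant τ and the half-life τ·ln 2 are distinct quantities; `half_life` converts from one to the other.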

EXTENDED METHODS · §36

Intervention attribution via interrupted time-series

The three interventions (compaction, salience retrieval, re-grounding) were each deployed at specific calendar timestamps across departments. We exploit the staggered rollout for quasi-experimental identification: for each department-intervention pair, we identify a "pre" window of 30 days before deployment and a "post" window of 30 days after deployment, and compare τ in each window. The within-department comparison controls for department-level confounds (task complexity, team behavior, model version). Deviation tests (Chow tests for structural break) confirmed significant change in τ at deployment timestamps for all three interventions across all 8 departments (24 department-intervention pairs, all p < 0.01 after Bonferroni correction).
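A Chow statistic for one department-intervention pair can be sketched in plain Python. The setup is an assumption made for illustration: each window is a series of (day, τ estimate) points modelled by a straight line (k = 2 parameters), and the function returns only the F statistic and its degrees of freedom. Converting F to a p-value requires an F-distribution survival function from a statistics library, and the Bonferroni step would then multiply each p-value by the 24 comparisons.

```python
from typing import List, Sequence, Tuple


def _ols_rss(x: Sequence[float], y: Sequence[float]) -> float:
    """Residual sum of squares of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def chow_f_statistic(
    x_pre: Sequence[float], y_pre: Sequence[float],
    x_post: Sequence[float], y_post: Sequence[float],
) -> Tuple[float, int, int]:
    """Chow test for a structural break between the pre and post windows.

    Returns (F, df_numerator, df_denominator) for a k=2 linear model.
    """
    k = 2  # intercept and slope
    if len(x_pre) <= k or len(x_post) <= k:
        raise ValueError("each window needs more than k observations")
    x_all: List[float] = list(x_pre) + list(x_post)
    y_all: List[float] = list(y_pre) + list(y_post)
    rss_pooled = _ols_rss(x_all, y_all)                        # one line for both windows
    rss_split = _ols_rss(x_pre, y_pre) + _ols_rss(x_post, y_post)  # one line per window
    df_denominator = len(x_all) - 2 * k
    if rss_split == 0:
        raise ValueError("perfect fit in both windows; F is undefined")
    f_stat = ((rss_pooled - rss_split) / k) / (rss_split / df_denominator)
    return f_stat, k, df_denominator
```

A large F indicates that fitting the two 30-day windows separately explains the τ series much better than a single pooled line, which is the signature of a break at the deployment timestamp.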

EXTENDED METHODS · §37

Compaction prompt design

The compaction prompt has the structure: "Given the following 50-turn agent history, produce a structured summary preserving: (1) the high-level task state, (2) decisions made and rationale, (3) information learned, (4) open questions. Limit total summary to 500 tokens. Use the following template: [structured Markdown sections]." The 500-token cap is empirically derived: longer summaries do not produce measurable SNR improvement and consume more downstream context budget. Shorter summaries (250-token cap) lose decision-rationale detail and reduce SNR improvement by ~30%. The 500-token sweet spot has been stable across departments and tasks.
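A prompt builder following this structure might look like the sketch below. The quoted instruction text and the 500-token cap come from the description above; the Markdown section headings in `SUMMARY_SECTIONS` are hypothetical placeholders, since the source does not reproduce its template, and `build_compaction_prompt` is an illustrative name.

```python
from typing import Sequence

# ASSUMPTION: placeholder headings; the actual template sections are not given in the source.
SUMMARY_SECTIONS = (
    "Task state",
    "Decisions and rationale",
    "Information learned",
    "Open questions",
)


def build_compaction_prompt(turns: Sequence[str], token_cap: int = 500) -> str:
    """Assemble a compaction prompt for a window of agent turns."""
    if not turns:
        raise ValueError("nothing to compact")
    template = "\n".join(f"## {name}\n- ..." for name in SUMMARY_SECTIONS)
    history = "\n".join(f"[turn {i}] {text}" for i, text in enumerate(turns, start=1))
    return (
        f"Given the following {len(turns)}-turn agent history, produce a structured "
        "summary preserving: (1) the high-level task state, (2) decisions made and "
        "rationale, (3) information learned, (4) open questions. "
        f"Limit total summary to {token_cap} tokens. Use the following template:\n"
        f"{template}\n\nHistory:\n{history}"
    )
```

The cap is stated in the prompt only; a caller that needs a hard limit would still have to count tokens in the model's output with its own tokenizer and truncate or retry.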

EXTENDED METHODS · §38

Hybrid retrieval architecture

The salience-weighted retrieval intervention uses a 3-stage pipeline. (Stage 1) BM25 lexical retrieval over the full memory store, returning top-50 candidates. (Stage 2) Dense embedding cosine similarity (using text-embedding-3-large) re-ranks the 50 candidates to top-20. (Stage 3) Cross-encoder rerank (using a fine-tuned BGE-reranker model) selects the final top-8. The hybrid is necessary because pure embedding retrieval has known failures on lexical-heavy queries (specific names, error codes, customer IDs). BM25 catches these; dense embeddings catch semantic similarities; cross-encoder rerank produces the final ordering. The combined precision@5 is 0.79, vs 0.51 for pure cosine and 0.62 for pure BM25.
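The 50 → 20 → 8 funnel can be expressed independently of any particular retriever. In this sketch the three scorers are passed in as plain callables, so BM25, the embedding model, and the cross-encoder are stand-ins to be supplied by the caller; `hybrid_retrieve` and `Scorer` are hypothetical names, and only the stage sizes come from the description above.

```python
from typing import Callable, List, Sequence

# (query, document) -> relevance score; higher means more relevant.
Scorer = Callable[[str, str], float]


def _top_n(query: str, docs: Sequence[str], scorer: Scorer, n: int) -> List[str]:
    """Rank docs by scorer(query, doc), highest first, and keep the first n."""
    return sorted(docs, key=lambda d: scorer(query, d), reverse=True)[:n]


def hybrid_retrieve(
    query: str,
    memory: Sequence[str],
    lexical: Scorer,        # Stage 1 scorer, e.g. a BM25 implementation
    dense: Scorer,          # Stage 2 scorer, e.g. embedding cosine similarity
    cross_encoder: Scorer,  # Stage 3 scorer, e.g. a reranker model
    k_lexical: int = 50,
    k_dense: int = 20,
    k_final: int = 8,
) -> List[str]:
    """Three-stage funnel: lexical recall, dense re-rank, cross-encoder final order."""
    stage1 = _top_n(query, memory, lexical, k_lexical)
    stage2 = _top_n(query, stage1, dense, k_dense)
    return _top_n(query, stage2, cross_encoder, k_final)
```

Ordering the stages from cheapest to most expensive means the cross-encoder scores at most 20 candidates per query regardless of memory store size.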

EXTENDED METHODS · §39

Re-grounding block format

The re-grounding block is inserted as a structured context segment every 25 turns with the format: "[RE-GROUNDING CHECKPOINT @ turn N] Original task: <task statement>. Current sub-task: <current focus>. Invariant constraints: <constraints>. Recent decisions: <last 3 key decisions>." Total length: ~200 tokens. The block is generated automatically from the session's task-state object; the agent does not need to produce it. We chose the 25-turn cadence empirically: cadences of 50 turns show drift accumulation between checkpoints; cadences of 10 turns produce diminishing returns on intervention effectiveness while consuming context budget.
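Generating the block from a task-state object could be sketched as follows. The `TaskState` fields and the `regrounding_block` function are hypothetical; the source describes a task-state object but not its schema. The output string follows the format quoted above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskState:
    """Illustrative task-state schema; field names are assumptions."""
    original_task: str
    current_subtask: str
    constraints: List[str]
    decisions: List[str] = field(default_factory=list)


def regrounding_block(state: TaskState, turn: int, cadence: int = 25) -> Optional[str]:
    """Return the checkpoint text when `turn` falls on the cadence, else None."""
    if turn < 1 or turn % cadence != 0:
        return None
    recent = state.decisions[-3:]  # the last 3 key decisions
    return (
        f"[RE-GROUNDING CHECKPOINT @ turn {turn}] "
        f"Original task: {state.original_task}. "
        f"Current sub-task: {state.current_subtask}. "
        f"Invariant constraints: {'; '.join(state.constraints)}. "
        f"Recent decisions: {'; '.join(recent)}."
    )
```

Because the block is derived from state rather than written by the agent, it cannot itself drift; its accuracy depends entirely on the task-state object being kept current.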

DISCUSSION · §40

Why multiplicativity

The three interventions compound multiplicatively rather than additively because each addresses a different mechanism. Compaction reduces context volume (granularity reduction). Salience retrieval improves context relevance (signal selection).

Re-grounding restores task anchor (drift reduction). The mechanisms are independent: improving granularity does not by itself improve relevance, and so on. When all three are applied, each independently extends the half-life, and the extensions stack multiplicatively because they affect different sources of decay.

The multiplicativity is a consequence of mechanism independence and would not hold if the interventions overlapped in mechanism.
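The difference between the two composition rules can be made concrete with a short worked comparison. The symbols are introduced here for illustration and are not the source's notation: τ0 is the baseline half-life and m_i is the standalone extension factor of intervention i. The second line evaluates both rules with every factor set to 1.4, the low end of the reported 1.4-1.7× single-intervention range.

```latex
\begin{align*}
\tau_{\text{mult}} &= \tau_0 \prod_{i=1}^{3} m_i, &
\tau_{\text{add}} &= \tau_0 \Bigl(1 + \sum_{i=1}^{3} (m_i - 1)\Bigr) \\
m_i = 1.4:\quad \tau_{\text{mult}} &= 1.4^{3}\,\tau_0 = 2.744\,\tau_0, &
\tau_{\text{add}} &= (1 + 3 \times 0.4)\,\tau_0 = 2.2\,\tau_0
\end{align*}
```

The observed extension from 340 to ~950 turns is a factor of about 2.8, which sits at the multiplicative value and well above the additive one for these illustrative factors.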

DISCUSSION · §41

The fourth intervention question

Beyond the three interventions described here, what other mechanisms might extend SNR half-life further? Preliminary exploration suggests three candidates. (a) PROMPT PRUNING: actively removing redundant or outdated content from the working context per turn. Initial measurements show ~1.2× extension as a standalone intervention. (b) STRUCTURED TOOL OUTPUTS: replacing free-form tool output with structured (typed, fielded) tool outputs reduces the noise contribution of tool calls, for ~1.15× extension. (c) DOMAIN-AWARE COMPACTION: compaction prompts that are domain-specific (legal vs medical vs finance vs technical) outperform generic compaction prompts by ~1.25× because they preserve domain-specific signal more reliably. Combining all three with the existing three suggests a theoretical ceiling above the 2.8× extension we currently achieve, but this is a forward projection and has not been empirically validated.

DISCUSSION · §42

Alternative interventions we tested and rejected

We tested several interventions that produced no measurable SNR improvement: (a) RAG-style retrieval over external knowledge bases — useful for factual grounding but did not affect SNR of the working context. (b) Periodic context-window flushing — destroyed useful long-term state without producing offsetting benefit. (c) Larger context window (200K → 1M tokens) — produced "lost-in-the-middle" attention failures and no SNR improvement. (d) Chain-of-thought prompting at every turn — added latency without affecting SNR. These null results are diagnostic: not every plausible intervention actually helps; the three that do are mechanism-justified and empirically validated.

DISCUSSION · §43

Cross-department transfer of findings

The three interventions transfer cleanly across all 8 departments at Madani (lead-generation, setting, sales, delivery, organization, finance, content, voice-channel). The per-department effect sizes vary somewhat (range 2.4× to 3.1× compound half-life extension), but every department benefits and none has a counter-example. This cross-departmental consistency suggests the interventions are general workspace-architecture properties rather than task-specific tricks. Teams adopting these interventions can expect similar benefits across diverse task distributions.

DISCUSSION · §44

Comparison with model-version upgrade

A natural counterfactual: instead of deploying the three interventions, what if the team simply upgraded to a newer model (e.g., Claude Sonnet 4.5 → 5.0)? Model upgrades typically produce 5-15% task-success improvement on standard benchmarks. The three interventions produce 12-17% task-success improvement on production tasks (per WSB-11 numbers on the same dataset). The improvements are comparable in magnitude. The intervention approach has two advantages: (a) it is model-agnostic, working across any underlying model; (b) it compounds, since interventions deployed today carry over to future model upgrades. The model-upgrade approach has its own advantages: (a) it requires zero engineering effort; (b) it catches benefits the interventions miss. The two are complementary, not substitutes; the recommendation is to deploy both.

DISCUSSION · §45

Implications for evaluation methodology

Standard agent-evaluation methodology measures task success on a fresh workspace (no prior context, no memory). Our findings suggest this is an under-measurement: real production workspaces are not fresh. We propose a longitudinal evaluation extension: each benchmark should be run not just on a fresh workspace but on a workspace pre-populated with N turns of synthetic prior context (we suggest N = 100, 500, 1000, 5000).

The benchmark score becomes a function of N, and the slope of that function characterizes the agent's long-lived robustness. This evaluation extension is straightforward to implement and would surface the SNR-collapse problem at benchmark scale. We have submitted a proposal to the AgentBench maintainers to incorporate this extension; preliminary response is favorable.
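The proposed score-versus-N curve and its slope can be sketched as below. The harness is hypothetical: `run_benchmark` stands for any callable that runs a benchmark on a workspace pre-populated with N synthetic turns and returns a score. Regressing on log10(N) rather than N is an assumption made here because the suggested depths span almost two orders of magnitude.

```python
import math
from typing import Callable, Dict, Sequence


def score_by_depth(
    run_benchmark: Callable[[int], float],
    depths: Sequence[int] = (100, 500, 1000, 5000),
) -> Dict[int, float]:
    """Run the benchmark once per prior-context depth N and collect the scores."""
    return {n: run_benchmark(n) for n in depths}


def robustness_slope(scores: Dict[int, float]) -> float:
    """Least-squares slope of score against log10(N): score change per tenfold increase in N."""
    if len(scores) < 2:
        raise ValueError("need scores at two or more depths")
    if any(n <= 0 for n in scores):
        raise ValueError("depths must be positive")
    xs = [math.log10(n) for n in scores]
    ys = list(scores.values())
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("depths must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope near zero indicates an agent whose benchmark score is insensitive to accumulated context; a strongly negative slope is the benchmark-scale signature of SNR collapse.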

EXTENDED CASE STUDY · §46

The lead-generation department SNR deep dive

The lead-generation department is the highest-volume workspace at Madani (~180 tasks/day) with the longest sessions. Pre-intervention SNR half-life: 312 turns (slightly below the workspace average of 340 because of higher task variability per session). Post-intervention: 890 turns.

Mechanism: lead-gen sessions typically include sequence drafting, prospect research, and follow-up scheduling — three sub-tasks with different context demands. Without compaction, the working context accumulates research-mode context (long-form prospect background) plus drafting-mode context (style guides, prior touches) plus scheduling-mode context (calendar, timezone), and the cross-mode noise dominates. With compaction every 50 turns, each mode-transition triggers a re-summarization that preserves only the task-relevant signal from the prior mode.

SNR stays high across mode transitions, and overall half-life extends.

EXTENDED CASE STUDY · §47

The finance department SNR deep dive

The finance department had the most dramatic intervention effect: pre-intervention τ = 280 turns, post-intervention τ = 1020 turns (3.6× extension, above the workspace average of 2.8×). The over-performance is attributable to finance's task distribution: many tasks involve cross-referencing multiple transaction logs, which produces high-volume context naturally. The salience-weighted retrieval intervention has particularly large effect because the relevant past transactions can be retrieved precisely rather than via brute-force window pass. Finance tasks went from 58% success rate to 78% over 6 months, the largest absolute improvement among our 8 departments.

References

[1] Shinn N., Cassano F., Berman E., Gopinath A., Narasimhan K., Yao S. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS 2023, arXiv:2303.11366.
[2] Park J. et al. (2023), Generative Agents: Interactive Simulacra of Human Behavior, UIST.
[3] Sumers T. et al. (2024), Cognitive Architectures for Language Agents, TMLR.
[4] Liu N. et al. (2024), Lost in the Middle: How Language Models Use Long Contexts, TACL.
[5] Beltagy I. et al. (2020), Longformer: The Long-Document Transformer, arXiv:2004.05150.
[6] Tay Y. et al. (2022), Long Range Arena: A Benchmark for Efficient Transformers, ICLR.
[7] Hsieh C.-P. et al. (2024), RULER: What's the Real Context Size of Your Long-Context Language Models?, arXiv:2404.06654.
[8] Lee J. et al. (2024), LOFT: A 1 Million-Token Long-Context Benchmark, arXiv:2406.13121.
[9] Kuratov Y. et al. (2024), In Search of Needles in a 11M Haystack (BABILong).
[10] Kwon W. et al. (2023), Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM), SOSP.
[11] Zheng L. et al. (2024), SGLang: Efficient Execution of Structured Language Model Programs, NeurIPS.
[12] Kaplan J. et al. (2020), Scaling Laws for Neural Language Models, arXiv:2001.08361.
[13] Hoffmann J. et al. (2022), Training Compute-Optimal Large Language Models (Chinchilla), arXiv:2203.15556.
[14] Anthropic (2024-2025), Building Agents Cookbook.
[15] Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1.
[16] Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems, arXiv:2604.02460.
[17] Cemri M. et al. (2025), Why Do Multi-Agent LLM Systems Fail? (MAST), arXiv:2503.13657v3, NeurIPS 2025.
[18] OpenAI (2024), Prompt Caching for Reduced Latency.
[19] Madani Lab (2026), 6-month SNR Longitudinal Study (raw data + analysis code, MIT release).
[20] Madani Lab (2026), Reference Implementation: Three-Intervention Long-Lived Agent Architecture (MIT release).
[21] Madani Lab (2026), WAB Pillar 03 (Memory) Maturity Model v0.4 (open spec).
[22] Cover T.M. & Thomas J.A. (2006), Elements of Information Theory (2nd ed.), Wiley-Interscience.
Madani Lab · WAB v0.3.4