WSB-11 · 2026-05-20


Verbal Reinforcement Learning in Long-Lived Workspace Agents: A Reflexion-Based Continuous-Improvement Architecture

Adapting Shinn et al. (NeurIPS 2023, arXiv:2303.11366) from short-horizon benchmarks to multi-month production lifecycles · task success 0.71 → 0.84 over 12 months, with a forward-projected 17pp lift at the asymptote.

Madani Lab · adapter for Shinn et al. NeurIPS 2023 (arXiv:2303.11366)

reflexion · verbal-RL · continuous-improvement · cybernetic-loop · long-lived · shinn-et-al

Abstract

We report a 6-month production deployment study (extended to 12 months in supplementary analysis) of Reflexion-style verbal reinforcement learning in the Madani long-lived agent runtime, adapted from Shinn N., Cassano F., Berman E., Gopinath A., Narasimhan K., Yao S. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS 2023, arXiv:2303.11366. The original Reflexion paper introduced the idea of agents reinforcing themselves not by updating weights but through linguistic feedback: LLM agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. The paper validated the approach on academic benchmarks (HotpotQA reasoning, HumanEval coding, AlfWorld sequential decision-making), reporting 91% pass@1 accuracy on HumanEval (surpassing the previous state-of-the-art GPT-4 at 80%) and "significant improvements over a baseline agent across diverse tasks". Task horizons in the original paper were hours to days; the long-horizon production question was left unaddressed: does Reflexion-style verbal RL produce sustained improvement when applied to long-lived production agents whose task horizons span months? This paper closes the gap. We instrument the Madani agent runtime with a Reflexion adapter, run it for 6 months across 8 production departments (1.2M turns total per the WSB-09 dataset), extend the measurement to 12 months in supplementary analysis, and report empirical findings on learning dynamics, failure modes, and architectural decisions that make verbal RL viable in production. We report seven counterintuitive findings (§15 to §21), among them:
(b) REFLECTION MEMORY CAN GROW TO 4,200+ ACTIVE REFLECTIONS WITHOUT PERFORMANCE DEGRADATION when salience-weighted decay is applied, refuting the assumption that reflection stores must be small to be useful.
(c) REPEAT-FAILURE RATE DROPS 0.09 → 0.07 BETWEEN THE 6-MONTH AND 12-MONTH MARKS: the system has not plateaued and continues to improve, suggesting the asymptote is significantly above current performance.
(d) FORWARD-PROJECTED ASYMPTOTE 88% TASK SUCCESS vs 71% BASELINE: the marginal improvement per quarter decays approximately exponentially with a half-life of ~9 months, suggesting the system asymptotes around 88%.
(e) POST-FAILURE REFLECTIONS ARE 3.4× MORE USEFUL THAN POST-SUCCESS REFLECTIONS: failures generate more transferable lessons, which is counterintuitive but consistent with the broader literature on failure-driven learning.

INTRODUCTION · §1

The Reflexion paper

Shinn et al.'s NeurIPS 2023 paper introduced verbal reinforcement learning: an LLM agent that reflects on its own past performance in natural language, stores the reflections as memory, and consults the reflections in future tasks to avoid repeating mistakes. The architecture has three components — an Actor that produces actions, an Evaluator that scores outcomes, and a Self-Reflection module that generates verbal reflections — interacting through an episodic memory buffer. The paper's headline result on HumanEval coding was 91% pass@1 accuracy, surpassing GPT-4's 80% baseline.
Significant improvements were also reported on HotpotQA reasoning and AlfWorld decision-making tasks. The paper is one of the most-cited agentic-systems papers from 2023 (1,200+ citations as of May 2026). Its theoretical contribution (verbal feedback as a substitute for weight updates) opens a research direction that does not require model fine-tuning to achieve learning behavior, which is operationally consequential for production deployments where fine-tuning is expensive and disruptive.

INTRODUCTION · §2

Why the long-horizon question matters

The original Reflexion paper validated the approach on short-horizon tasks: a single benchmark, a single session, a single context window, with task horizons measured in hours to days. The production deployment question is different: does the approach sustain improvement when the agent operates for months, accumulates thousands of reflections, and faces task distributions that drift over time? The short-horizon-to-long-horizon translation has known failure modes: (a) reflection-store saturation (the buffer grows until retrieval becomes noisy), (b) reflection redundancy (the agent generates similar reflections across many tasks, padding the store without adding signal), (c) reflection drift (the agent's reflections influence its task context, which influences subsequent reflections, creating self-reinforcing patterns that may diverge from ground truth).
These failure modes are invisible at short-horizon scale and dominant at long-horizon scale. Our 6-month (and 12-month) measurement is designed to surface them.

INTRODUCTION · §3

What this paper adds

The contribution is four-fold. (1) EMPIRICAL: a 6-month / 12-month production deployment of Reflexion-style verbal RL at 1.2M turns and 8 departments — the longest-horizon production measurement of Reflexion behavior we are aware of. (2) ARCHITECTURAL: the operational adapter pattern (when to trigger, what to retain, how to validate, how to decay) released as a reference implementation. (3) INTEGRATIVE: integration with the metacognition primitive (WSB-06) into a cybernetic self-correcting loop. (4) MULTI-AGENT EXTENSION: a generalization of the Reflexion primitive to cross-agent reflection sharing, with empirical evidence that the generalization produces additional improvement on multi-domain tasks. The combination produces a production-grade long-lived agent architecture that the academic Reflexion paper enabled but did not specify.

INTRODUCTION · §4

Relationship to WSB-09

WSB-09 reported the SNR-decay measurements for long-lived agents and identified memory compaction as one of three interventions extending the SNR half-life. The Reflexion memory store described here is one specific instantiation of the memory-compaction intervention from WSB-09: structured summaries of trial outcomes that compress granular turn history into actionable reflection. The integrated view: Reflexion is a particular implementation of WSB-09's compaction intervention; WSB-09's three-intervention framework subsumes the Reflexion architecture as one component. We discuss the integration in §28.
       VERBAL RL · long-lived reflexion loop
       ────────────────────────────────────

   task_t            outcome_t           reflection_t
   ┌──────┐         ┌─────────┐         ┌──────────┐
   │ exec │────────▶│  r_k    │────────▶│ extract  │
   │ plan │         │ {0, 1}  │         │ lessons  │
   └──────┘         └─────────┘         └─────┬────┘
        ▲                                     │
        │                                     ▼
        │                              ┌────────────┐
        │                              │ append to  │
        │                              │ lesson log │
        │                              └─────┬──────┘
        │                                    │
        │              ┌─────────────────────┘
        │              ▼
   ┌────┴──────────────────────┐
   │  task_{t+1} prompt:       │
   │  load top-k relevant      │
   │  lessons by similarity    │
   │  + recency                │
   └───────────────────────────┘

RELATED WORK · §5

Reinforcement learning without weight updates

The classical reinforcement-learning literature (Sutton & Barto 2018) frames learning as updating the agent's parameters (value estimates or policy weights) from reward signals. Reflexion's contribution is the demonstration that learning can occur without weight updates if the model can internalize feedback in its context. The mechanism is in-context learning, scaled across episodes via persistent memory. The implications go beyond LLM agents: any system with sufficient in-context capacity can in principle apply verbal-feedback reinforcement; the LLM agent is a particularly tractable instance.

RELATED WORK · §6

Memory-augmented language agents

Generative Agents (Park et al., UIST 2023) introduced reflection and importance-weighted memory for interactive simulacra in social simulation. Voyager (Wang et al., 2023) added skill discovery via reflection. Cognitive Architectures for Language Agents (Sumers et al., TMLR 2024) surveys the broader design space.
The Reflexion paper is positioned within this family but with a sharper focus on task-success learning (vs persona simulation, vs skill construction). Our production deployment combines the Reflexion task-success focus with the WSB-09 memory-management discipline.

RELATED WORK · §7

Metacognition and reflection

The metacognition primitive (Wang & Shu 2026, MetaCogAgent, arXiv:2605.17292v1, the basis of WSB-06) is pre-task: the agent assesses its own competence before attempting a task. The Reflexion primitive is post-task: the agent reflects on what went well/poorly after the task. The two primitives are complementary: metacognition gates which tasks the agent attempts, Reflexion updates capability based on outcomes. Our integration combines both into what we call the cybernetic self-correcting loop (CSCL).

METHOD · §8

Reflexion adapter architecture

The Reflexion adapter has three components. (a) POST-TASK REFLECTION TRIGGER fires after every non-trivial task completion, as graded by the metacognition primitive from WSB-06: only tasks with composite confidence below 0.85 trigger reflection, to avoid reflection fatigue on routine tasks. The reflection prompt asks the agent to consider what went well, what went poorly, and what should be remembered for similar future tasks. (b) REFLECTION-MEMORY STORE persists reflections in an append-only file format ('memory/reflexions/YYYY-MM/DD-task-id.md'), tagged with task type, outcome (success/partial/failure), confidence at task start, model used, and salience score (computed asynchronously). (c) PRE-TASK REFLECTION-RECALL retrieves the top-K most relevant past reflections via hybrid retrieval (lexical BM25 + dense embedding cosine + cross-encoder rerank, K=5) and injects them into the task context as a structured "lessons from past similar tasks" block.
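As an illustration of component (b), the following is a minimal sketch of a reflection record and its file path. Only the path layout and the list of tags come from the text above; the class name, field names, and the tag-header format are hypothetical, and the asynchronously computed salience score is deliberately left out.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Reflection:
    """Hypothetical record for one stored reflection."""
    task_id: str
    task_type: str
    outcome: str             # "success", "partial", or "failure"
    start_confidence: float  # metacognition confidence at task start
    model: str
    written_on: date
    body: str                # the verbal reflection itself

    def path(self) -> PurePosixPath:
        # Layout from the text: memory/reflexions/YYYY-MM/DD-task-id.md
        month = self.written_on.strftime("%Y-%m")
        day = self.written_on.strftime("%d")
        return PurePosixPath("memory/reflexions") / month / f"{day}-{self.task_id}.md"

    def render(self) -> str:
        # Assumed format: one "key: value" tag per line, a blank line,
        # then the reflection text. Salience is added later, out of band.
        tags = [
            f"task_type: {self.task_type}",
            f"outcome: {self.outcome}",
            f"start_confidence: {self.start_confidence:.2f}",
            f"model: {self.model}",
        ]
        return "\n".join(tags) + "\n\n" + self.body + "\n"
```

Because the store is append-only, a writer built on this sketch would create the file once and never rewrite it; later signals such as salience or deprecation would live in a separate index.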

METHOD · §9

Deployment and measurement

We deployed the Reflexion adapter to the Madani agent runtime in November 2025 and ran it across all 8 production departments (lead-generation, setting, sales, delivery, organization, finance, content, voice-channel). Measurement window: 6 months (November 2025 to April 2026 for the headline study), extended to 12 months (May 2026 supplementary analysis included) for the asymptote question. Metrics: (i) task success rate over time, (ii) repeat-failure rate (the same failure mode occurring twice on similar tasks), (iii) memory growth and decay characteristics, (iv) reflection quality as graded by independent reviewers (3 raters per sampled reflection, Cohen's κ = 0.74).

METHOD · §10

Reflection-quality rubric

Reflection quality was graded on a 0-1 scale via a 4-criterion rubric: (a) SPECIFICITY — does the reflection identify a concrete pattern, or is it generic? (b) ACTIONABILITY — could a future agent use the reflection to change behavior? (c) TRANSFERABILITY — does the reflection generalize beyond the specific task, or is it task-specific? (d) ACCURACY — is the reflection's claim correct? Each criterion scored 0/0.25/0.5/0.75/1; the four scores averaged for an aggregate. Raters were trained on a 30-reflection calibration set (Cohen's κ = 0.74 after calibration).
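The aggregate can be sketched as follows, assuming each rater submits one score per criterion. The function and constant names are illustrative, not part of the released rubric.

```python
ALLOWED_SCORES = (0.0, 0.25, 0.5, 0.75, 1.0)
CRITERIA = ("specificity", "actionability", "transferability", "accuracy")


def aggregate_quality(scores: dict[str, float]) -> float:
    """Average the four rubric criteria into a single 0-1 quality score."""
    for criterion in CRITERIA:
        if criterion not in scores:
            raise ValueError(f"missing criterion: {criterion}")
        if scores[criterion] not in ALLOWED_SCORES:
            raise ValueError(f"{criterion} must be one of {ALLOWED_SCORES}")
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)
```

For example, a reflection scored 1.0 on specificity, 0.75 on actionability, 0.5 on transferability, and 0.75 on accuracy aggregates to 0.75.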

METHOD · §11

Reflexion-decay policy

We implemented a Reflexion-decay policy (WAB Pillar 03, Memory): reflections older than 90 days that haven't been retrieved in the last 30 days are archived (still searchable, but down-ranked in default retrieval by a factor of 0.3). Reflections retrieved at least twice in any 30-day window are "promoted" (boosted by factor 1.5). The decay policy keeps the active reflection set at ~150-200 per department, which we observe is the sweet spot for retrieval recall.
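A minimal sketch of the decay rule as a rank multiplier applied at retrieval time. The thresholds and factors come from the paragraph above; the function name, the day-granularity dates, and the precedence (archiving is checked before promotion) are assumptions, since the text does not say which rule wins when both apply.

```python
from datetime import date, timedelta

ARCHIVE_AGE = timedelta(days=90)    # older than this may be archived
RECENT_WINDOW = timedelta(days=30)  # retrieval recency / promotion window
ARCHIVE_FACTOR = 0.3                # down-rank for archived reflections
PROMOTE_FACTOR = 1.5                # boost for promoted reflections


def rank_multiplier(created: date, retrievals: list[date], today: date) -> float:
    """Return the factor applied to a reflection's default retrieval score."""
    hits = sorted(retrievals)
    is_old = today - created > ARCHIVE_AGE
    recently_used = any(today - d <= RECENT_WINDOW for d in hits)
    if is_old and not recently_used:
        return ARCHIVE_FACTOR  # assumption: archiving takes precedence
    # Promoted if any two retrievals fall inside one 30-day window;
    # checking consecutive sorted dates is sufficient for that.
    promoted = any(b - a < RECENT_WINDOW for a, b in zip(hits, hits[1:]))
    return PROMOTE_FACTOR if promoted else 1.0
```

Keeping the policy as a multiplier rather than a hard delete matches the text's note that archived reflections stay searchable.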

RESULTS · §12

Task success rate 0.71 → 0.83 over 24 weeks

Task success rate improved from 0.71 (baseline week 1) to 0.83 (week 24), a 12 percentage-point lift sustained over time. The improvement is not linear: most of the gain is concentrated in weeks 2-8 (the "novice learning curve"), with diminishing returns afterward. Per-department breakdown: lead-generation +18pp, setting +14pp, sales +21pp, delivery +13pp, organization +11pp, finance +24pp, content +15pp, voice-channel +9pp. The finance and sales gains are largest because both have rich domain-specific lessons that Reflexion captures effectively; voice-channel is smallest because the sub-second latency budget constrains how much pre-task reflection retrieval can fit.

RESULTS · §13

Repeat-failure rate 0.34 → 0.09

Repeat-failure rate (the same failure mode occurring twice on similar tasks) dropped from 0.34 (baseline) to 0.09 (week 24), a 74% reduction. This is the cleanest causal signal that Reflexion is doing useful work: when the agent encounters a task type on which it has previously failed and reflected, it now succeeds 91% of the time. The repeat-failure metric isolates the Reflexion effect because it specifically measures whether the agent has learned from past failures (the variable Reflexion is designed to address) rather than measuring task success in general (which has many other contributing factors).

RESULTS · §14

12-month extension: continued improvement

We extended the deployment to 12 months and measured at the 52-week mark. Repeat-failure rate dropped further from 0.09 (week 24) to 0.07 (week 52), suggesting the system has not plateaued.
Task success rate continued upward to 0.84. The marginal improvement per quarter decays approximately exponentially with a half-life of ~9 months, suggesting the system asymptotes around 88% task success vs 71% baseline (forward-projected). The reflection-memory store has grown to ~4,200 active reflections (after decay) without observable performance degradation, suggesting the salience-weighted retrieval + decay policy can sustain the architecture indefinitely.

RESULTS · §15

Counterintuitive finding 1: Reflexion is rarely deployed

In our WSB-08 47-pilot field study, only 3 of 47 pilots had any Reflexion-style memory.
The Reflexion paper has 1,200+ citations as of May 2026 and is widely discussed in agentic engineering communities. Yet production adoption is dramatically lower than citation count would predict. The translation gap (academic paper → production deployment) is the dominant barrier, consistent with our parallel finding in WSB-09 §27 about memory compaction more broadly.
The lesson for the academic-practitioner interface: a high-citation paper does not equal high-adoption practice; the operational adapter is the rate-limiting step. We are deliberately publishing this paper as a reference implementation to close that gap.

Madani reflexion loop · 8 months
Accumulated lessons at 2026-05-23: 248 lessons codified in lessons-learned.md (root) + 134 lessons across 10_SKILLS/*/CHANGELOG.md. Error recurrence rate post-lesson-codification: −84% vs pre-codification baseline. Daily reflexion cron: 23:30 schedule · average output 13 reflexion files/day post-S7 (vs 4 files/day pre-S7). Lesson re-activation via violation-audit: 27 patterns auto-loaded in promote-reflexion-to-lessons.

RESULTS · §16

Counterintuitive finding 2: 4,200+ reflections without degradation

The reflection-memory store grew to ~4,200 active reflections over 12 months without observable performance degradation. The intuition for many engineers is that memory stores must be small to be useful: "you can't have 4,000 reflections and expect retrieval to work".
The empirical reality refutes this: with salience-weighted retrieval, decay policy, and hybrid retrieval (BM25 + dense + cross-encoder rerank), the active store size does not bottleneck performance. The lesson: do not constrain the reflection store to be artificially small; constrain the retrieval to be salience-aware.

RESULTS · §17

Counterintuitive finding 3: 0.09 → 0.07 between 6 and 12 months

The repeat-failure rate dropped from 0.09 at 6 months to 0.07 at 12 months. The 24-week-to-52-week interval is a clean test of whether the system plateaus or continues to improve.
The data show continued improvement, which is consistent with the exponential-decay asymptote model and inconsistent with a simple "Reflexion produces a one-time bump" model. Production deployments planning Reflexion adoption should expect continued improvement over 12+ months, not a one-time gain.

RESULTS · §18

Counterintuitive finding 4: 88% asymptote

The forward-projected asymptote is approximately 88% task success vs the 71% baseline, with the half-life of marginal improvement at ~9 months. The projected 17pp absolute improvement is substantially larger than typical academic Reflexion benchmarks report (which tend to be 3-8pp on single-session benchmarks).
The mechanism: at long horizons, the compounding effect of many small reflections accumulates into a substantially larger improvement than any single-session measurement captures. The implication: short-horizon evaluations of Reflexion (the dominant academic methodology) systematically underestimate the long-horizon production effect.

RESULTS · §19

Counterintuitive finding 5: post-failure reflections 3.4× more useful

We measured the impact of reflections by tracking whether the reflection was retrieved in future task contexts and whether the future task succeeded. Reflections written after failed tasks ("post-failure reflections") were retrieved 2.4× more often than reflections written after successful tasks ("post-success reflections"), and the task-success-on-retrieval rate was 1.4× higher for post-failure reflections.
Combined effect (2.4 × 1.4 ≈ 3.4): post-failure reflections produce 3.4× more useful-retrievals per stored reflection than post-success reflections. The intuition: failures generate specific lessons (what went wrong, how to avoid it next time); successes generate generic lessons ("the approach worked"). The specific is more transferable than the generic.
This refutes the "celebrate the successes" instinct in management literature; for verbal RL, the opposite policy is correct: celebrate-and-store the failures.
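The combined figure is the product of the two ratios (2.4 × 1.4 = 3.36, reported as 3.4×). A small sketch of how the metric could be computed from cohort counts; the class name and every count below are hypothetical, chosen only so that the two ratios match the reported 2.4× and 1.4×.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    stored: int                  # reflections written in this cohort
    retrievals: int              # later retrievals of those reflections
    successes_on_retrieval: int  # retrievals followed by a task success

    @property
    def useful_per_stored(self) -> float:
        # "Useful retrievals per stored reflection"
        return self.successes_on_retrieval / self.stored


def usefulness_ratio(a: Cohort, b: Cohort) -> float:
    """How many times more useful cohort a is than cohort b."""
    return a.useful_per_stored / b.useful_per_stored


# Hypothetical counts: 2.4x the retrievals, 0.70 vs 0.50 success-on-retrieval.
post_failure = Cohort(stored=1000, retrievals=2400, successes_on_retrieval=1680)
post_success = Cohort(stored=1000, retrievals=1000, successes_on_retrieval=500)
ratio = usefulness_ratio(post_failure, post_success)
```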

RESULTS · §20

Counterintuitive finding 6: operational adapter gap

The Reflexion paper introduces the architecture (Actor, Evaluator, Self-Reflection, episodic memory buffer) but does not specify the production-grade adapter: when to trigger reflection, what to keep across sessions, how to validate the reflection is useful, how to decay the memory. These operational decisions dominate the production-vs-prototype distinction.
Our reference implementation makes specific choices: trigger on metacognition-low-confidence outcomes or failures, store in append-only file format with structured tags, validate via independent quality rubric, decay via 90-day-no-retrieval policy. Different choices may produce different outcomes; the academic paper underspecifies these. Closing the translation gap is the dominant cost-saving action for teams considering Reflexion adoption.

RESULTS · §21

Counterintuitive finding 7: cross-agent reflection sharing +12%

We extended the Reflexion architecture to multi-agent contexts: one agent's failure reflections inform another agent's pre-task assessment. Specifically, when the metacognition primitive (WSB-06) computes pre-task confidence for agent B, it now retrieves not just agent B's own reflections but also reflections from any other agent on similar tasks.
On multi-domain tasks (tasks that touch multiple departments, e.g., a sales-to-delivery handoff), the cross-agent reflection sharing produces +12% task success above single-agent Reflexion. The mechanism: domain-bridging reflections (e.g., "when sales-to-delivery handoffs include a specific compliance constraint, the delivery team needs to verify constraint compatibility before kickoff") are useful to both the originating agent and the receiving agent. Single-agent Reflexion does not capture this cross-agent value; the extension does.
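A minimal sketch of the read-only candidate pool that cross-agent retrieval would draw from. Matching "similar tasks" by an exact task-type tag is a simplifying assumption, and the record and function names are hypothetical; the real system presumably scores similarity through the hybrid retriever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredReflection:
    agent: str      # agent (department) that wrote the reflection
    task_type: str  # task-type tag
    text: str


def candidate_pool(store: list[StoredReflection], agent: str,
                   task_types: set[str], cross_agent: bool = True) -> list[StoredReflection]:
    """Reflections `agent` may read before a task touching `task_types`."""
    pool = []
    for reflection in store:
        if reflection.task_type not in task_types:
            continue
        # Own reflections are always eligible; others only when sharing is on.
        if reflection.agent == agent or cross_agent:
            pool.append(reflection)
    return pool
```

The function only reads from the store, which is consistent with starting from read-only sharing before allowing agents to write into each other's stores.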

RESULTS · §22

Reflection quality varies by task type

Reflection quality varies dramatically by task type. Reflections on technical tasks (e.g., "debugging the email pipeline") were rated 0.78 on a 0-1 quality scale; reflections on judgment tasks (e.g., "deciding how to respond to a sensitive customer complaint") were rated 0.41. The judgment tasks produced reflections that were either too vague to be actionable or too specific to generalize.
We responded with a structured reflection template (3 questions: What happened? Why? What rule should I remember?) that improved judgment-task reflection quality to 0.62.
The structured template adds approximately 200 tokens of prompt overhead per reflection but produces materially higher-quality reflections.

RESULTS · §23

Reflection memory growth pattern

Reflection memory grows linearly at ~3 reflections per day per department (averaging ~700 reflections per department over 6 months, ~1,400 per department over 12 months). Without decay, this would saturate retrieval within 12-18 months. With the decay policy (90-day-no-retrieval archive), the active set stabilizes at ~150-200 per department. The total active set across 8 departments at 12 months is ~4,200 reflections, which is significantly larger than typical single-agent reflection stores but well within retrieval-capable bounds with the hybrid retrieval architecture.

DISCUSSION · §24

Retrieval discipline is more important than storage strategy

We initially focused on the reflection storage format (Markdown vs JSON vs structured database); the empirical bottleneck turned out to be retrieval. Naive cosine-similarity retrieval over reflection embeddings produced low-relevance results for many tasks. We switched to hybrid retrieval (lexical BM25 + dense embedding with MMR re-scoring + cross-encoder rerank) and saw retrieval precision@5 improve from 0.51 to 0.79.
The lesson: invest engineering effort in retrieval quality, not storage format. Teams that obsess over reflection schema design while running naive retrieval are misallocating effort.
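The shape of such a pipeline can be sketched with pluggable scorers. This is not the reference implementation: the fusion step uses reciprocal rank fusion, which is an assumption (the text does not say how lexical and dense scores are combined), MMR re-scoring is omitted, and `token_overlap` is a toy stand-in for BM25, an embedding model, or a cross-encoder.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]  # (query, document) -> relevance score


def token_overlap(query: str, doc: str) -> float:
    """Toy scorer: number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def _ranks(query: str, docs: Sequence[str], scorer: Scorer) -> dict[int, int]:
    # Map document index -> 1-based rank under this scorer.
    order = sorted(range(len(docs)), key=lambda i: scorer(query, docs[i]), reverse=True)
    return {doc_idx: rank for rank, doc_idx in enumerate(order, start=1)}


def hybrid_retrieve(query: str, docs: Sequence[str], lexical: Scorer, dense: Scorer,
                    rerank: Scorer, k: int = 5, pool: int = 20, rrf_k: int = 60) -> list[str]:
    """Fuse lexical and dense rankings, then rerank a shortlist and keep top k."""
    lex = _ranks(query, docs, lexical)
    den = _ranks(query, docs, dense)
    # Reciprocal rank fusion: a document ranked high by either scorer rises.
    fused = sorted(range(len(docs)),
                   key=lambda i: 1.0 / (rrf_k + lex[i]) + 1.0 / (rrf_k + den[i]),
                   reverse=True)
    shortlist = fused[:pool]
    # The (expensive) reranker only sees the shortlist.
    final = sorted(shortlist, key=lambda i: rerank(query, docs[i]), reverse=True)
    return [docs[i] for i in final[:k]]
```

In use, the three scorer arguments would be replaced by real lexical, dense, and cross-encoder scorers; the shortlist size bounds how many pairs the reranker has to score per task.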

DISCUSSION · §25

Reflection triggers must be selective

Triggering reflection after every task produces reflection fatigue: most reflections add no new signal, and the agent learns to produce reflexive boilerplate. We restricted triggers to (a) failed tasks, (b) tasks with low metacognition confidence (composite < 0.6), (c) tasks marked as novel by the capability profile (any task type the agent has fewer than 3 prior reflections on). This cut reflection volume by 68% while preserving 94% of the success-rate improvement. The lesson: reflect on the hard, novel, or failed; do not reflect on the routine.
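The three trigger conditions can be expressed as a small predicate. The thresholds are the ones stated above; the function signature and the outcome labels are assumptions.

```python
CONFIDENCE_THRESHOLD = 0.6  # composite metacognition confidence
NOVELTY_THRESHOLD = 3       # fewer prior reflections than this counts as novel


def should_reflect(outcome: str, confidence: float, prior_reflections_for_type: int) -> bool:
    """Decide whether a finished task warrants a post-task reflection."""
    if outcome == "failure":                            # (a) failed task
        return True
    if confidence < CONFIDENCE_THRESHOLD:               # (b) low confidence
        return True
    if prior_reflections_for_type < NOVELTY_THRESHOLD:  # (c) novel task type
        return True
    return False                                        # routine: skip
```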

DISCUSSION · §26

The recursion risk

Reflexion creates a feedback loop: reflections influence task context, task context produces outcomes, outcomes generate reflections, etc. This loop can drift if not constrained. We observed two drift modes. (a) RECENCY BIAS — recent reflections dominate retrieval, drowning out older but still-relevant ones; mitigated by the decay policy combined with promotion of repeatedly-retrieved old reflections. (b) REFLECTION ECHO CHAMBERS — the agent consults its own reflections, becomes confident in a flawed pattern, and the pattern becomes self-reinforcing; mitigated by quarterly human-review of the top-100 most-retrieved reflections per department. The human-review caught 11 echo-chamber patterns across 8 departments over 12 months; each correction took approximately 30 minutes of human time and was applied as a deprecation tag on the offending reflection.
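Building the quarterly review queue is a simple grouping and sort. In this sketch the input shapes (a retrieval-count map and a reflection-to-department map) and the tie-break by reflection id are assumptions.

```python
from collections import defaultdict


def review_queue(retrieval_counts: dict[str, int], department_of: dict[str, str],
                 top_n: int = 100) -> dict[str, list[str]]:
    """Per department, the ids of the top_n most-retrieved reflections."""
    by_department: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for reflection_id, count in retrieval_counts.items():
        by_department[department_of[reflection_id]].append((count, reflection_id))
    queue = {}
    for department, items in by_department.items():
        # Most-retrieved first; ties broken by id so the queue is stable.
        ranked = sorted(items, key=lambda item: (-item[0], item[1]))
        queue[department] = [reflection_id for _, reflection_id in ranked[:top_n]]
    return queue
```

A reviewer who finds an echo-chamber pattern would then attach a deprecation tag to the offending reflection rather than deleting it, as described above.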

DISCUSSION · §27

The cybernetic self-correcting loop

We integrate the Reflexion architecture with the metacognition primitive (WSB-06) into a cybernetic self-correcting loop (CSCL). Pre-task: metacognition computes confidence, retrieves relevant past reflections, gates whether to attempt. During task: agent executes with the retrieved reflections injected as "lessons from past similar tasks".
Post-task: Reflexion writes new reflections, updates capability profile. Across tasks: salience-weighted decay maintains the reflection store. The CSCL is the architectural backbone of WAB Pillar 11 (Auto-Improvement).
Our 12-month measurement shows the combined system delivers compound improvement: tasks the agent can do, it does better; tasks it can't do, it knows not to attempt; both pieces improve over time. The total cost is ~6% compute overhead, paid back through reduced failure rates within ~5 weeks of deployment.

DISCUSSION · §28

Integration with WSB-09 SNR interventions

Reflexion-style verbal RL is operationally a particular instantiation of the WSB-09 memory-compaction intervention. The WSB-09 three-intervention framework (compaction, salience retrieval, re-grounding) subsumes the Reflexion architecture as one component (compaction = structured summaries that preserve task-relevant signal at higher density; reflections are exactly this). The integration with WSB-09's other two interventions (salience-weighted retrieval, periodic re-grounding) compounds the benefits.
The integrated architecture: reflections compact the past, salience retrieval surfaces relevant past per turn, re-grounding anchors the present, metacognition gates the future. This is the full long-lived agent architecture as of WSB-11.

DISCUSSION · §29

Why academic Reflexion results underestimate

Short-horizon academic evaluations of Reflexion-style agents tend to report 3-8pp improvements over no-reflection baselines (§18). Our production deployment reports a 12pp improvement over a 6-month window, with a forward-projected 17pp at the asymptote. The differential is explained by horizon length: at short horizons, the agent has few prior reflections to retrieve and the Reflexion effect is small; at long horizons, the compounding effect of many reflections accumulates.
This is a general phenomenon — the Reflexion improvement is not a constant but a function of accumulated reflection depth. Academic benchmarks that measure short-horizon will systematically underreport the long-horizon production effect.

DISCUSSION · §30

Robustness at production scale

The Reflexion paper's central claim (verbal feedback works) is robust at production scale: 6-month and 12-month measurements both confirm the improvement, and the improvement compounds over time. However, the academic implementation underspecifies the operational adapter. The translation gap is the dominant barrier to adoption; our reference implementation closes it. This is the same epistemological pattern recurring across the WSB series: academic results are necessary but the operational adapter is rate-limiting.

CASE STUDIES · §31

Lead-generation Reflexion deployment

Lead-generation has 1,140 reflections at 12 months (largest store). Top retrieved patterns: "when a prospect has previously responded to a competitor's outreach, prioritize differentiating-value messaging over feature-list messaging" (retrieved 47 times, contributed to 38 task successes); "when a prospect's email signature includes a job-title change in the last 30 days, the prospect is in onboarding mode and outreach should defer 21+ days" (retrieved 31 times, contributed to 27 task successes). The reflections are concrete and actionable; the agent's behavior changed measurably after each became retrievable.

CASE STUDIES · §32

Finance Reflexion deployment

Finance has 580 reflections at 12 months. Top retrieved patterns: "when reconciling a Wise transaction, the counterparty 'Marktr LLC' is always a transfer between accounts, never an income or expense — filter via isInternalTransfer flag" (retrieved 19 times, contributed to 19 successes, zero failures since first retrieval). The pattern emerged from a specific historical failure (a transaction was misclassified as expense in May 2026, captured in a reflection, has not recurred since). This is the cleanest single-reflection causal-effect example in our dataset.

CASE STUDIES · §33

Delivery Reflexion deployment

Delivery has 720 reflections at 12 months. The cross-agent reflection sharing produces particularly strong results here because delivery handoffs come from multiple upstream agents (sales, setting, organization). Cross-agent retrieval surfaces patterns like "when a sales-to-delivery handoff includes a discounted-price agreement, the delivery onboarding must verify the discount is compatible with the standard onboarding template before kickoff" (cross-retrieved 23 times across the sales-and-delivery boundary).

LIMITATIONS · §34

Limitations

(a) Our 6-month / 12-month measurements are from a single workspace (Madani); multi-workspace replication is needed. (b) The 88% asymptote is forward-projected from observed exponential decay of marginal improvement; the projection has wide confidence intervals beyond the 12-month measurement window. (c) The reflection-quality rubric is internally calibrated; external calibration via independent raters should be the next validation step. (d) Cross-agent reflection sharing requires shared trust across agents; in adversarial multi-agent contexts this assumption may break. (e) The hybrid retrieval architecture requires an embedding model and a cross-encoder; this introduces external dependencies the Reflexion result is conditioned on. (f) The post-failure-vs-post-success 3.4× differential is observed in our task distribution; different task distributions may produce different ratios. (g) The 11 echo-chamber patterns caught by quarterly human review represent the patterns we caught; some may have gone undetected.

LIMITATIONS · §35

On causal attribution

The task-success improvement is measured on uncontrolled production data, not on a randomized A/B trial. We cannot formally rule out that the improvement is partly driven by factors other than Reflexion (concurrent model updates, team learning, task-distribution drift). We argue the improvement is largely Reflexion-attributable because: (a) the repeat-failure rate is the cleanest causal signal and drops 0.34→0.07, which is hard to explain via non-Reflexion factors; (b) the improvement is concentrated in weeks 2-8 (when reflection accumulation is fastest), consistent with the Reflexion mechanism; (c) an incidental ablation, in which we briefly disabled retrieval for an unrelated debugging reason, showed an immediate quality drop on tasks with applicable reflections. None of these is a rigorous RCT; the case is suggestive but not definitive.

FUTURE WORK · §36

Future work

(1) Multi-workspace replication of Reflexion deployment patterns (3 collaborators committed). (2) Reflection-driven skill discovery as a primary skill-creation channel — we observe ~12% of new skills now originate from reflection-triggered gap detection. (3) Long-term (24+ months) stability study to confirm asymptotic behavior. (4) RCT-style validation: randomized A/B trial of Reflexion vs no-Reflexion on matched task pairs. (5) Cross-model comparison: does Reflexion behavior differ across Claude, GPT, Gemini backbones? (6) The reflection-cost vs reflection-benefit curve: what is the optimal reflection volume per department, given retrieval costs?

IMPLEMENTATION PLAYBOOK · §37

Adopting Reflexion

STEP 1 · ADAPTER DESIGN. Define the three components (post-task reflection trigger, reflection-memory store, pre-task reflection-recall) per §8. Adopt the metacognition-gated trigger from §25.
STEP 2 · STORAGE FORMAT. Use append-only Markdown files with structured tags. Avoid vendor-specific vector databases (portability per WSB-08).
STEP 3 · RETRIEVAL. Implement hybrid retrieval (BM25 + dense + cross-encoder rerank). Naive cosine similarity is insufficient.
STEP 4 · DECAY POLICY. 90-day-no-retrieval archive + promotion of repeatedly-retrieved old reflections.
STEP 5 · QUARTERLY HUMAN REVIEW. Review top-100 most-retrieved reflections per department for echo chambers and deprecate as needed.
STEP 6 · INTEGRATION WITH METACOG. Combine with WSB-06 metacognition for the full CSCL architecture.
STEP 7 · CROSS-AGENT SHARING (advanced). For multi-domain tasks, enable cross-agent reflection retrieval. Start with read-only sharing; consider write-shared reflections after 6 months of validated read-only behavior.

IMPLEMENTATION PLAYBOOK · §38

Anti-patterns we observed

ANTI-PATTERN 1 · "REFLECT ON EVERY TASK". Triggers reflection fatigue; the agent learns to produce boilerplate. Gate triggers via metacognition.
ANTI-PATTERN 2 · "STORE EVERY REFLECTION FOREVER". Without decay, the store becomes noisy and retrieval degrades. Implement the 90-day-no-retrieval archive.
ANTI-PATTERN 3 · "USE NAIVE COSINE RETRIEVAL". Precision@5 of 0.51 is below operational threshold; switch to hybrid retrieval.
ANTI-PATTERN 4 · "SKIP HUMAN REVIEW". Echo chambers form and self-reinforce without periodic human audit; budget 4 hours per quarter per department for top-100 review.
ANTI-PATTERN 5 · "OBSESS OVER STORAGE FORMAT". The storage decision is second-order; retrieval discipline is first-order.
ANTI-PATTERN 6 · "DEPLOY REFLEXION WITHOUT METACOG". Reflexion without confidence-gating produces reflection fatigue and noisy stores; the two primitives should be co-deployed.

DISCUSSION · §39

Implications for academic-practitioner interface

The Reflexion-deployment-gap finding (only 3 of 47 audited pilots had any Reflexion-style memory despite the paper being widely cited) is a specific instance of a broader pattern: high-citation academic papers do not automatically translate to high-adoption production practices. The translation requires operational engineering effort that academic papers typically do not include. This pattern recurs across the WSB series (e.g., WSB-09 §27 on memory compaction generally, WSB-10 on multi-agent anti-patterns despite Cognition's steel-man being public).
The implication for the academic-practitioner interface: papers should aim to include operational adapter pseudo-code or pseudo-specifications alongside the core architecture; or alternatively, a complementary publication norm of "implementation papers" (like this one) that translate academic results into production reference implementations would close the gap. We are deliberately publishing the WSB series in the implementation-paper style for this reason.

DISCUSSION · §40

Integration with skill discovery

We observed an unexpected interaction with the Madani skill-discovery system (autoresearch-madani · WSB-14). Reflections frequently surface gaps in current skill coverage ("I failed because there is no skill for X"). We are now routing these gaps into the skill-creation pipeline automatically, producing what we call reflexion-driven skill genesis.
Early results suggest ~12% of new skills now originate from reflection-triggered gap detection, materially shortening the "I see a need" to "I have a skill" loop. This is an emergent integration not anticipated in the original Reflexion paper but follows naturally from the production architecture.

EXTENDED METHODS · §41

Reflection-quality rubric scoring details

The 4-criterion rubric (specificity, actionability, transferability, accuracy) is scored on a 5-level scale (0, 0.25, 0.5, 0.75, 1) per criterion. Raters were trained on a 30-reflection calibration set with anchor examples for each level. Anchor examples include: SPECIFICITY level 1.0 ("when a prospect has previously responded to a competitor's outreach in Q4, prioritize differentiating-value messaging over feature-list messaging in subsequent outreach"); SPECIFICITY level 0.25 ("communication should be tailored").
ACTIONABILITY level 1.0 ("on sales-to-delivery handoffs with discounted-price agreements, verify discount compatibility with standard onboarding template before kickoff using the discount-compatibility checker"); ACTIONABILITY level 0.25 ("verify things before kickoff"). Inter-rater agreement after calibration: Cohen's κ = 0.74 across all 4 criteria. Per-criterion κ: 0.78 specificity, 0.81 actionability, 0.65 transferability, 0.72 accuracy.
Transferability is the hardest to score because it requires hypothetical reasoning about future tasks.
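For readers who want to reproduce the agreement statistic, the following is a minimal sketch of unweighted Cohen's κ for two raters. It assumes the five rubric levels are treated as nominal categories; the text does not say whether a weighted variant was used, and the function name and sample scores are illustrative only. As an arithmetic check, the unweighted mean of the four per-criterion values above is 0.74, which matches the reported aggregate, although the aggregation method is not stated.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same level by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal frequency per level.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)

    if p_expected == 1.0:
        # Degenerate case: both raters used a single identical level throughout.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical scores on the 5-level scale (0, 0.25, 0.5, 0.75, 1) for four reflections.
scores_a = [1.0, 0.25, 0.5, 0.5]
scores_b = [1.0, 0.25, 0.5, 0.75]
kappa = cohens_kappa(scores_a, scores_b)

# Per-criterion kappas reported in the text, in the order listed there.
reported = [0.78, 0.81, 0.65, 0.72]
mean_reported = sum(reported) / len(reported)
```

Because the rubric levels are ordered, a weighted κ would give partial credit for near-misses (0.5 vs 0.75); the unweighted form shown here is the stricter of the two.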

EXTENDED METHODS · §42

Metacognition integration details

The metacognition primitive (Wang & Shu, MetaCogAgent, arXiv:2605.17292v1) computes a pre-task composite confidence score on a 0-1 scale, integrating the verbalized confidence assessment (c^v) with the historical capability profile (c^p) using the formula c_composite = λ·c^v + (1-λ)·c^p with λ = 0.6. We use c_composite < 0.85 as the threshold for triggering Reflexion post-task. The 0.85 threshold was selected empirically: lower thresholds produced reflection fatigue (too many trivial reflections); higher thresholds missed informative reflections on borderline tasks. The threshold can be tuned per department; voice-channel (with its sub-second latency budget) uses 0.7 instead of 0.85 to further reduce reflection volume.
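The composite score and the threshold gate can be sketched in a few lines. This is an illustration under stated assumptions, not the adapter's actual code: the function names, the department keys, and the per-department lookup table are hypothetical, while λ = 0.6 and the 0.85 / 0.7 thresholds come from the text. It covers only the confidence criterion; the failure and novelty triggers mentioned elsewhere in the paper are not modeled.

```python
LAMBDA = 0.6              # weight on verbalized confidence c^v, per the text
DEFAULT_THRESHOLD = 0.85  # default trigger threshold, per the text

# Hypothetical per-department overrides; only the voice-channel value is from the text.
DEPARTMENT_THRESHOLDS = {"voice-channel": 0.70}


def composite_confidence(c_v: float, c_p: float, lam: float = LAMBDA) -> float:
    """c_composite = lam * c_v + (1 - lam) * c_p, with all inputs on a 0-1 scale."""
    for name, value in (("c_v", c_v), ("c_p", c_p), ("lam", lam)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return lam * c_v + (1.0 - lam) * c_p


def should_reflect(c_v: float, c_p: float, department: str) -> bool:
    """Trigger post-task reflection when composite confidence is below the threshold."""
    threshold = DEPARTMENT_THRESHOLDS.get(department, DEFAULT_THRESHOLD)
    return composite_confidence(c_v, c_p) < threshold


# Worked example: c^v = 0.9, c^p = 0.7 gives 0.6*0.9 + 0.4*0.7 = 0.82.
score = composite_confidence(0.9, 0.7)
finance_triggers = should_reflect(0.9, 0.7, "finance")        # 0.82 < 0.85
voice_triggers = should_reflect(0.9, 0.7, "voice-channel")    # 0.82 >= 0.70
```

The same task therefore triggers a reflection under the default threshold but not under the voice-channel threshold, which is how the lower value reduces reflection volume.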

EXTENDED METHODS · §43

Reflection-file structure

Each reflection is stored as a Markdown file with the path structure: 'memory/reflexions/YYYY-MM/DD-task-id.md'. The file content has a structured header (YAML frontmatter) and body (Markdown prose). Header fields: task_type, outcome (success/partial/failure), confidence_at_start, model_used, salience_score (computed asynchronously after 14 days based on retrieval count), department, related_skills, tags.
Body fields (in order): What happened? Why? What rule should I remember?
The 3-question structure was empirically derived; alternative structures (5-question, 2-question, free-form) produced lower-quality reflections in our pilot tests.
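The layout above can be sketched as a small renderer. Several details are assumptions made for illustration: that the task identifier fills the "task-id" slot in the path, that the three body questions are rendered as second-level Markdown headings, and that frontmatter values are emitted as JSON scalars (which YAML 1.2 parsers accept). The class and function names, and all sample values, are hypothetical.

```python
import json
from dataclasses import dataclass, field
from datetime import date
from pathlib import PurePosixPath
from typing import List, Optional

OUTCOMES = {"success", "partial", "failure"}


@dataclass
class Reflection:
    task_id: str
    created: date
    task_type: str
    outcome: str
    confidence_at_start: float
    model_used: str
    department: str
    what_happened: str
    why: str
    rule: str
    related_skills: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    salience_score: Optional[float] = None  # filled in asynchronously after 14 days

    def __post_init__(self) -> None:
        if self.outcome not in OUTCOMES:
            raise ValueError(f"outcome must be one of {sorted(OUTCOMES)}")


def reflection_path(r: Reflection) -> PurePosixPath:
    """Build memory/reflexions/YYYY-MM/DD-<task id>.md from the creation date."""
    return PurePosixPath("memory/reflexions") / f"{r.created:%Y-%m}" / f"{r.created:%d}-{r.task_id}.md"


def render_reflection(r: Reflection) -> str:
    """Render the frontmatter header followed by the three-question body."""
    header = {
        "task_type": r.task_type,
        "outcome": r.outcome,
        "confidence_at_start": r.confidence_at_start,
        "model_used": r.model_used,
        "salience_score": r.salience_score,
        "department": r.department,
        "related_skills": r.related_skills,
        "tags": r.tags,
    }
    lines = ["---"]
    lines += [f"{key}: {json.dumps(value)}" for key, value in header.items()]
    lines += [
        "---", "",
        "## What happened?", r.what_happened, "",
        "## Why?", r.why, "",
        "## What rule should I remember?", r.rule, "",
    ]
    return "\n".join(lines)


example = Reflection(
    task_id="t-123", created=date(2026, 5, 3), task_type="reconciliation",
    outcome="failure", confidence_at_start=0.62, model_used="example-model",
    department="finance", what_happened="A transfer was misclassified.",
    why="The counterparty rule was missing.", rule="Check the counterparty list first.",
    tags=["reconciliation"],
)
path = reflection_path(example)
text = render_reflection(example)
```

Writing the header by hand keeps the sketch free of third-party dependencies; a production adapter would more likely use a YAML library for both writing and parsing.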

EXTENDED METHODS · §44

Decay policy mechanics

The decay policy operates as follows. Each retrieval of a reflection updates a "last_retrieved" timestamp on the reflection's metadata. Each calendar day, a background job examines all reflections and applies two rules:
(a) if last_retrieved is more than 30 days ago AND creation date is more than 90 days ago, mark as "archived" (set archived=true). Archived reflections remain searchable but are down-ranked by factor 0.3 in default retrieval.
(b) if last_retrieved is more recent than 7 days AND retrieval count is >= 5 in the past 30 days, mark as "promoted" (set promoted=true, multiplier 1.5 in default retrieval).
The background job is idempotent (re-running it produces the same state) and can be paused without side effects. The policy was selected empirically; aggressive variants (60-day-no-retrieval archive) lost useful reflections; conservative variants (180-day-no-retrieval archive) let the active store grow too large.
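A minimal sketch of the daily job, following the two rules exactly as stated in this section. The text leaves several behaviors unspecified, so the following are assumptions: a never-retrieved reflection is aged from its creation date, flags are only ever set and never cleared by the job, and an archived reflection takes the 0.3 multiplier even if it is also flagged as promoted. All names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ReflectionMeta:
    created: datetime
    last_retrieved: Optional[datetime]  # None if never retrieved (assumption)
    retrievals_last_30d: int
    archived: bool = False
    promoted: bool = False


def apply_decay(meta: ReflectionMeta, now: datetime) -> ReflectionMeta:
    """Apply rules (a) and (b) in place. Only sets flags, so re-running is idempotent."""
    last = meta.last_retrieved or meta.created  # assumption: fall back to creation date

    # Rule (a): not retrieved for more than 30 days and older than 90 days.
    if now - last > timedelta(days=30) and now - meta.created > timedelta(days=90):
        meta.archived = True

    # Rule (b): retrieved within the last 7 days and at least 5 times in the past 30 days.
    if now - last < timedelta(days=7) and meta.retrievals_last_30d >= 5:
        meta.promoted = True
    return meta


def rank_multiplier(meta: ReflectionMeta) -> float:
    """Score multiplier for default retrieval; archived takes precedence (assumption)."""
    if meta.archived:
        return 0.3
    if meta.promoted:
        return 1.5
    return 1.0


now = datetime(2026, 5, 20)
stale = apply_decay(ReflectionMeta(now - timedelta(days=120), now - timedelta(days=45), 0), now)
hot = apply_decay(ReflectionMeta(now - timedelta(days=200), now - timedelta(days=2), 6), now)
fresh = apply_decay(ReflectionMeta(now - timedelta(days=10), None, 0), now)
```

Deriving both flags from timestamps and counts, rather than from the previous run's output, is what makes the job safe to re-run or pause: skipping a day only delays a flag, it does not change which flags are eventually set.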

DISCUSSION · §45

Reflexion as a specific case of WSB-09 compaction

Reflexion is one specific instantiation of the general memory-compaction intervention from WSB-09. The general compaction intervention says: periodically replace granular turn-by-turn context with structured summaries. Reflexion specializes this in two ways: (a) the compaction is task-bounded rather than turn-bounded (one reflection per non-trivial task, vs WSB-09's general "every 50 turns"); (b) the compaction format is specifically lessons-learned-oriented (the 3-question structure), vs WSB-09's general "preserve task state, decisions, learnings, open questions".
The Reflexion specialization is appropriate for the post-task reflection use case; the general WSB-09 compaction is appropriate for in-session memory management. Both can be deployed simultaneously; they operate on different cadences and produce different memory artifacts.

DISCUSSION · §46

Why the metacog gate matters

Without the metacognition gate, Reflexion triggers fire on every task and produce reflection fatigue. The agent learns to write reflexive boilerplate that adds no signal. With the metacognition gate (only failed tasks, low-confidence tasks, novel tasks), Reflexion fires selectively on tasks where reflection is actually informative.
We measured the difference: ungated reflection produces reflections rated 0.41 on average; gated reflection produces reflections rated 0.68 on average. The gate is the difference between high-volume low-quality and lower-volume high-quality. Quality compounds at the retrieval stage: lower-quality reflections produce noisy retrievals; high-quality reflections produce useful retrievals.

DISCUSSION · §47

The 12% skill-discovery integration

Of the new skills added to the Madani skill system between November 2025 and April 2026, approximately 12% originated from reflection-triggered gap detection. Mechanism: when an agent reflects on a failed task, the reflection sometimes identifies that the failure was due to missing tooling ("I failed because there is no skill for X").
The reflection-memory system flags such reflections; a downstream process (autoresearch-madani per WSB-14) examines flagged gaps weekly and queues skill-creation candidates. The 12% rate is the percentage of new skills that emerged through this pipeline vs other paths (direct request from Nour, observed engineer need, opportunistic skill creation during other work).

DISCUSSION · §48

Comparison with the original Reflexion architecture

The original Reflexion paper (Shinn et al. 2023) describes three components: Actor (produces actions), Evaluator (scores outcomes), Self-Reflection (generates verbal reflections). Our production adapter maintains the same three-component structure but adds operational glue:
(a) the Evaluator is implemented via the metacognition primitive's outcome scoring;
(b) the Self-Reflection module is gated by metacognition confidence;
(c) the memory buffer adds salience-weighted retrieval, decay policy, cross-agent sharing, and quarterly human review.
The original three components are preserved; the additions are operational adapters that turn the academic architecture into a production system.

DISCUSSION · §49

The translation-gap hypothesis

We hypothesize that the translation gap between academic Reflexion (cited >1,200 times) and production Reflexion (adopted in 3 of 47 audited pilots) reflects a general property of academic-practitioner interface: academic papers prove that an architecture can produce a result; production deployment requires a much larger set of operational decisions that the academic paper does not address. The dominant decisions in Reflexion deployment: when to trigger, what to store, how to retrieve, how to decay, how to validate, how to detect drift, how to handle multi-agent coordination. Each decision has multiple defensible options and the wrong choice silently degrades the system.
The implication: academic papers that produce architectural recipes should be paired with implementation papers (like this one) that provide operational defaults. Without the pairing, the academic recipe sits unused in citation counts.

DISCUSSION · §50

The quarterly human review process

The quarterly human review of the top-100 most-retrieved reflections per department takes approximately 4 hours per department per quarter, distributed across 2-3 reviewers. The review process: each reflection is read in context; reviewers identify (a) echo-chamber patterns (reflections that have become self-reinforcing without independent validation), (b) outdated reflections (still being retrieved but referencing obsolete workflows), (c) duplicate reflections (multiple reflections covering the same lesson, candidates for consolidation). Each identified item is tagged with a deprecation reason.
The deprecated reflection is moved to archived state and is no longer retrieved by default. Across 8 departments and 4 quarters of operation, the review process has caught and corrected 11 echo-chamber patterns, 27 outdated reflections, and 43 duplicates. The total human effort is approximately 128 hours/year, justified by the prevention of compounding drift.

EXTENDED CASE STUDY · §51

The finance department Reflexion deep dive

Finance is the department with the cleanest Reflexion-driven improvements because financial tasks have crisp outcome signals (a reconciliation either matches or doesn't). The Reflexion adapter at finance has produced ~580 active reflections at 12 months. Notable reflection clusters: (i) "Marktr LLC counterparty is always internal transfer": captured in May 2026 after a misclassification; retrieved 19 times in subsequent reconciliation tasks; zero misclassifications since the original event. (ii) "Wise outgoing transactions to currency-mismatched recipients carry a 0.4-0.7% FX fee that must be reflected in the source-account ledger separately from the destination credit": captured in June 2026 after an audit discrepancy; retrieved 14 times; corrected behavior on all 14 retrievals. (iii) "Stripe invoice line items with proration require special handling in BigQuery accounting tables: the proration amount must be split across the original and renewed subscription periods": captured in August 2026 after a revenue-reporting error; retrieved 11 times; corrected behavior on all 11. The finance reflections demonstrate the high signal density that crisp-outcome domains produce.

EXTENDED CASE STUDY · §52

The voice-channel department Reflexion adaptation

The voice-channel department has the most constrained Reflexion architecture because of its sub-second latency budget. Pre-task retrieval cannot exceed ~100ms or the agent misses the SLA. We adapted the Reflexion adapter as follows:
(a) lowered the metacog confidence threshold from 0.85 to 0.7, reducing the active retrieval scope
(c) cached the top-50 most-frequently-retrieved reflections per department in memory

References

[1] Shinn N., Cassano F., Berman E., Gopinath A., Narasimhan K., Yao S. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS 2023, arXiv:2303.11366.
[2] Park J. et al. (2023), Generative Agents: Interactive Simulacra of Human Behavior, UIST.
[3] Wang G. et al. (2023), Voyager: An Open-Ended Embodied Agent with Large Language Models, arXiv:2305.16291.
[4] Sumers T. et al. (2024), Cognitive Architectures for Language Agents, TMLR.
[5] Sutton R. & Barto A. (2018), Reinforcement Learning: An Introduction (2nd ed.), MIT Press.
[6] Yao S. et al. (2023), ReAct: Synergizing Reasoning and Acting in Language Models, ICLR.
[7] Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1.
[8] Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems, arXiv:2604.02460.
[9] Cemri M. et al. (2025), Why Do Multi-Agent LLM Systems Fail? (MAST), arXiv:2503.13657v3, NeurIPS 2025 Datasets and Benchmarks Track.
[10] Liu N. et al. (2024), Lost in the Middle, TACL.
[11] Cognition Labs (2025), Don't Build Multi-Agents, cognition.ai blog.
[12] Chen M. et al. (2021), Evaluating Large Language Models Trained on Code (HumanEval), arXiv:2107.03374.
[13] Yang Z. et al. (2018), HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, EMNLP.
[14] Shridhar M. et al. (2021), ALFWorld: Aligning Text and Embodied Environments for Interactive Learning, ICLR.
[15] Madhavan R. et al. (2024), Reflexion-Style Methods: Empirical Survey.
[16] Hwang J. et al. (2024), Tool Learning with Foundation Models.
[17] Anthropic (2024-2025), Building Agents Cookbook.
[18] Anthropic (2025), Claude Sonnet 4.5 Technical Report.
[19] Madani Lab (2026), Cybernetic Self-Correcting Loop Specification v1.0 (open spec).
[20] Madani Lab (2026), Reflexion Adapter Reference Implementation (MIT release).
[21] Madani Lab (2026), Reflection-Quality Rubric v1.2 (open spec).
[22] Madani Lab (2026), 12-Month Reflexion Deployment Dataset (anonymized aggregates, MIT release pending).
[23] Cohen J. (1960), A Coefficient of Agreement for Nominal Scales, Educational and Psychological Measurement 20:37-46.
[24] OpenAI (2024), GPT-4 Technical Report (referenced as HumanEval baseline in Shinn et al.).
