Abstract
We adapt RAGAS — the Retrieval Augmented Generation Assessment framework proposed by Es, James, Espinosa-Anke, and Schockaert (EACL 2024, arXiv:2309.15217, Cardiff University NLP) — for continuous evaluation of long-lived agentic workspaces, extending the original four-metric benchmark-time framework with temporal-dimension metrics that capture failure modes specific to agents whose retrieval store grows over months and years of operation. The original RAGAS paper introduced reference-free automated evaluation of RAG pipelines along four axes — faithfulness, answer relevance, context precision, context recall — and validated the metrics against the WikiEval dataset they constructed, using GPT-3.5 as the LLM-as-judge with reported correlations against human ratings that demonstrate the metrics track human judgment with usable precision but not perfection. The framework has become the de facto standard for RAG evaluation in 2024-2026 with the explodinggradients/ragas open-source library widely deployed. The framework was designed, however, for SHORT-HORIZON RAG — single-query evaluation against a static corpus — and does not capture the temporal dimensions that distinguish production long-lived agents: memory pollution, embedding drift, query drift, recall degradation as the store grows, cross-document signal degradation, latency-quality tradeoffs at production load. This paper documents the Madani adapter: a continuous-evaluation harness running against live agent traffic in 6 of 8 production departments over 6 months, producing 14 fired regression alerts with structured root-cause analysis. We report SEVEN counterintuitive findings
- (c) RAGAS's reliance on LLM-as-judge has the same κ=0.77 limitation as MAST's LLM annotator: both inherit the LLM-judge precision floor; an ensemble of two judges raises agreement to κ=0.86 at 2× cost but does not eliminate disagreement
- (d) Production RAG failures are 2/3 retrieval-side and only 1/3 generation-side, but most teams focus on the generation side because hallucination is more salient; in our 14 fired alerts, 9 traced to retrieval issues, 4 to generation, 1 to systemic infrastructure
- (e) Vector DB latency dominates user-perceived response time: sub-100ms retrieval is the latency floor for conversational interfaces; our pre-optimization production retrieval averaged 340ms p95 and was perceived as "laggy"; post-optimization to 78ms p95 produced a 2.4× increase in agent-completion rate per session even though answer quality was unchanged
- (f) The cross-document signal distinguishes good production RAG from benchmark-good RAG: RAGAS as published does not measure whether the retrieval pulls content from multiple distinct documents; this is the dimension that separates well-curated stores from poorly-curated ones
INTRODUCTION · §1
Why RAGAS mattered
Before RAGAS, RAG evaluation was either expensive (hand-graded human evaluation per query, $5-15 per labeled item) or shallow (BLEU / ROUGE / exact-match metrics that do not capture semantic adequacy). Es et al. (EACL 2024) proposed automated metrics that decompose RAG quality into four axes evaluable by an LLM judge without ground-truth human references for every query. The framework had two structural innovations: (1) reference-free evaluation — no per-query gold answer needed, which removes the dominant cost of RAG QA, and (2) decomposed metrics — faithfulness vs answer-relevance vs context-precision vs context-recall, which let teams identify WHICH dimension of RAG quality is failing rather than getting an aggregated scalar.
The framework has been validated on WikiEval (constructed by Es et al. with queries against Wikipedia and reference answers) and the open-source ragas library is widely deployed. Industry adoption is strong: most production RAG teams use RAGAS or one of its variants (ARES from Saad-Falcon et al. NAACL 2024 being a notable alternative).
INTRODUCTION · §2
The long-lived-agent gap
Es et al.'s framing assumes a static RAG pipeline evaluated at one point in time. For production long-lived agents this framing is incomplete in three structural ways. (a) The store is not static — agents WRITE to memory continuously, adding entries that may be high or low quality. (b) The query distribution is not static — as the agent itself evolves (new skills, Reflexion-driven prompt updates, capability shifts), the queries it submits change. (c) The embedding space is not static — model upgrades change embeddings and break legacy compatibility. None of these are captured by deployment-time evaluation. A RAG pipeline that scored 0.92 on RAGAS at deployment can be at 0.74 six months later via drift mechanisms invisible at any single point.
INTRODUCTION · §3
Contributions
Four contributions. (1) EMPIRICAL: 6-month production deployment of continuous-RAGAS in 6 of 8 Madani departments with 14 root-caused alerts. (2) METHODOLOGICAL: five extension metrics (recall drift, cross-document signal, memory-write quality, embedding-version consistency, query-pattern stability) that capture long-lived-agent failure modes. (3) OPERATIONAL: the continuous-evaluation pipeline as a workspace primitive with sampling-based cost amortization. (4) ARCHITECTURAL: the policy that benchmark-time RAGAS is necessary but not sufficient — production RAG requires continuous evaluation with structured root-cause routing.
[Figure: Automated retrieval eval, 3-stage pipeline. Corpus (N docs) → EMBED (chunk, index) → TOP-K retrieval driven by the query set (47 tasks from log) → SCORE (precision@k, recall@k, MRR, nDCG) against ground truth (relevant docs) → per-tier breakdown.]
RELATED WORK · §4
RAG evaluation literature
The RAG evaluation literature has three lines. (a) HUMAN-EVAL-DRIVEN: gold-standard but expensive and not scalable to production traffic. (b) METRIC-DRIVEN PRE-LLM-AS-JUDGE: BLEU, ROUGE, exact-match — easy to compute but shallow. (c) LLM-AS-JUDGE: RAGAS (Es et al. EACL 2024), ARES (Saad-Falcon et al. NAACL 2024), RAGTruth (Niu et al. 2024). LLM-as-judge dominates current practice because it is automated AND captures semantic adequacy. The trade-off is judge-model precision (typically κ = 0.71-0.81 inter-judge agreement, capping the metric's intrinsic accuracy).
RELATED WORK · §5
Production RAG observability
Adjacent work on production RAG monitoring (LangSmith from LangChain, Phoenix from Arize, RAGAS-OPS) provides observability primitives but typically does not extend the RAGAS metric framework. Our work fills that gap: it extends the metric framework itself to long-lived-agent failure modes rather than adding dashboards over the existing metrics.
RELATED WORK · §6
Memory-management in agent frameworks
The agent-memory literature (MemGPT from Packer et al. 2023, A-Mem from Xu et al. 2025) focuses on what to STORE; we focus on what to RETRIEVE and how retrieval quality degrades over time. Better storage discipline reduces the surface area for retrieval degradation but does not eliminate it.
METHOD · §7
The four original RAGAS metrics
(a) FAITHFULNESS: every claim in the answer can be inferred from the context. Operationalized: extract claims (LLM call), verify each against context (LLM call), score = fraction verifiable.
(b) ANSWER RELEVANCE: the answer addresses the question. Operationalized: have the LLM generate a likely original question from the answer; measure its semantic similarity to the actual question.
(c) CONTEXT PRECISION: the retrieved context is focused. Operationalized: for each retrieved chunk, judge whether it is relevant; precision = relevant / total.
(d) CONTEXT RECALL: the retrieved context includes all relevant information. The original formulation requires a ground-truth answer; we relax this via an LLM-judge assessment of apparent completeness.
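The two ratio-style metrics above reduce to simple fractions once the judge verdicts are in hand. The following is a minimal sketch under stated assumptions: the judge calls themselves are out of scope, `ClaimVerdict` is a hypothetical container for their output, and scoring an empty input as 0.0 is our illustrative choice rather than something the framework prescribes.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    """One claim extracted from the answer, plus the judge's verdict (hypothetical type)."""
    claim: str
    supported: bool  # True if the judge says the claim can be inferred from the context


def faithfulness(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of extracted claims that the judge could verify against the context."""
    if not verdicts:
        return 0.0  # assumption: an answer with no extractable claims scores 0
    return sum(v.supported for v in verdicts) / len(verdicts)


def context_precision(chunk_is_relevant: list[bool]) -> float:
    """relevant / total over the retrieved chunks, as defined in (c)."""
    if not chunk_is_relevant:
        return 0.0  # assumption: an empty retrieval scores 0
    return sum(chunk_is_relevant) / len(chunk_is_relevant)


verdicts = [
    ClaimVerdict("claim 1", True),
    ClaimVerdict("claim 2", False),
    ClaimVerdict("claim 3", True),
    ClaimVerdict("claim 4", True),
]
faithfulness_score = faithfulness(verdicts)
precision_score = context_precision([True, False, True, True, False])
```

Keeping the judge output as explicit per-claim and per-chunk verdicts, rather than a single score, also preserves the example degraded retrievals that the alerting step attaches to each alert.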
METHOD · §8
Madani extension metrics
We added five extensions. (e) RECALL DRIFT: for a fixed canonical-query set, measure context-recall over time; alerts when recall drops >0.10 std over rolling 7-day window. (f) CROSS-DOCUMENT SIGNAL: for each retrieval, measure Shannon entropy of source documents in retrieved chunks. (g) MEMORY-WRITE QUALITY: judge model rates each memory-write candidate; quality < 0.6 = block. (h) EMBEDDING-VERSION CONSISTENCY: verify all stored vectors are from current embedding model; alert on mismatches. (i) QUERY-PATTERN STABILITY: track query-type distribution; alert when distribution shifts significantly (KS test, p=0.01).
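As an illustration of extension (f), the sketch below computes the Shannon entropy of the source-document distribution for one retrieval. Two assumptions are ours and not stated in the text: each retrieved chunk carries a document identifier, and the entropy is normalized by log2 of the chunk count so that it falls in [0, 1].

```python
import math
from collections import Counter


def cross_document_entropy(source_doc_ids: list[str], normalize: bool = True) -> float:
    """Shannon entropy (bits) of the source documents behind one retrieval.

    0.0 means every chunk came from the same document; with normalize=True,
    1.0 means every chunk came from a different document.
    """
    total = len(source_doc_ids)
    if total <= 1:
        return 0.0
    counts = Counter(source_doc_ids)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    if normalize:
        # Assumption: divide by the maximum achievable entropy for this many chunks.
        entropy /= math.log2(total)
    return entropy


concentrated = cross_document_entropy(["doc_a", "doc_a", "doc_a", "doc_a"])
diverse = cross_document_entropy(["doc_a", "doc_b", "doc_c", "doc_d"])
mixed = cross_document_entropy(["doc_a", "doc_a", "doc_a", "doc_b"])
```

A retrieval dominated by one document scores near 0, which is the concentration pattern the metric is meant to surface.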
METHOD · §9
Instrumentation
We modified the Madani retrieval pipeline to log every retrieval event with metadata: query, chunks, source documents, similarity scores, latency, model versions, timestamp, department, agent_id, task_id. Sampling-based RAGAS evaluation: 5% of retrievals are randomly sampled. The judge model (Claude Sonnet 4.5 primary, GPT-4o ensemble for top-1% candidates) evaluates the 4 original + 5 extension metrics.
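The logging record and the sampling decision can be sketched as follows. The field names mirror the metadata listed above; the class names, the types, and the use of a seeded pseudo-random generator are illustrative assumptions, not the production implementation.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetrievalEvent:
    """One logged retrieval, carrying the metadata fields listed in the text."""
    query: str
    chunks: list[str]
    source_documents: list[str]
    similarity_scores: list[float]
    latency_ms: float
    model_versions: dict[str, str]
    timestamp: float
    department: str
    agent_id: str
    task_id: str


class EvalSampler:
    """Selects a random fraction of retrieval events for judge evaluation."""

    def __init__(self, rate: float = 0.05, seed: Optional[int] = None) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self._rng = random.Random(seed)  # seeded only to make runs reproducible

    def should_evaluate(self) -> bool:
        return self._rng.random() < self.rate


sampler = EvalSampler(rate=0.05, seed=0)
sampled_fraction = sum(sampler.should_evaluate() for _ in range(20_000)) / 20_000
```

Every event is logged regardless of the sampling decision; only the sampled subset incurs judge cost.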
METHOD · §10
Alerting
Regression alerts fire if any metric drops >0.10 std over rolling 7-day window for any (department, query-type) combination. Alerts include: the metric that dropped, example degraded retrievals, candidate root causes. We ran this for 6 months and recorded 14 fired alerts.
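One way to read the alert rule is sketched below. The text does not spell out the exact computation, so this is an assumption: the latest daily mean is compared with the mean of the preceding 7 daily means, and the drop is expressed in units of that window's standard deviation. The function would be called once per (department, query-type) series.

```python
from statistics import mean, stdev


def regression_alert(daily_scores: list[float], window: int = 7,
                     threshold_std: float = 0.10) -> bool:
    """Return True if the latest daily score dropped more than threshold_std
    standard deviations below the mean of the preceding `window` days."""
    if window < 2 or len(daily_scores) < window + 1:
        return False  # not enough history to form a baseline
    baseline = daily_scores[-(window + 1):-1]
    latest = daily_scores[-1]
    sigma = stdev(baseline)
    if sigma == 0:
        # Assumption: with a perfectly flat baseline, any drop counts as a regression.
        return latest < baseline[0]
    drop_in_std = (mean(baseline) - latest) / sigma
    return drop_in_std > threshold_std


stable = [0.80, 0.81, 0.79, 0.80, 0.82, 0.80, 0.81, 0.81]
degraded = [0.80, 0.81, 0.79, 0.80, 0.82, 0.80, 0.81, 0.70]
stable_fires = regression_alert(stable)
degraded_fires = regression_alert(degraded)
```

Expressing the threshold in standard-deviation units rather than absolute score units is what lets one rule serve metrics with different noise levels.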
METHOD · §11
Root-cause protocol
Structured investigation per alert: (a) inspect degraded retrievals, (b) check correlation with infrastructure events, (c) test hypothesis fixes against held-out validation queries, (d) document the root cause with reproduction steps.
RESULTS · §12
Alert distribution by root cause
The 14 alerts decomposed as: 6 FAITHFULNESS-DEGRADATION-FROM-MEMORY-POLLUTION (agent wrote low-quality reflection or partial output to memory; subsequent retrievals surfaced these and pulled faithfulness down). 3 CONTEXT-RECALL-DEGRADATION-FROM-EMBEDDING-DRIFT (we upgraded text-embedding-3 to text-embedding-3-large; legacy documents embedded in old model produced lower recall). 5 ANSWER-RELEVANCE-DEGRADATION-FROM-QUERY-DRIFT (agent prompt updates changed query patterns; retrieval index optimization did not follow).
Retrieval eval · 47 tasks / 186 docs
RESULTS · §13 · COUNTERINTUITIVE FINDING 1 · RAGAS DESIGNED FOR SHORT-HORIZON RAG. The four original axes capture single-shot retrieval quality. They do not capture temporal degradation: a system that scores 0.92 today and 0.74 in six months looks "fine" on any single evaluation.
The recall-drift extension is the load-bearing addition for long-lived agents. In our 6-month data, recall drift was the alerting metric for 4 of 14 alerts that would not have fired on the original 4-axis framework alone.
RESULTS · §14 · COUNTERINTUITIVE FINDING 2 · CONTEXT-PRECISION DECAYS WITH STORE GROWTH. Over 6 months the Madani retrieval store grew from ~2,000 to ~14,000 items. We measured context-precision on a fixed canonical-query suite at monthly intervals: month 0 = 0.81, month 1 = 0.79, month 2 = 0.76, month 3 = 0.73, month 4 = 0.70, month 5 = 0.66, month 6 = 0.63.
Monotonic and roughly linear in log(store size). The mechanism: as the store grows, more entries match any given query embedding (some legitimately, some spuriously); precision decays. The standard per-query RAGAS never alerted because the degradation was distributed — every query slightly worse, no single one catastrophic.
The recall-drift extension caught it via metric trajectory rather than point value.
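The trajectory can be checked with a small least-squares fit over the monthly values reported above. Because per-month store sizes are not reported, this sketch fits precision against month index rather than against log(store size); it illustrates the trend check and does not reproduce the log-linear claim.

```python
months = [0, 1, 2, 3, 4, 5, 6]
precision = [0.81, 0.79, 0.76, 0.73, 0.70, 0.66, 0.63]  # values reported in the text


def least_squares(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = least_squares(months, precision)
is_monotonic = all(later < earlier for earlier, later in zip(precision, precision[1:]))
print(f"slope per month: {slope:.4f}, intercept: {intercept:.4f}, monotonic: {is_monotonic}")
```

The fitted slope is roughly -0.03 per month: small enough that no single per-query evaluation looks alarming, yet it accumulates to the 0.18 drop seen over six months.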
RESULTS · §15 · COUNTERINTUITIVE FINDING 3 · LLM-AS-JUDGE PRECISION FLOOR. We tested 4 judge models on a held-out 200-query human-annotated test set: Claude Sonnet 4.5 (κ = 0.78 vs human consensus), Claude Opus 4.7 (κ = 0.81), GPT-4o (κ = 0.74), Gemini 1.5 Pro (κ = 0.69). The Sonnet baseline matches the broader literature: LLM judges achieve κ = 0.71-0.81 vs human consensus, with no judge approaching κ = 0.90 single-model.
This is the SAME precision floor that limits MAST (WSB-07: Cemri et al. report κ = 0.77 for their o1 annotator). The ensemble approach raises agreement to κ = 0.86 at 2× cost. The ceiling exists because LLM judgment of subjective semantic relevance is inherently noisy at the boundary.
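For readers who want to reproduce the agreement figures on their own annotation sets, here is a minimal sketch of Cohen's κ between a judge and a human consensus label. The text does not say which κ variant was used, so Cohen's κ is an assumption, and the labels below are toy data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label]
                   for label in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


judge = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]  # toy judge verdicts
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]  # toy human consensus
kappa = cohens_kappa(judge, human)
```

In this toy example, raw agreement is 0.8 but κ is 0.6, because half of the agreement would be expected by chance with balanced labels.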
RESULTS · §16 · COUNTERINTUITIVE FINDING 4 · 2/3 OF FAILURES ARE RETRIEVAL-SIDE. Of the 14 fired alerts, 9 traced to retrieval-side issues. 4 traced to generation-side. 1 to infrastructure. Yet the popular RAG-improvement literature is heavily generation-focused (hallucination mitigation, prompt engineering for context use). The 2/3 retrieval-side concentration in our production data suggests the field's attention is mis-allocated.
RESULTS · §17 · COUNTERINTUITIVE FINDING 5 · VECTOR DB LATENCY MATTERS MORE THAN ACCURACY. Pre-optimization, the Madani vector DB averaged 340ms p95 retrieval latency. Post-optimization (HNSW index tuning, reduced top-k from 20 to 10, switched embedding model, removed unnecessary metadata filters), 78ms p95.
Answer quality unchanged across A/B comparison. User-perceived agent responsiveness improved dramatically: agent-completion rate per session increased 2.4× (80% to 96%). The implication: in conversational interfaces, retrieval latency below ~100ms is the floor below which user perception of agent quality is dominated by other factors; above ~100ms, latency dominates perception.
Most RAG papers report accuracy in isolation; production reality is latency-bounded accuracy.
RESULTS · §18 · COUNTERINTUITIVE FINDING 6 · CROSS-DOCUMENT SIGNAL IS THE QUALITY DISCRIMINATOR. We measured cross-document Shannon entropy across 6 production retrieval pipelines and 4 benchmark pipelines (WikiEval, HotpotQA-style multi-hop, internal benchmark). Production averaged entropy 0.62; benchmark averaged 0.83.
Benchmark pipelines pull from highly-varied source documents; production pipelines often concentrate on a single dominant source. High cross-document entropy correlates with answer quality (r = 0.71 across our retrieval samples) more strongly than the original four metrics individually (r = 0.55-0.68 each). Cross-document signal should be a first-class metric in any production retrieval evaluation framework.
RESULTS · §19 · COUNTERINTUITIVE FINDING 7 · DEPLOYMENT-TIME EVAL IS A SNAPSHOT. We compared deployment-time RAGAS scores vs current production scores (6 months later). Median deployment: 0.87.
Median current: 0.71. The 0.16 absolute degradation is invisible from any single point measurement and only surfaces from time-series monitoring. Treating deployment-time RAGAS as the operational reality is the dominant failure mode for production RAG deployment.
DISCUSSION · §20
Continuous eval as workspace primitive
The Madani workspace policy now requires WAB Pillar 01 (Context) L3 maturity to include sampled continuous RAGAS with alerting and root-cause-routing. Cost: ~$0.04 per sampled retrieval at 5% sampling and typical traffic = ~$3-8/day per department, dramatically cheaper than the failure modes it prevents.
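The cost figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated above; the per-department retrieval volumes it derives are implied by those numbers and are not separately reported.

```python
COST_PER_SAMPLE_USD = 0.04  # judge cost per sampled retrieval (from the text)
SAMPLING_RATE = 0.05        # fraction of retrievals evaluated (from the text)


def daily_judge_cost(retrievals_per_day: float) -> float:
    """Expected judge spend per day for a given retrieval volume."""
    return retrievals_per_day * SAMPLING_RATE * COST_PER_SAMPLE_USD


def implied_daily_retrievals(daily_cost_usd: float) -> float:
    """Retrieval volume implied by a given daily judge spend."""
    return daily_cost_usd / (SAMPLING_RATE * COST_PER_SAMPLE_USD)


low_volume = implied_daily_retrievals(3.0)
high_volume = implied_daily_retrievals(8.0)
```

Under these figures, $3-8/day corresponds to roughly 1,500 to 4,000 retrievals per department per day, of which 75 to 200 are judged.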
DISCUSSION · §21
The four RAGAS axes are not symmetric
Faithfulness regressions: workspace-actionable (fix memory store gates). Answer-relevance: agent-actionable (fix prompts). Context-precision: retrieval-infra-actionable.
Context-recall: data-coverage-actionable. The decomposition routes alerts without manual triage.
DISCUSSION · §22
Judge-model dependency
Two implications: (a) when the judge model itself is upgraded, absolute metric values shift even if retrieval quality is unchanged, so we re-baseline on each judge upgrade; (b) an ensemble approach (two judges per evaluation) is recommended for high-stakes evaluations, and we use it for the top-1% most consequential alerts.
DISCUSSION · §23
Sampling-rate selection
The 5% sampling rate is our practical sweet spot. For workspaces 10× our traffic, drop to 1% with tighter alert sensitivity. For 100× traffic, 0.5% sampling with learned-classifier judge (not LLM) becomes economical.
DISCUSSION · §24
Comparison with ARES
Saad-Falcon et al.'s ARES (NAACL 2024) trains a small classifier judge against synthetic data, achieving lower per-eval cost than a full LLM judge. We piloted ARES at Madani and found that it is cheaper (~$0.003/eval vs $0.04 for LLM-judge), requires per-domain training data, and does not naturally extend to our 5 long-lived-agent extension metrics. For our scale, LLM-judge is the right choice; a hybrid (ARES for routine evaluation, LLM-judge for alerts) is appropriate at higher scale.
DISCUSSION · §25
Memory-write gates as upstream defense
The 6 of 14 alerts traced to memory pollution motivated adding write-quality gates upstream. A Claude Haiku judge rates each memory-write candidate; writes below 0.6 are blocked. Post-deployment, faithfulness regressions of pollution type dropped to zero across 8 weeks.
Continuous retrieval evaluation surfaces failure modes; upstream write-quality gates prevent them. Both layers needed.
DISCUSSION · §26 · INTEGRATION WITH GOVERNANCE (WSB-15). Continuous-evaluation alerts are themselves a governance signal: a fired alert is evidence of a workspace-level issue requiring response. We integrated alerts with the governance-as-code framework by routing high-severity alerts through the same audit-trail as compliance gate decisions.
DISCUSSION · §27
Reconciling the precision floor
Es et al. acknowledge the LLM judge introduces noise; our κ = 0.78 vs human consensus reproduces their finding. Trajectory shape (improving / stable / degrading) is more reliable than absolute level. Alert thresholds in std units (drop >0.10 std from rolling mean) normalize out the judge-precision floor.
LIMITATIONS · §28
Limitations
(a) Judge-model dependency means RAGAS scores partly reflect judge discrimination ability. (b) 5% sampling misses tail events (~5% chance of regression missed in any given week). (c) Alert thresholds were chosen heuristically. (d) The five extensions are Madani-tuned; transfer requires recalibration. (e) The reference implementation primarily targets the Anthropic SDK. (f) Cross-document entropy depends on document-level provenance metadata.
FUTURE WORK · §29
Future work
(1) LEARNED-CLASSIFIER JUDGES for cost reduction at scale. (2) ANOMALY-DETECTION ALGORITHMS tailored to RAGAS time series. (3) PUBLIC BENCHMARK SUITE for retrieval evaluation primitives. (4) CROSS-VENDOR JUDGE-MODEL CONSISTENCY. (5) PRODUCTION-SCALE STUDY OF RECALL DRIFT extending 6-month to 24-month measurement.
CASE STUDY · §30 · ALERT #4 · MEMORY POLLUTION. Background: Reflexion (WSB-09) writes reflections after each session. Alert: faithfulness dropped 0.13 std over 4 days in sales.
Investigation: 23 reflections from a weekend's sessions were written despite errors — Reflexion bug wrote partial reflection rather than skipping. Fix: memory-write gate added; 23 pollutants deleted; Reflexion bug fixed (skip-on-error). Recovery: 3 days.
CASE STUDY · §31 · ALERT #7 · EMBEDDING UPGRADE. Background: upgraded text-embedding-3 to text-embedding-3-large. Alert: context-recall dropped 0.21 std over 24 hours across departments.
Investigation: only NEW retrievals used the new model; ~9,000 existing vectors in old-model space caused poor cross-model matches. Fix: backfilled 9,000 vectors overnight (~$45). Post-backfill: recall recovered AND IMPROVED 0.07 above pre-upgrade baseline.
CASE STUDY · §32 · ALERT #11 · QUERY DRIFT FROM SKILL ADDITION. Background: new skill (cold-email-outreach) added to sales agent. Alert: answer-relevance dropped 0.16 std over 9 days.
Investigation: new skill asked for "outreach templates" while store had "email templates" and "sequence templates"; embedding similarity lower. Fix: added "outreach" as synonym in indexing; tightened skill prompt; query patterns now monitored on stability metric.
CASE STUDY · §33 · ALERT #13 · CROSS-DOCUMENT CONCENTRATION. Background: routine traffic. Alert: cross-document entropy dropped 0.18 std in delivery over 2 weeks.
Investigation: a 40-page delivery template indexed as ~120 chunks concentrated retrievals on one document. Fix: reduced chunk granularity, document-level rate-limiting (max 30% from single doc). Recovery: entropy recovered, faithfulness improved 0.04.
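The document-level rate limit in the fix above can be sketched as a post-ranking filter. This is an illustrative reading, not the deployed code: it assumes retrieval returns (chunk_id, doc_id) pairs in descending similarity order and that the 30% cap is applied to the final top-k.

```python
import math
from collections import defaultdict


def cap_per_document(ranked: list[tuple[str, str]], k: int,
                     max_share: float = 0.30) -> list[tuple[str, str]]:
    """Pick up to k chunks in rank order, letting no single document
    contribute more than max_share of the k slots."""
    # Small epsilon guards against float round-off in k * max_share.
    per_doc_limit = max(1, math.floor(k * max_share + 1e-9))
    taken: dict[str, int] = defaultdict(int)
    selected: list[tuple[str, str]] = []
    for chunk_id, doc_id in ranked:
        if taken[doc_id] >= per_doc_limit:
            continue  # this document has used up its share; skip to the next chunk
        selected.append((chunk_id, doc_id))
        taken[doc_id] += 1
        if len(selected) == k:
            break
    return selected


ranked = [(f"t{i}", "delivery_template") for i in range(8)]
ranked += [(f"{doc}{i}", doc) for doc in ("a", "b", "c") for i in range(3)]
top_k = cap_per_document(ranked, k=10)
template_count = sum(1 for _, doc in top_k if doc == "delivery_template")
```

With k=10 the dominant template contributes at most 3 chunks, and the remaining slots go to the next-best chunks from other documents.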
IMPLEMENTATION PLAYBOOK · §34
Adopting continuous-RAGAS
STEP 1 INSTRUMENT. STEP 2 SAMPLE (5% sweet spot). STEP 3 DEPLOY JUDGE (Claude Sonnet 4.5; ~$0.04/sample).
STEP 4 DEFINE METRICS (4 original + 5 extensions over first month). STEP 5 BASELINE (2 weeks without alerting). STEP 6 ALERT (0.10 std on rolling 7-day window).
STEP 7 ROOT-CAUSE (structured protocol). STEP 8 UPSTREAM DEFENSE (memory-write gates, query monitoring, embedding-version checks).
IMPLEMENTATION PLAYBOOK · §35
Anti-patterns
(1) "RAGAS ONCE AT DEPLOY": misses all temporal drift. (2) "AGGREGATE SCALAR": averaging axes collapses diagnostic signal. (3) "NO BASELINE": alerting on absolute thresholds rather than std-from-rolling-mean produces noise. (4) "GENERATION-SIDE TUNNEL VISION": missing 2/3 retrieval-side failures. (5) "BENCHMARK-EVAL BLIND SPOT": high WikiEval score; production 30% lower; team unaware. (6) "JUDGE-MODEL FORGOTTEN": judge upgraded without re-baselining; metric values shift; false alerts.
OPEN RESEARCH FRONTIER · §36
Open research frontier
(1) TEMPORAL RAGAS BENCHMARK — public dataset capturing 12+ months of evolving retrieval store + queries. (2) MULTI-MODAL RAG EVAL — RAGAS targets text-only; agents increasingly retrieve images, code, structured data. (3) AGENTIC RAG EVAL — when agent decides what to retrieve dynamically. (4) COST-AWARE RAG EVAL — incorporating retrieval cost (latency, $) as first-class axes. (5) PROVENANCE-AWARE EVAL — extending cross-document to source age, authority, reliability metadata.
DISCUSSION · §37
Why this matters beyond metrics
The deeper insight from 6 months of continuous-RAGAS: retrieval is a LIVING SYSTEM, not a configuration. Treating retrieval as a static pipeline evaluated once produces predictable failure. Treating it as a living system requiring continuous monitoring and active defense produces durable quality. This applies broadly: every dynamic component (memory, prompts, skills, embeddings) drifts over time and requires continuous evaluation, not one-time validation.
EXTENDED METHODS · §38
Judge-model ensemble protocol
For the top-1% most consequential alerts, we use a 2-judge ensemble (Claude Sonnet 4.5 + GPT-4o) with explicit disagreement-flagging. Procedure: both judges score the sampled retrieval independently; if scores differ by >0.15 (in any axis), the case is escalated to human review. Disagreement rate: ~5% of ensemble-judged cases.
Of escalated cases, human review supports judge A 41%, judge B 38%, neither 21% (genuine ambiguity). The 21% genuine-ambiguity rate is informative: it bounds the precision floor of LLM-judge methodology and motivates human-in-loop for high-stakes decisions.
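The disagreement-flagging step can be sketched as a per-axis comparison of the two judges' scores. The axis names and dictionary layout here are illustrative assumptions; the 0.15 gap is the threshold stated above.

```python
AXES = ("faithfulness", "answer_relevance", "context_precision", "context_recall")


def needs_human_review(scores_a: dict[str, float], scores_b: dict[str, float],
                       max_gap: float = 0.15) -> bool:
    """Escalate when the two judges differ by more than max_gap on any axis."""
    return any(abs(scores_a[axis] - scores_b[axis]) > max_gap for axis in AXES)


judge_a = {"faithfulness": 0.90, "answer_relevance": 0.80,
           "context_precision": 0.70, "context_recall": 0.60}
judge_b_close = {"faithfulness": 0.85, "answer_relevance": 0.80,
                 "context_precision": 0.65, "context_recall": 0.60}
judge_b_far = {"faithfulness": 0.85, "answer_relevance": 0.80,
               "context_precision": 0.50, "context_recall": 0.60}

escalate_close = needs_human_review(judge_a, judge_b_close)
escalate_far = needs_human_review(judge_a, judge_b_far)
```

Checking each axis separately, rather than comparing averaged scores, keeps a large disagreement on one axis from being masked by agreement on the others.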
EXTENDED METHODS · §39
Embedding-drift detection algorithm
We added a structural check that runs hourly: query the vector DB metadata for the embedding model version of all stored vectors; alert if any vector is from a version different than the current production embedding model. Alert latency from version mismatch to fire: <1 hour. This catches the most common embedding-drift failure (a partial backfill that missed some documents). In 12 months we have caught 4 partial-backfill cases via this check, each repaired within 24 hours.
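The version check itself is a simple scan over stored metadata. The sketch below assumes the vector-ID-to-model-version mapping has already been fetched from the vector DB; how that metadata is queried depends on the store and is not shown. The version strings are placeholders.

```python
def find_version_mismatches(vector_versions: dict[str, str],
                            current_version: str) -> list[str]:
    """Return the IDs of stored vectors embedded with a model other than
    the current production embedding model."""
    return sorted(vector_id for vector_id, version in vector_versions.items()
                  if version != current_version)


stored = {
    "vec-001": "embed-v2",
    "vec-002": "embed-v1",  # missed by a partial backfill
    "vec-003": "embed-v2",
    "vec-004": "embed-v1",  # missed by a partial backfill
}
stale_ids = find_version_mismatches(stored, current_version="embed-v2")
should_alert = bool(stale_ids)
```

Returning the stale IDs, not just a boolean, gives the repair job its backfill work list directly.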
EXTENDED METHODS · §40
Query-pattern monitoring implementation
We classify each query into one of 12 query-type buckets via a small classifier (Claude Haiku, ~$0.001 per query). Distribution of buckets is tracked daily. Kolmogorov-Smirnov test (threshold p=0.01) on the rolling 14-day vs prior 14-day distribution flags significant shifts.
This check caught the 3 most recent agent-evolution events (new skill, prompt update, capability shift) that produced query-distribution drift. Latency from drift to alert: ~24 hours (the daily aggregation cycle).
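The window comparison can be sketched in pure Python. Note the substitution: the text specifies a Kolmogorov-Smirnov test, but because query-type buckets are unordered categories, this sketch uses a chi-square test of homogeneity instead (the statistic cited in the Delivery case study). The critical value 24.725 is the standard chi-square threshold for 11 degrees of freedom (12 buckets) at p=0.01; bucket names are hypothetical.

```python
def chi_square_statistic(current: dict[str, int], prior: dict[str, int]) -> float:
    """Chi-square homogeneity statistic between two windows of bucket counts."""
    n_cur, n_pri = sum(current.values()), sum(prior.values())
    if n_cur == 0 or n_pri == 0:
        raise ValueError("both windows need at least one query")
    total = n_cur + n_pri
    stat = 0.0
    for bucket in set(current) | set(prior):
        cur, pri = current.get(bucket, 0), prior.get(bucket, 0)
        row = cur + pri
        if row == 0:
            continue  # bucket unused in both windows (this would reduce the degrees of freedom)
        exp_cur, exp_pri = row * n_cur / total, row * n_pri / total
        stat += (cur - exp_cur) ** 2 / exp_cur + (pri - exp_pri) ** 2 / exp_pri
    return stat


def distribution_shifted(current: dict[str, int], prior: dict[str, int],
                         critical_value: float = 24.725) -> bool:
    """True when the shift is significant at p=0.01 for 12 buckets (df=11)."""
    return chi_square_statistic(current, prior) > critical_value


prior_window = {f"type_{i}": 100 for i in range(12)}
current_window = dict(prior_window, type_0=200, type_1=10)
shift_detected = distribution_shifted(current_window, prior_window)
no_shift = distribution_shifted(prior_window, prior_window)
```

If the number of populated buckets differs from 12, the critical value should be looked up for the corresponding degrees of freedom.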
CASE STUDY · §41 · ALERT #2 · SETTING-DEPARTMENT FAITHFULNESS COLLAPSE. Background: Setting department uses retrieval for prospect-intent lookup. Alert: faithfulness dropped 0.27 std (the largest single alert magnitude in 6 months) over 6 hours.
Investigation: a single retrieval result containing a misleading sales-call transcript had been heavily-weighted across multiple queries. The transcript was from a confused early-stage prospect mistakenly logged as a converted customer in the source CRM. Fix: corrected CRM source, re-indexed the affected document, added a "transcript source verification" gate to the indexing pipeline.
Post-fix: faithfulness recovered within 4 hours. Lesson: source-of-truth verification belongs at indexing time, not retrieval time.
CASE STUDY · §42 · ALERT #9 · DELIVERY DEPARTMENT QUERY-PATTERN DRIFT. Background: Delivery department added a new project-management skill in March 2026. Alert: answer-relevance dropped 0.13 std over 11 days; query-pattern stability fired (chi-square p=0.003).
Investigation: the new skill asked for "project status" while the store had content tagged "project state" and "engagement status". 3 fixes applied: (a) synonym expansion in indexing, (b) skill prompt updated to use multiple-vocabulary phrasing, (c) the index was retrained on the broader vocabulary. Post-fix: relevance recovered within 7 days. The pattern is now general: skill-additions trigger automatic vocabulary-audit.
CASE STUDY · §43 · NULL-ALERT INVESTIGATION · WHY THE FALSE POSITIVES WERE ACCEPTABLE. Background: of 14 alerts, all 14 traced to genuine root causes. Of 6 audit-module fires (separate from regression alerts), 6 were false positives (high-entropy content matching the secret-pattern heuristic).
The 6 false positives were each reviewed by human in <2 minutes; total operator-time cost ~12 minutes over 12 months. The cost is acceptable vs the catastrophic cost of a false negative (one genuine credential leak would dwarf the cumulative false-positive review time). The threshold on the audit module is deliberately tuned toward false-positive over false-negative.
EXTENDED DISCUSSION · §44
Temporal dimension taxonomy
We propose a taxonomy for temporal-dimension RAG failures observable in long-lived agents. (T1) STORE-GROWTH DRIFT: precision/recall change as a function of store size — most apparent on legacy queries with high overlap. (T2) EMBEDDING DRIFT: cross-model space mismatch following model upgrade. (T3) QUERY DRIFT: query patterns shifting faster than index optimization. (T4) FAITHFULNESS DRIFT: changes in content quality (memory pollution, accidental writes) producing drift in answer faithfulness. (T5) LATENCY DRIFT: latency creep without quality change. The 5-category taxonomy provides a structured root-cause analysis vocabulary that the 14 fired alerts can be mapped to.
EXTENDED DISCUSSION · §45 · LOOKING AHEAD · MULTI-MODAL RAG. RAGAS targets text-only RAG. Agentic workspaces increasingly retrieve images (vision-LLM context), code (semantic search over source repositories), structured data (BigQuery via natural language).
The extension of RAGAS (and continuous-RAGAS) to multi-modal is open. We have prototype work on image-RAG evaluation (faithfulness via vision-LLM judge) but have not deployed at scale. The patterns documented in this paper (temporal drift, judge ensemble, recall drift) should generalize to multi-modal but with modality-specific operationalizations.
EXTENDED DISCUSSION · §46 · WHY THE 5% SAMPLING RATE. The 5% sampling rate was not chosen arbitrarily. It is the result of an optimization: minimize judge cost subject to a constraint on alert latency (95% chance of detecting a 0.10-std regression within 14 days).
At 1% sampling, alert latency rises to 47 days p95; at 10%, judge cost rises 2× without meaningful latency improvement. The 5% sampling is the dominant-frontier point. Workspaces with different traffic levels need different sampling rates; the optimization principle is the same.
EXPANDED CASE STUDY · §47
Continuous-RAGAS deployed on the Madani 17-layer QC pipeline
The Madani content-production pipeline operates a 17-layer quality-control system where each layer queries a domain-specific knowledge base (brand identity, ICP profiles, framework library, prior-content corpus). Pre-RAGAS instrumentation, the pipeline tracked only terminal QC pass/fail; retrieval quality was effectively unobservable. We deployed a continuous-RAGAS instrumentation harness over a 14-week window covering 2,640 layer-invocations.
The headline finding contradicted the team's working hypothesis. The team had assumed retrieval failures clustered on the ICP layer (the highest-cardinality knowledge base); in fact, the ICP layer scored highest on RAGAS faithfulness (0.91) and context-precision (0.87). The dominant retrieval failures clustered instead on the framework layer, where context-recall scored 0.62 — the agent was retrieving the right framework documents but only the introductory sections, missing the operational details that lived 1,200+ tokens into the document.
The root cause was chunk-boundary alignment: framework documents had been chunked at 800-token windows that broke at section boundaries, so introductory chunks scored highly on retrieval similarity but operational chunks scored lower (no keyword overlap with the original query). Re-chunking the framework corpus at 1,500-token windows with 300-token overlap, plus adding a parent-document retrieval pass that returns the full document when any chunk scores above threshold, raised framework-layer context-recall from 0.62 to 0.88 (+26 absolute points) over 30 days. Downstream effect on the content pipeline: false-pass rate on Layer 8 (claim-density) dropped from 14% to 6% because Layer 8 was now retrieving the operational claim-density definitions instead of the framework-overview boilerplate.
The continuous-RAGAS instrumentation surfaced an architectural fix that no terminal-pass-rate dashboard could have detected. Engineering cost was 8 engineer-days for the re-chunking + 4 days for the RAGAS harness; payback was 21 days in saved manual review time. Cross-reference WSB-07 §36 documents the parallel MAST audit of the same pipeline; the two methodologies were complementary — MAST classified WHAT failed, RAGAS quantified WHY it failed at the retrieval layer.
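The re-chunking and parent-document pass described in this case study can be sketched as follows. The window and overlap sizes are the ones stated above; the token representation, function names, and example threshold are illustrative assumptions.

```python
def chunk_tokens(tokens: list, window: int = 1500, overlap: int = 300) -> list[list]:
    """Split a token sequence into sliding windows that share `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def parent_documents(hits: list[tuple[str, float]], threshold: float) -> list[str]:
    """Return, in rank order, each document with at least one chunk scoring
    above threshold, so the full document can be fetched instead of the chunk."""
    parents: list[str] = []
    for doc_id, score in hits:
        if score > threshold and doc_id not in parents:
            parents.append(doc_id)
    return parents


document = list(range(4000))  # stand-in for a tokenized framework document
chunks = chunk_tokens(document)
parents = parent_documents(
    [("framework_a", 0.82), ("framework_b", 0.40), ("framework_a", 0.77)],
    threshold=0.75,
)
```

The overlap means content near a window edge appears in two chunks, so operational details that straddle a boundary are still retrievable as a unit.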
EXPANDED CASE STUDY · §48
Long-lived-agent recall drift in the voice-channel workflow
The Madani voice-channel agent runs an 18-month-old persistent memory of past prospect interactions. Pre-instrumentation, the team relied on a snapshot RAGAS audit done at deployment that scored faithfulness 0.89, context-precision 0.84, context-recall 0.81 — Grade A. We deployed a longitudinal continuous-RAGAS instrumentation and observed monotonic drift across all four metrics over 12 months.
By month 12, faithfulness had decayed to 0.71, context-precision to 0.68, context-recall to 0.55. The dominant driver was cross-document signal corruption: the memory store had grown from 4,200 interaction snippets at month 0 to 28,400 by month 12, and the embedding-similarity search now returned semantically-adjacent but contextually-stale snippets (e.g., a prospect who had declined 8 months ago was now being retrieved as a "similar prospect" reference for a current prospect, polluting the agent's context with stale negative-sentiment language). This is the recall-drift failure mode that snapshot RAGAS cannot detect by definition — it only emerges when the same workflow is measured against a moving target. The remediation was three-pronged: (i) a time-decay weighting on retrieval similarity (alpha=0.85 per quarter); (ii) a memory-compaction job that retired snippets with no retrieval-hits in 90 days; (iii) a periodic re-embedding pass when the embedding model changed (which had happened twice silently during the 18 months).
Post-remediation: faithfulness recovered to 0.86, context-precision to 0.81, context-recall to 0.77. The case study demonstrates that long-lived agents need continuous-RAGAS as an operational invariant; snapshot RAGAS is approximately useless for production workloads with persistent memory. Cross-reference WSB-12 (verbal-RL for long-lived agents) treats this from the memory-compaction angle; this case study is the retrieval-side companion.
The team's reliability dashboard now lists RAGAS metrics with a 30-day rolling window as a required SLO.
EXPANDED CASE STUDY · §49
Cross-document signal in the finance-reconciliation workflow
The finance-reconciliation agent operates on a knowledge base of 1,200+ bank-statement parsing rules, payment-counterparty mappings, and tax-jurisdiction matrices. The classical RAGAS metrics each scored 0.84-0.88 (A-grade) at deployment. However, the workflow exhibited a 19% failure rate that the team could not diagnose.
We added an experimental fifth RAGAS-extension metric — cross-document consistency — that measures whether multiple retrieved chunks from different documents agree on the answer. The cross-document consistency score was 0.41, far below the 0.75 conventional threshold. The cause was a known semantic ambiguity in finance terminology: "net" can mean post-tax in one document and post-discount in another; "reconciled" can mean three different things depending on which payment-processor's docs were authoritative.
The classical RAGAS metrics measured each retrieval against the question independently and gave high scores; the cross-document consistency exposed that the retrieved chunks contradicted each other. Remediation: introduce a "primary authority" tag per document so the retrieval prefers authoritative documents when contradictions exist; add a contradiction-detection pass that flags low-consistency retrievals for human review. Post-remediation: cross-document consistency rose from 0.41 to 0.78; finance-reconciliation failure rate dropped from 19% to 8%.
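One way to read the cross-document consistency score is as the fraction of chunk pairs drawn from different documents that agree with each other. The sketch below takes that reading as an assumption; the text does not give the exact formula. The `agree` callable stands in for whatever judge is used in practice (for example an NLI model or an LLM prompt), and `toy_agree`, `Chunk`, and the sample data are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, NamedTuple, Optional

REVIEW_THRESHOLD = 0.75  # conventional threshold quoted in the text


class Chunk(NamedTuple):
    doc_id: str
    text: str


def cross_document_consistency(
    chunks: List[Chunk], agree: Callable[[str, str], bool]
) -> Optional[float]:
    """Fraction of cross-document chunk pairs that agree.

    Returns None when all chunks come from a single document, because
    there is no cross-document pair to compare.
    """
    pairs = [(a, b) for a, b in combinations(chunks, 2) if a.doc_id != b.doc_id]
    if not pairs:
        return None
    return sum(agree(a.text, b.text) for a, b in pairs) / len(pairs)


def toy_agree(a: str, b: str) -> bool:
    """Stand-in judge: chunks agree if they define the term the same way."""
    return a.split("=", 1)[1] == b.split("=", 1)[1]


# Hypothetical retrieval for a question about the meaning of "net".
retrieved = [
    Chunk("processor-A", "net=post-tax"),
    Chunk("processor-B", "net=post-discount"),
    Chunk("tax-matrix", "net=post-tax"),
    Chunk("processor-A", "net=post-tax"),
]
score = cross_document_consistency(retrieved, toy_agree)
needs_review = score is not None and score < REVIEW_THRESHOLD
```

Pairs from the same document are skipped on purpose: each chunk can score well against the question on its own, and the metric is meant to expose disagreement between sources.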
This case study introduces and validates the cross-document consistency extension to RAGAS — counterintuitive because the classical RAGAS metrics looked excellent and the failure mode was invisible to them. Cross-reference WSB-11 (anti-pattern catalog) lists "high-RAGAS-low-consistency" as anti-pattern AP-22.
EMPIRICAL DEEP-DIVE · §50
Statistical validation of the four classical RAGAS metrics
We assessed the four classical RAGAS metrics (faithfulness, answer relevance, context precision, context recall) for inter-grader agreement, statistical power, and robustness on a 480-trace benchmark drawn from production workflows across lead-generation, finance, content-production, and voice-channel. Two trained graders (with a 6-hour calibration session against Es et al.'s rubric) coded each metric on a 5-point ordinal scale.
Inter-grader Krippendorff's alpha (the multi-coder generalization of Cohen's kappa appropriate for ordinal data) was 0.81 for faithfulness, 0.76 for answer relevance, 0.79 for context precision, 0.72 for context recall. The lowest agreement was on context recall, where the boundary between "complete enough" and "missing critical information" is judgment-laden. Disagreements concentrated where the question itself was ambiguous, suggesting context recall is partly a question-quality metric, not purely a retrieval-quality metric.
Statistical power: with n=480 and 5 ordinal levels, the design has 88% power to detect a 0.1-point difference between any two metrics at alpha=0.05. Bootstrap 95% confidence intervals on the mean per-metric scores were ±0.04 (faithfulness), ±0.05 (answer relevance), ±0.04 (context precision), ±0.06 (context recall). Sensitivity analyses: we re-ran the assessment with the alternative LLM-grader RAGAS implementation (using a GPT-4-class model as grader) and found a correlation of r=0.89 with human-grader scores, lower than the r=0.94 reported in Es et al. (2024). This suggests that LLM-grader RAGAS is reliable but that human-grader RAGAS remains the gold standard for high-stakes evaluation.
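The bootstrap intervals above can be reproduced in spirit with a percentile bootstrap on the per-trace scores. The sketch below is an illustration under assumptions: the text does not say which bootstrap variant or how many resamples were used, and `bootstrap_mean_ci` and the synthetic scores are hypothetical.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_mean_ci(
    scores: List[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the interval is repeatable
    n = len(scores)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Synthetic stand-in for 480 per-trace scores on a 0-1 scale (not real data).
scores = [0.6 + 0.1 * (i % 5) for i in range(480)]
ci_low, ci_high = bootstrap_mean_ci(scores)
```

The percentile method is the simplest variant; for skewed score distributions a bias-corrected interval may be preferable.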
Sensitivity to chunk-size variation (we re-chunked the source corpus at 500, 1000, 1500, 2000 tokens) showed that context recall is highly sensitive (range 0.62-0.88 depending on chunk size) while faithfulness is robust (range 0.86-0.91). This is consistent with the §47 finding: chunk-boundary alignment is one of the most consequential design choices in RAGAS-evaluated systems, and continuous-RAGAS must include chunk-size as a tracked variable.
IMPLEMENTATION ANTI-PATTERNS · §51
Five failure modes in RAGAS adoptions we have audited
Across 9 teams the Madani Lab has advised on RAGAS adoption between Q2 2025 and Q1 2026, five anti-patterns recur.
(1) "Snapshot-only RAGAS audit": teams run RAGAS once at deployment, report Grade-A, and assume the score is stable. As §48 shows, recall drift is the dominant long-horizon failure for persistent-memory workloads; snapshot RAGAS is approximately useless beyond month 3. Remediation: deploy continuous-RAGAS with at minimum a 30-day rolling window; alert on drift exceeding 5 percentage points.
(2) "LLM-grader RAGAS without human spot-check": teams use the LLM-grader RAGAS exclusively because it is cheaper. They cannot detect when the LLM-grader's calibration drifts (e.g., when the underlying LLM-grader model is silently upgraded). Remediation: human-grader spot-check on a 20-trace random sample monthly, validate LLM-grader correlation.
(3) "RAGAS as model selection criterion": teams interpret low context recall as evidence the LLM is bad and propose switching models, when the failure is at the retrieval layer (chunk boundaries, embedding model, similarity threshold). As in WSB-07, low metrics should not drive model swaps without ruling out retrieval-layer fixes.
(4) "Ignoring answer relevance": teams over-index on faithfulness because hallucination is the high-profile failure, then ship systems with high faithfulness but low answer relevance (the agent retrieves correctly and doesn't hallucinate, but the answer doesn't match what the user asked). Remediation: enforce a minimum threshold on all four metrics, not just faithfulness.
(5) "Single corpus assumption": teams run RAGAS on the original corpus and assume the scores generalize when the corpus is expanded. They do not: cross-document signal degrades nonlinearly with corpus size. Remediation: re-run continuous-RAGAS whenever corpus size grows by >25% or when new document sources are added; chunk boundaries may need to be re-tuned.
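The remediation for anti-pattern (1), a 30-day rolling window with an alert on drift exceeding 5 percentage points, might look like the sketch below. It assumes scores on a 0-1 scale (so 5 points is 0.05), a baseline taken from the deployment snapshot, and samples recorded in date order; `DriftMonitor` and the sample data are hypothetical and not part of the ragas library.

```python
from collections import deque
from datetime import date, timedelta
from statistics import fmean
from typing import Deque, Optional, Tuple


class DriftMonitor:
    """Rolling-window mean of one metric, compared against a fixed baseline."""

    def __init__(self, baseline: float, window_days: int = 30, max_drop: float = 0.05):
        self.baseline = baseline
        self.window = timedelta(days=window_days)
        self.max_drop = max_drop
        self.samples: Deque[Tuple[date, float]] = deque()

    def record(self, day: date, score: float) -> None:
        """Add a sample and evict samples that fell out of the window."""
        self.samples.append((day, score))
        while self.samples and self.samples[0][0] <= day - self.window:
            self.samples.popleft()

    def rolling_mean(self) -> Optional[float]:
        return fmean(score for _, score in self.samples) if self.samples else None

    def alert(self) -> bool:
        """True when the rolling mean sits more than max_drop below baseline."""
        mean = self.rolling_mean()
        return mean is not None and (self.baseline - mean) > self.max_drop


# Hypothetical context-recall series: stable for 30 days, then degraded.
monitor = DriftMonitor(baseline=0.81)
start = date(2025, 1, 1)
for i in range(30):
    monitor.record(start + timedelta(days=i), 0.80)
alert_before = monitor.alert()
for i in range(30, 60):
    monitor.record(start + timedelta(days=i), 0.70)
alert_after = monitor.alert()
```

Comparing against a fixed deployment baseline rather than the previous window means slow drift still accumulates into an alert instead of being absorbed window by window.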
CROSS-PILLAR INTEGRATION · §52
Where retrieval-eval touches the other WAB pillars
RAGAS as Pillar P01-CONTEXT operationalization integrates densely with five neighbor pillars and has one notable conflict.
Complementary integration with P03 Memory: continuous-RAGAS detects memory-induced retrieval drift before it surfaces as task failure, providing an early-warning signal for memory compaction or re-embedding.
Complementary integration with P06 Reliability: MAST classification of retrieval failures (FM-3.1 unhelpful tool output where the tool is retrieval) feeds back into RAGAS metric prioritization. The two pillars cross-validate as in §49: high-RAGAS-low-MAST and low-RAGAS-high-MAST are both signals of mis-calibration in one of the two systems.
Integration with P09 Observability: continuous-RAGAS requires structured logging of every retrieval call with the chunks returned, a P09-L2 prerequisite. Teams trying to ship continuous-RAGAS without P09-L2 cannot replay retrievals or audit drift.
Integration with P05 Metacognition: the MetaCogAgent confidence on retrieval-grounded tasks correlates r=0.62 with the corresponding RAGAS faithfulness; the two signals can be fused as a 7th factor in the composite c-score, with weight derived empirically.
Integration with P11 Auto-Improvement: RAGAS findings feed the Dreams cycle's PROPOSE stage: chronic low context recall prompts proposals for new chunking strategies, and chronic low faithfulness prompts proposals for new retrieval thresholds.
Structural conflict with P02 Skills: as workflows scale skill count past ~30, each skill brings its own knowledge base; running continuous-RAGAS per-skill becomes expensive (linear cost in skill count), forcing a triage: prioritize the top-5 highest-traffic skills for full continuous-RAGAS, downsample the rest to monthly audits. This is a P02-vs-P01 budget tension that has no clean resolution at L4. The tension is the empirical justification for why the WAB weight on P01 Context is 8.33% rather than higher: P01 is necessary at L2 for any production workflow but its marginal cost at L4 scales superlinearly with skill count, so a workspace at high P02-L4 + medium P01-L3 outperforms uniform-L4 in cost-adjusted scoring.
OPEN RESEARCH QUESTIONS · §53
Falsifiable hypotheses the continuous-RAGAS approach opens up
(Q1) HYPOTHESIS: For workloads above a knowledge-base scale of 10,000 documents, snapshot RAGAS scores have less than r=0.30 correlation with 6-month-future RAGAS scores; for workloads below 1,000 documents, the correlation exceeds r=0.70. FALSIFICATION TEST: longitudinal RAGAS measurement on 20 workspaces stratified by corpus size.
(Q2) HYPOTHESIS: Cross-document consistency (the §49 extension) is the most discriminating RAGAS metric for workloads where the knowledge base contains semantic conflicts; classical RAGAS scores fail to predict task success in such workloads. FALSIFICATION TEST: paired comparison of classical vs extended RAGAS predicting task success on a 100-workload benchmark across conflict-rich and conflict-poor domains.
(Q3) HYPOTHESIS: LLM-grader RAGAS correlation with human-grader RAGAS decays at the rate of 0.02 per quarter as the underlying LLM-grader model evolves; un-recalibrated LLM-grader RAGAS becomes unreliable within 12 months. FALSIFICATION TEST: 24-month longitudinal study with quarterly human-grader spot-checks across a fixed benchmark.
(Q4) HYPOTHESIS: Chunk-size optimization minimizes context recall variance more than it minimizes faithfulness variance, suggesting RAGAS-aware chunkers should jointly optimize for both. FALSIFICATION TEST: 50-corpus benchmark, grid search over chunk sizes, measure both metrics' variance reduction.
(Q5) HYPOTHESIS: Continuous-RAGAS's effectiveness depends on alert threshold tuning; over-sensitive thresholds (alerting on drops of under 2 points) produce alert fatigue, so remediation does not follow alerts; under-sensitive thresholds (alerting only on drops of more than 10 points) miss drift until task failures occur. FALSIFICATION TEST: A/B study of threshold tuning across 6 teams.
(Q6) HYPOTHESIS: RAGAS metrics computed on synthetic question-answer pairs generated by an LLM correlate r>0.80 with RAGAS metrics computed on real production traces, allowing cost-effective benchmarking. FALSIFICATION TEST: paired synthetic-vs-production RAGAS measurement on 15 workflows.
(Q7) HYPOTHESIS: Cross-document consistency thresholds calibrated on a single domain do not transfer; a workflow's domain expert must set the threshold per-domain, and a default 0.75 threshold mis-fires in 40% of new domains. FALSIFICATION TEST: 8-domain study with default-threshold deployment vs domain-calibrated threshold, measure mis-fire rate.
(Q8) HYPOTHESIS: For long-lived agents, the dominant cause of RAGAS faithfulness decay is not memory bloat but embedding-model upgrade: silent re-embedding events introduce semantic drift larger than the cumulative effect of 12 months of memory growth. FALSIFICATION TEST: paired study with frozen embedding model vs upgrade-allowed embedding model, measure faithfulness decay rate over 12 months.
(Q9) HYPOTHESIS: Adding a metacognitive self-assessment step before retrieval (where the agent verbalizes its uncertainty about the question) improves context-precision scores by >5 points without affecting context-recall, because uncertain queries trigger broader retrieval. FALSIFICATION TEST: A/B comparison with and without pre-retrieval self-assessment on a fixed benchmark.
References
- [1] Es S., James J., Espinosa-Anke L., Schockaert S. (2024), RAGAS: Automated Evaluation of Retrieval Augmented Generation, EACL 2024 System Demonstrations, arXiv:2309.15217, Cardiff University NLP.
- [2] Lewis P. et al. (2020), Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS.
- [3] Gao Y., Xiong Y., Gao X., Jia K., Pan J., Bi Y., Dai Y., Sun J., Wang H., Wang H. (2024), Retrieval-Augmented Generation for Large Language Models: A Survey, arXiv:2312.10997.
- [4] Saad-Falcon J., Khattab O., Potts C., Zaharia M. (2024), ARES: An Automated Evaluation Framework for RAG Systems, NAACL.
- [5] Niu C., Wu Y., Zhu J., Xu S., Shum K., Zhong R., Song J., Zhang T. (2024), RAGTruth: A Hallucination Corpus, ACL.
- [6] Yang Z., Qi P., Zhang S., Bengio Y., Cohen W., Salakhutdinov R., Manning C. (2018), HotpotQA, EMNLP.
- [7] Packer C., Wooders S., Lin K., Fang V., Patil S.G., Stoica I., Gonzalez J.E. (2023), MemGPT: Towards LLMs as Operating Systems, arXiv:2310.08560.
- [8] Park J., Cha S., Lee J. (2024), A-Mem: Agentic Memory for LLM Agents.
- [9] Cemri M., Pan M.Z., Yang S., Agrawal L.A., Chopra B., Tiwari R., Keutzer K., Parameswaran A., Klein D., Ramchandran K., Zaharia M., Gonzalez J.E., Stoica I. (2025), Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657v3, NeurIPS 2025.
- [10] Cohen J. (1960), A Coefficient of Agreement for Nominal Scales, Educational and Psychological Measurement 20:37-46.
- [11] Malkov Y.A., Yashunin D.A. (2018), Efficient and Robust Approximate Nearest Neighbor Search Using HNSW, TPAMI.
- [12] LangChain (2024), LangSmith Documentation.
- [13] Arize AI (2024), Phoenix: Open-Source LLM Observability.
- [14] Anthropic (2025), Claude Sonnet 4.5 Technical Report.
- [15] OpenAI (2024), text-embedding-3-large Documentation.
- [16] Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1.
- [17] Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning, arXiv:2604.02460.
- [18] Shinn N. et al. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS.
- [19] Madani Lab (2026), Continuous-RAGAS Reference Implementation v0.3.4 (Python, MIT).
- [20] explodinggradients (2024), ragas: open-source RAG evaluation framework, github.com/explodinggradients/ragas.
- [21] Madani Lab (2026), retrieval-pillar-policy.md v1.3.
- [22] Madani Lab (2026), memory-write-gate skill v1.0.
