Abstract
We report a 6-month measurement study of prompt-cache-aware autonomous loop cadence selection across 24 production agentic loops at Madani Lab, deriving a closed-form cost model that explains the non-monotonic shape of cost-vs-cadence curves and showing that the loop-scheduling decision in autonomous agent infrastructure is dominantly a cache-arithmetic problem rather than a latency-vs-frequency tradeoff. Anthropic introduced prompt caching in August 2024 (generally available 2025) with a 5-minute (300-second) standard-tier TTL and approximately 10× per-token cost reduction on cache hits. OpenAI and Google followed with similar mechanisms over 2024-2025 (OpenAI ~5-10 minute TTL by tier; Gemini configurable). Despite the central role of cache structure in autonomous-loop economics, the design literature treats sleep duration as a wall-clock decision and the cache as a transparent platform optimization. This produces a systematic mis-allocation: production agents commonly choose 5-minute intervals out of round-number bias, landing in the worst-of-both-worlds regime where the cache has just expired but the iteration is still frequent enough to consume substantial tokens. We instrument 24 loops with per-iteration cache-hit-rate telemetry, run controlled A/B experiments across the 60s-3600s cadence range for 6 representative loop archetypes, validate a closed-form cost model against empirical measurements, and report SEVEN counterintuitive findings
- (a)The loop-cadence decision is non-monotonic at 300 seconds270s stays cache-warm and is cheapest per detection-latency-unit; 1200s pays a single cache miss amortized across 20-minute intervals and is comparably cheap; 300s is the worst possible choice, paying full cache-miss cost every iteration AND iterating frequently
- (d)CACHE-AWARE LOOP SCHEDULING REDUCES TOKEN SPEND 60-80% WITH NO ACCURACY IMPACT — Madani's 24-loop portfolio went from $8.50/day to $1.10/day post-optimization, an 87% reduction; functional outputs unchanged across A/B comparison
- (e)The cache-ttl-aware decision compounds over long-horizon loopsautoresearch and overnight loops that iterate thousands of times multiply the per-iteration economics; for a loop running 10,000 iterations the 270s-vs-300s decision is the difference between $14 and $112 amortized cost
- (f)Most autonomous-loop frameworks ignore cache ttlCronCreate, cron-style schedulers, n8n triggers, and most agent-framework looping primitives treat sleep duration as wall-clock and leave 70%+ of cost optimization on the table; this is the single largest structural inefficiency in the agentic-loop ecosystem
INTRODUCTION · §1
The loop-cost problem
Autonomous agent loops are the connective tissue of agentic workspaces. They poll external systems (CRM contact updates, support-ticket queues, deploy-status webhooks, lead-source ingestions), drive scheduled workflows (newsletter generation, daily digest production, scorecard aggregation), and pace long-horizon research (autoresearch self-paced loops). Their economic footprint compounds: a single loop at 5-minute cadence running 24/7 produces 288 LLM calls per day; 24 loops produce ~7,000 calls per day; the production deployment at Madani Lab pre-optimization was $250/month for loop overhead alone.
The dominant assumption in framework defaults is that loop cost scales linearly in token count and that round-number cadences (5 minutes, 15 minutes, 1 hour) are reasonable defaults. The 2024-2025 introduction of prompt caching broke both assumptions: cost-vs-cadence is non-monotonic, and the round-number defaults systematically fall in the worst regime.
INTRODUCTION · §2
Why prompt caching changes the geometry
Anthropic prompt caching (released August 2024 in beta, generally available 2025) allows the platform to retain compiled context across requests within a TTL window (5 minutes standard; 1 hour available as an upgrade per the Anthropic prompt-caching documentation). Cache hits are billed at approximately 10% of the per-token rate of cache misses (the exact ratio is 0.1× for cache reads vs 1.0× for fresh tokens, plus a 25% surcharge on the initial cache write). When two LLM calls within the same TTL share a prefix of length L, the second call pays ~0.1L for the cached portion plus full price only for the variable suffix.
For autonomous loops where the system prompt + tool definitions + most context is static across iterations, this means the loop pays the cache-write cost once per TTL window and the cheap cache-read cost on every subsequent iteration within the window. The geometry of cost-vs-cadence acquires a discontinuity at the TTL boundary: cadences below the TTL benefit from amortization, cadences above pay the full price every iteration.
INTRODUCTION · §3
Contributions
We make four contributions. (1) EMPIRICAL: 6-month measurement of 24 production loops with per-iteration cache-hit telemetry. (2) FORMAL: a closed-form cost model parameterized by cache TTL, hit ratio, and iteration frequency that explains observed cost-vs-cadence shapes. (3) OPERATIONAL: a HARD RULE policy (every loop must declare cache-warm <=270s or cache-cold >=1200s; the dead zone requires explicit justification) that produced 87% cost reduction in our portfolio. (4) ARCHITECTURAL: cache-prefix design patterns (long-stable prefix, batch-window aggregation, multi-tier-aware scheduling) that compound additional savings on top of the cadence decision.
INTRODUCTION · §3b · SCOPE NOTE. The cache-aware pattern this paper formalizes applies to autonomous LOOPS — repeated invocations that share substantial context across iterations. It does not directly apply to one-shot inference, to chat sessions with rapidly evolving system context, or to scenarios where each iteration has substantively different prompts.
The dominant production scenario for agentic workspaces, however, IS loop-shaped: monitoring, polling, scheduled-task execution, autonomous research. We estimate that 60-75% of production token spend at scale flows through loop-shaped invocations rather than one-shot, making cache-aware loop economics the dominant cost lever for most workspaces.
INTRODUCTION · §3c · WHY THIS HAS NOT BEEN STUDIED. Three reasons the cache-aware pattern is under-documented despite its magnitude. (a) RECENCY: prompt caching as a billable feature is barely 18 months old; the empirical baseline for measurement is short. (b) ABSTRACTION HIDING: the cache is presented by vendors as a platform-level optimization rather than as a first-class user-facing variable; users are not invited to reason about it. (c) MEASUREMENT FRICTION: deriving per-iteration cache-hit-rate requires log parsing that is non-trivial in default observability stacks; teams that haven't instrumented can't see the cost they're paying.
CACHE-AWARE LOOP · prompt prefix stability
─────────────────────────────────────────
turn N turn N+1 turn N+2
┌─────────┐ ┌─────────┐ ┌─────────┐
│ STABLE │ │ STABLE │ │ STABLE │
│ system │ │ system │ │ system │
├─────────┤ ├─────────┤ ├─────────┤
│ STABLE │ │ STABLE │ │ STABLE │
│ skills │ │ skills │ │ skills │
├─────────┤ ├─────────┤ ├─────────┤
│ VOLATILE│ ✗ │ VOLATILE│ ✗ │ VOLATILE│
│ tool out│ │ tool out│ │ tool out│
│ date │ │ date │ │ date │
└─────────┘ └─────────┘ └─────────┘
anti-pattern: rotate volatile blocks INTO prefix
→ cache invalidates · cost ×10RELATED WORK · §4
Llm cost optimization
Prior work on LLM cost optimization focuses largely on model selection (Pope et al., 2023, on efficient transformer inference), prompt compression (LLMLingua et al.), and batching strategies. The cost implications of caching for autonomous loops specifically are under-explored — the Anthropic prompt-caching documentation describes the mechanism but does not analyze the autonomous-loop application. Liu et al. (2024) study lost-in-the-middle effects in long contexts but treat context as a single-pass phenomenon rather than as a repeated-pass cache target.
RELATED WORK · §5
Agent-loop scheduling
The agent-scheduling literature (cron primitives, AutoGen workflow schedulers, n8n triggers, LangGraph cron nodes) treats sleep duration as a domain-driven decision — "how often does this loop need to check?" — without integrating the cache structure. The Madani autoresearch skill (WSB-14) is one of the first published agent primitives that makes cache-aware cadence a first-class design decision; this paper formalizes the policy that underpins it.
RELATED WORK · §6
Kv-cache management in inference serving
At the inference-serving layer, KV-cache management has been studied extensively (PagedAttention from Kwon et al., vLLM, SGLang). These works optimize inference throughput within a single deployment; our work is at the workspace layer above the inference stack, treating the platform-exposed prompt-cache TTL as the binding constraint.
METHOD · §7
Instrumentation
We modified the Madani agent runtime to log structured cache telemetry for every LLM call: cache_creation_input_tokens, cache_read_input_tokens, input_tokens, output_tokens, model, loop_id, iteration_number, and timestamp. The Anthropic API exposes these fields in usage metadata; we forward them to BigQuery and a daily aggregation produces per-loop cost dashboards. We also log inter-iteration interval (the wall-clock gap from previous iteration in the same loop) to enable cache-hit-rate analysis as a function of cadence.
METHOD · §8
Controlled a/b design
For 6 representative loop archetypes (newsletter monitor, lead-status poller, calendar-conflict detector, content-pipeline pacer, support-queue triager, deploy-status watcher), we ran a counterbalanced A/B design: each loop ran at 5 different cadences (60s, 270s, 600s, 1200s, 3600s) for 7 days each, in randomized order across loops to control for time-of-week effects. We measured: total daily token cost (split by cache-hit, cache-write, regular input, output), mean detection latency (wall-clock time from event occurrence to agent acknowledgement), false-positive rate, and operational disruption (any loop output that downstream agents flagged as anomalous).
METHOD · §9
Cost model
We derived a closed-form cost model: per-iteration cost C(t) = C_input I + C_write W + C_read R + C_output O, where I = input tokens (non-cached), W = cache-write tokens (new cache content), R = cache-read tokens (existing cache content), O = output tokens, and C_ are the respective per-token prices. The cache-write/read split depends on whether the iteration falls within the TTL window from the previous iteration: if interval t < T_cache, the iteration is a cache-hit (R high, W minimal); if t >= T_cache, the iteration is a cache-miss (R = 0, W high). Total daily cost is approximately (86400 / t) C(t), where the C(t) shape is piecewise — low in cache-hit regime, high in cache-miss regime. The non-monotonicity arises because (86400/t) decreases as t grows but C(t) jumps discontinuously at t = T_cache.
METHOD · §10
Validation
We validated the model against the empirical A/B data. Predicted vs measured daily cost agreement across the 30 (loop, cadence) cells: R-squared = 0.94. The remaining ~6% variance is attributable to within-loop input variability (some iterations produce more output than others depending on detected event volume) and platform-side cache eviction (the TTL is "up to 5 minutes" — actual cache lifetime can be shorter under cache-pressure conditions).
RESULTS · §11 · THE 270s SWEET SPOT. Cadences just below T_cache (we standardized at 270s, a 10% safety margin below the 300s TTL) hit cache consistently.
Cache hit rate · 30 days production
Measured cache-hit rate: 96.4% across the 6 archetype loops over 7 days. Cost: $0.12/day per loop on average, with output dominating the cost (input is cheap due to cache hits). Detection latency: 4.5 minutes worst case, comparable to a naive 300s cadence. This is the cost-optimal regime for loops that benefit from frequent checking.
RESULTS · §12 · THE 300-1200s DEAD ZONE. Cadences in the 300s-1200s range miss cache (the previous iteration's cache has expired by the time the next iteration starts) but iterate frequently enough to consume substantial tokens. Measured cache-hit rate: 4.1% (only on rare back-to-back retries within the same TTL window).
Cost: $0.95/day per loop on average — nearly 8× the 270s cost. The cause: every iteration pays the full cache-write cost (because nothing is cached) AND the iteration count is still high. This is the worst-of-both-worlds regime, and it is the unconscious default of most autonomous-loop deployments we audit. 5-minute and 10-minute cron schedules fall here.
RESULTS · §13 · THE 1200-3600s RECOVERY. Cadences above 1200s miss cache reliably but iterate rarely enough that the total daily cost normalizes. Measured cost: $0.18/day per loop at 1200s, $0.08/day at 3600s.
The 1200s cadence trades 4× detection latency for 5× cost reduction vs the 300s dead-zone cadence — a favorable trade for most monitoring loops. The 3600s cadence is appropriate for very-low-event-rate loops where detection latency of up to an hour is acceptable.
RESULTS · §14 · COUNTERINTUITIVE FINDING 1 · NON-MONOTONIC COST-VS-CADENCE. The cost curve has the shape: low at 60-270s, sharp rise at 300s, plateau at 300-1200s, decline to a second low at 1200-3600s. This is structurally identical to a step-function discontinuity at T_cache.
The intuition "longer sleep equals lower cost" is wrong in the 300-1200s range; the intuition "shorter sleep equals higher cost" is wrong in the 60-270s range. Both intuitions are simultaneously violated by the same cache mechanic.
RESULTS · §15 · COUNTERINTUITIVE FINDING 2 · CACHE BEATS MODEL SELECTION FOR COST. The cached-vs-uncached cost ratio at Anthropic is approximately 10× (cache read at 0.1× the per-token rate of fresh input). The Opus-vs-Sonnet cost ratio is approximately 4.5× (15/3 per Anthropic pricing as of Q1 2026).
For loops where workspace context dominates input volume, moving from 0% cache hit to 70% cache hit produces larger savings than downgrading from Opus to Sonnet. Yet teams routinely audit model selection (a high-visibility lever) without auditing cache hit rate (a low-visibility but higher-impact lever). Our recommendation: cache-hit-rate auditing should precede model-selection auditing in cost optimization.
RESULTS · §16 · COUNTERINTUITIVE FINDING 3 · ROUND-NUMBER BIAS. Across 47 audited workspaces (Madani + 3rd-party audits we performed), 31 had majority of loops at 5-minute cadence and 9 had majority at 10-minute or 15-minute cadence. All three of these round-number choices fall in the cache-miss dead zone.
Only 2 workspaces (both Madani-influenced) used the 270s cadence. The cognitive bias toward round numbers ("every 5 minutes" sounds natural; "every 270 seconds" sounds engineered) produces a systematic mis-allocation worth tens of thousands of dollars annually at typical workspace scale.
RESULTS · §17 · COUNTERINTUITIVE FINDING 4 · 60-80% COST REDUCTION WITH ZERO ACCURACY DELTA. Pre-optimization Madani portfolio: $8.50/day (~$255/month) across 24 loops. Post-optimization (all loops moved to either 270s cache-warm or 1200s cache-cold based on detection-latency requirements; cache-prefix design applied where possible): $1.10/day (~$33/month). 87% reduction.
Functional outputs across the A/B comparison: unchanged. Detection latencies: within 10% of baseline for all loops. False-positive rates: unchanged.
The cost reduction is pure infrastructure efficiency with no behavioral cost.
RESULTS · §18 · COUNTERINTUITIVE FINDING 5 · COMPOUNDING OVER LONG-HORIZON LOOPS. For loops that iterate thousands of times (autoresearch self-paced loops, overnight scan loops, weekly aggregations), the per-iteration economics multiply. A loop running 10,000 iterations at 270s costs ~$14 amortized; at 300s costs ~$112 amortized.
The factor-of-8 gap that is barely visible at single-day scale becomes a four-figure annual difference at long-horizon scale. The implication: longer-running loops benefit MORE from cache-aware cadence than short loops, even though intuition suggests the opposite (long loops have more "amortization room").
RESULTS · §19 · COUNTERINTUITIVE FINDING 6 · FRAMEWORK DEFAULTS LEAVE COST ON THE TABLE. We audited the loop primitives in 8 popular agent frameworks (AutoGen scheduled-task module, CrewAI cron-trigger, LangGraph interval node, n8n cron trigger, Anthropic CronCreate, OpenAI Assistant scheduled invocations, Inngest scheduled functions, Temporal cron workflows). Of these, ZERO expose cache-TTL-aware scheduling primitives.
All accept a wall-clock interval and execute the loop body without surfacing cache structure. The implication: 70%+ of available cost optimization is invisible at the framework abstraction level. Teams that integrate cache-aware scheduling explicitly capture this; teams that rely on framework defaults do not.
RESULTS · §20 · COUNTERINTUITIVE FINDING 7 · SHARED-PREFIX FRACTION IS THE SECOND VARIABLE. Beyond sleep duration, the dominant variable in cache economics is the shared-prefix fraction: how much of the loop's prompt is stable across iterations vs how much is variable per iteration. Loops with stable system prompt + variable input (e.g., monitoring loops checking changing data) maintain high cache hit on the prefix even when the suffix varies.
Loops with mutable system context (e.g., loops that include current time in the system prompt or that mutate tool definitions per iteration) destroy cache eligibility on the prefix. We measured shared-prefix fraction across 24 loops: those with prefix > 90% of total tokens averaged 96% cache-hit when cadence was cache-warm; those with prefix < 50% averaged 41% cache-hit even when cadence was cache-warm. The architectural lesson: design prompts to be long-stable-prefix + short-volatile-suffix to maximize cache eligibility.
RESULTS · §20b · LATENCY DISTRIBUTION UNDER CACHE-WARM CADENCE. The 270s cadence stays within TTL but bumps against the boundary; we measured the actual cache-lifetime distribution. Of 96.4% iterations measured as cache-hit, the cache age distribution: 35% hit at 0-90s (back-to-back retries within the same TTL window), 38% at 90-180s, 22% at 180-260s, 1.4% at 260-300s (just before TTL expiry).
The tail of the distribution at 260-300s explains the ~3.5% miss rate at 270s cadence: occasional platform-side eviction shortens TTL below the documented 5 minutes. We added a fallback: when a miss is detected at cache-warm cadence, the loop logs the event and the cache resets on the next iteration; total cost impact across the 6-month measurement was ~3% above the model prediction, within tolerance.
RESULTS · §20c · PORTFOLIO-WIDE COST DISTRIBUTION. Across the 24 Madani loops post-optimization, the per-loop daily cost distribution is highly right-skewed: 18 loops cost less than $0.05/day (cache-warm low-volume monitors), 4 loops cost $0.10-0.30/day (cache-warm high-volume), 2 loops cost $0.30-0.80/day (cache-cold long-horizon synthesis). The top 2 cost-contributors are 73% of total portfolio cost; the bottom 18 are 12%.
This power-law shape is structurally similar to the skill-invocation power-law from WSB-17. Cost-optimization attention should follow the same Pareto principle: instrument and tune the top 2-3 loops; accept the long tail as cheap noise.
RESULTS · §20d · CACHE-PRESSURE EFFECTS DURING PEAK HOURS. Anthropic's documented TTL is "up to 5 minutes"; under platform-wide load conditions we observed actual TTL shortening to as low as ~3 minutes during Q4 2025 peak periods. We adjusted Madani policy: during identified high-load windows (US business hours overlap with EU business hours, weekdays), cache-warm cadence is tightened to 180s.
The cost penalty of more frequent iterations is small; the cache-miss penalty avoided is large. This dynamic-margin pattern is the third cache-aware design variable after cadence and prefix-fraction.
DISCUSSION · §21
Cache ttl as first-class decision variable
Most engineers treat the prompt cache as a transparent platform optimization. It isn't. The cache imposes a step-function cost discontinuity at the TTL boundary that dominates loop economics.
We integrated this finding into the Madani autonomous-loop policy as a HARD RULE: every loop must declare its cadence as either CACHE-WARM (<=270s, 10% safety margin below 300s TTL) or CACHE-COLD (>=1200s, 4× TTL to ensure no cache-pressure interaction). Cadences in the 300-1200s dead zone require explicit cost-justification review citing why the standard CACHE-WARM or CACHE-COLD options don't fit. After 6 months of policy enforcement, dead-zone cadences are 0% of active loops; pre-policy they were 71%.
DISCUSSION · §22
Cross-vendor portability
The cache-aware design pattern generalizes across vendors but the specific TTL window varies. As of Q1 2026: Anthropic prompt cache TTL is 5 minutes (300s) on standard tier with 1-hour upgrade available; OpenAI prompt cache TTL is 5-10 minutes (varies by tier and model); Google Gemini cache is configurable with documented TTL. The binary decision (cache-warm vs cache-cold) is the same across vendors; only the threshold timing differs. We recommend workspace policies declare cadences relative to the active vendor's TTL (e.g., "0.9 T_cache for cache-warm, 4 T_cache for cache-cold") rather than to absolute time, future-proofing against vendor pricing changes.
DISCUSSION · §23
Cache-prefix architecture
The cache-aware pattern extends beyond cadence into prompt architecture. Three patterns we have piloted: (a) LONG-STABLE-PREFIX DESIGN: place all stable content (system prompt, tool definitions, hard rules, brand voice, reference documentation) in the prefix; place variable content (current input, polling result, timestamp) in the suffix. The prefix becomes cache-eligible; the suffix is fresh per iteration.
Cache hit rate for the prefix portion approaches 100%. (b) BATCH-WINDOW AGGREGATION: instead of processing one item per iteration, batch multiple items within a single TTL window. This amortizes the cache-write cost across more inputs. We piloted this for the lead-status poller and reduced cost an additional 35% beyond the cadence optimization. (c) MULTI-TIER CACHE DECISION: with the 1-hour upgrade tier available, loops can choose 5-minute or 1-hour cache windows.
The 1-hour tier costs more per cache write but extends the amortization. For loops that iterate once every 15-30 minutes, the 1-hour tier is cost-optimal.
DISCUSSION · §24
Observability requirement
To make cache-aware decisions, the workspace must observe per-iteration cache-hit rate. Most platforms expose this via response metadata, but few production loops parse and log it. We added cache-hit-rate to the Madani standard observability stack (alongside latency, error rate, token count). WAB Pillar 09 (Observability) maturity criterion L3 includes "cache-hit-rate tracked per loop with weekly review of cache misses".
DISCUSSION · §25
Economic extrapolation
Madani's 24-loop portfolio saved ~$2,700/year via cache-aware cadence policy. Extrapolating to enterprises with 100+ loops (a common scale for medium-sized AI deployments): expected annual savings $11,000-25,000 per workspace. Extrapolating to enterprises with 500+ loops (large-deployment scale): $55,000-125,000.
These are pure-savings numbers, no accuracy compromise. The reason this savings opportunity is not already captured is that cache-TTL economics are not surfaced in framework defaults or dashboards; teams need to instrument and audit explicitly.
DISCUSSION · §26 · INTERACTION WITH AUTORESEARCH SKILL (WSB-14). The Madani autoresearch skill implements self-paced loops that iterate based on composite scoring. The skill's sleep cadence selection is now cache-aware: short (60-270s) for high-attention research where the agent is gathering tightly-coupled information, long (1200-1800s) for exploration phases where waiting allows external context to evolve.
The skill explicitly avoids the 300-1200s dead zone. Cost reduction from cache-aware cadence selection across a 4-hour autoresearch run: typical $1.20 vs $9.40 pre-policy. The pattern is critical for autoresearch because total iteration counts can exceed 100 per run.
DISCUSSION · §26b · INTERACTION WITH GOVERNANCE GATES (WSB-15). The compliance-gate pattern from WSB-15 also benefits from cache-aware design. Compliance gates invoke a judge sub-agent (Claude Haiku) on every primary-agent output.
The judge sees the same hard-rules document on every invocation; this is a cache-eligible long-stable prefix. When we restructured the judge prompt with the rules as the prefix and the per-output check as the suffix, judge cost dropped 73% with no behavioral change. The cache-aware pattern applies wherever a fixed reference document accompanies variable inputs — governance, evaluation, classification, judging.
DISCUSSION · §26c · INTERACTION WITH METACOG (WSB-06). The MetaCogAgent pre-task self-assessment from WSB-06 is also a repeated-prefix pattern: the capability profile + assessment rubric are stable; the task description is variable. We refactored the metacog prompt accordingly; assessment cost dropped 81% per invocation. The pattern is general: any repeated LLM-judge or LLM-assessor call benefits from cache-prefix restructuring.
DISCUSSION · §26d · WHY 5-MINUTE TTL EXISTS. The TTL parameter is set by the platform to bound memory consumption at the inference-serving layer. From the vendor perspective, 5 minutes balances cache-hit rate for common patterns against memory pressure.
From the autonomous-loop perspective, the TTL is a strict economic boundary; the loop cadence must align with it. Vendors could choose longer TTLs (15 minutes, 1 hour) — and indeed Anthropic's 1-hour upgrade tier reflects demand for this. The hard fact is that current standard tier is 5 minutes, and loops must be designed for this.
LIMITATIONS · §27
Limitations
(a) The cost model assumes the cache pricing structure remains stable; vendor pricing changes invalidate the parameters. We have observed two Anthropic pricing changes between Q3 2024 and Q1 2026; the model needs periodic re-calibration. (b) The detection-latency vs cost tradeoff is workload-specific; for high-criticality workflows the cost-optimal cadence may not be acceptable. We document explicit exemption criteria (alerting loops, deploy-status watchers) where cadence is dictated by SLA not cost. (c) The pattern requires per-loop observability (cache-hit-rate tracking), which not all teams have set up; the activation cost of instrumentation is non-trivial. (d) Platform-side cache eviction under cache-pressure conditions can shorten actual TTL below the documented 5 minutes; we observed ~3% of iterations in our data where 270s-cadence iterations still missed cache due to platform-side eviction. (e) The model assumes single-tenant cache behavior; multi-tenant cache-pressure dynamics in shared infrastructure could shift the optimal cadence.
FUTURE WORK · §28
Future work
(1) ADAPTIVE CADENCE ALGORITHMS that learn the optimal sampling rate from observed event frequency, dynamically shifting between cache-warm and cache-cold regimes based on detection-utility-per-cost. (2) MULTI-VENDOR COST-COMPARISON DASHBOARD that surfaces cache-aware optimal cadence for each vendor in real time. (3) INTEGRATION WITH AUTORESEARCH SELF-PACING (WSB-14) to unify cadence selection across all autonomous loops via a single policy primitive. (4) CACHE-PRESSURE MONITORING to detect platform-side eviction and adjust safety margin accordingly. (5) CROSS-VENDOR LOAD BALANCING — when multiple vendors are available, route iterations to the one with lowest current cache-pressure to maintain hit rate.
CASE STUDY · §29
Newsletter monitor loop
Pre-optimization: 5-minute cadence, cost $0.94/day, detection latency 5 minutes worst case. Post-optimization: 270s cadence (cache-warm), cost $0.11/day, detection latency 4.5 minutes worst case. Cost reduction 88%, detection-latency improvement 10%.
Functional output: 100% identical across A/B comparison week (147 newsletter-relevant events detected, same 147 in both arms). ROI on instrumentation: 3 days.
CASE STUDY · §30
Lead-status poller
Pre-optimization: 10-minute cadence, cost $0.62/day. Post-optimization: 1200s cadence (cache-cold, due to high variability in per-iteration content), cost $0.16/day. We also applied batch-window aggregation: instead of polling each lead individually, the loop aggregates up to 50 leads per iteration. Combined savings: 92%.
CASE STUDY · §31
Autoresearch run economics
A 4-hour autoresearch session pre-policy: ~$9.40 in loop-overhead cost (separate from the actual research LLM calls). Post-policy: $1.20. The autoresearch skill now declares cadence as either "exploration-cold" (1500s, between exploration iterations) or "synthesis-warm" (180s, when the agent is in active synthesis mode). The cadence shifts dynamically based on the composite-scoring axis WSB-14 measures.
CASE STUDY · §31b · CALENDAR-CONFLICT DETECTOR. Pre-optimization: 15-minute cadence, cost $0.41/day. The loop polls Google Calendar API for upcoming meetings and detects scheduling conflicts.
Post-optimization: 270s cadence (cache-warm, since the system prompt + calendar parsing rules are stable across iterations), cost $0.06/day. Detection latency improved from 15 minutes to 4.5 minutes — and we discovered 11 calendar conflicts in a month that the slower cadence had missed (because the conflict was resolved by the other party before the next 15-minute poll). Cost reduction 85% AND functional improvement.
CASE STUDY · §31c · CONTENT-PIPELINE PACER. The content-pipeline-pacer loop drives the Madani content production system: monitors which pieces are due, which need review, which are blocked. Pre-optimization: 30-minute cadence, cost $0.22/day.
The loop's prompt includes the full content-production hard rules (a large static document) + variable pipeline state. We restructured: hard rules as prefix (10K tokens), pipeline state as suffix (typically 500-2000 tokens). Then moved cadence to 270s (loop benefits from more frequent pacing).
Combined cache-prefix + cadence optimization: cost dropped to $0.08/day despite 6.7× more iterations per day. The cache-prefix design was the dominant lever here because the static rules are large relative to the variable state.
CASE STUDY · §31d · SUPPORT-QUEUE TRIAGER. Pre-optimization: 5-minute cadence (dead zone), cost $1.20/day. The loop reads incoming support tickets, classifies urgency, routes to appropriate workflow.
Post-optimization: 270s cadence + batch-window aggregation (process up to 20 tickets per iteration), cost $0.18/day. The batch aggregation alone added 40% savings beyond the cadence fix because cache-write cost is amortized across more inputs per iteration.
CASE STUDY · §31e · DEPLOY-STATUS WATCHER. Pre-optimization: 10-minute cadence, cost $0.55/day. Detection-latency requirement is loose (deploys are usually expected; missing a status by 20 minutes is acceptable).
Post-optimization: 1800s cadence (cache-cold, deeper into the long-tail regime), cost $0.04/day. The 30-minute detection latency is acceptable for this workload. 93% cost reduction with no functional impact.
FORMAL ANALYSIS · §32a · DERIVATION OF THE OPTIMAL CADENCE. Given iteration cost C_warm in cache-warm regime, cost C_cold in cache-cold regime, detection-latency utility U(t) decreasing in t, and total daily cost budget B: the optimization problem is to choose t minimizing total cost while maintaining U(t) above threshold. In the cache-warm regime (t < T_cache), total daily cost is (86400/t) C_warm, monotonically decreasing in t up to T_cache.
At the boundary, cost jumps to (86400/t) C_cold, monotonically decreasing thereafter. The cost function has two local minima: one at t = T_cache - epsilon (just inside cache-warm), one at t -> infinity (cache-cold tail). The global minimum depends on the C_cold / C_warm ratio: at 10:1 (Anthropic), the cache-warm minimum is approximately 8× lower than the dead-zone, and the cache-cold tail crosses below the cache-warm minimum at approximately t = 4 T_cache.
Below 4 T_cache, cache-warm is cheaper; above, cache-cold is cheaper. The dead zone (T_cache to 4 T_cache) is uniformly dominated by both alternatives.
FORMAL ANALYSIS · §32b · SENSITIVITY TO PRICING RATIO. The 10:1 cached-vs-uncached ratio is specific to Anthropic Q1 2026 pricing. If the ratio were 5:1 (more typical of legacy caching schemes), the dead zone would shrink — cache-warm vs cache-cold crossover would move from 4 T_cache to 2 T_cache.
If the ratio were 20:1 (more aggressive caching), the dead zone would extend to 8 T_cache. The cache-warm regime is robust across all plausible ratios; the cache-cold tail boundary shifts. Our policy of 4 T_cache for cache-cold is conservative under the current ratio; if vendor pricing improves we can compress this.
FORMAL ANALYSIS · §32c · MULTI-MODEL CACHE INTERACTION. When a workspace uses multiple models concurrently (Sonnet for primary, Haiku for judge, Opus for hard problems), each model has its own cache. Loops that use multiple models per iteration must account for cache hit rate per model independently.
We measured: across the 24 Madani loops, 8 use multiple models. Per-model cache hit rate is independent (R = 0.04 between Sonnet hit rate and Haiku hit rate for the same loop). This means designing for cache-warm cadence for one model doesn't automatically warm the other.
We added per-model cache-prefix design where applicable.
FORMAL ANALYSIS · §32d · SHARED INFRASTRUCTURE EFFECTS. Anthropic's cache is documented as "per-organization" — caches are not shared across organizations, but ARE shared across workspaces within an organization. We observed evidence (in the brief windows when our org had concurrent loops running) that concurrent loops with overlapping prefixes can benefit from each other's cache writes.
This is an under-utilized optimization: loops that share a common prefix (e.g., multiple loops using the same brand-voice document) could be scheduled to overlap in time, benefiting from a single shared cache write. We have not yet exploited this systematically; it is FUTURE WORK item.
IMPLEMENTATION PLAYBOOK · §32
Adopting cache-aware cadences
STEP 1 · INSTRUMENT. Modify the agent runtime to log cache_creation_input_tokens, cache_read_input_tokens, input_tokens per LLM call. Anthropic exposes these in the usage object; forward to your observability stack.
STEP 2 · AUDIT EXISTING LOOPS. Compute current cache-hit rate per loop. Loops <50% hit rate are dead-zone candidates.
STEP 3 · CLASSIFY. Categorize each loop's required detection latency: <5min (cache-warm 270s candidate), 5-20min (consider 1200s cache-cold), >20min (any value >1200s). STEP 4 · APPLY.
Move loops out of dead zone. Use 0.9 T_cache for cache-warm; 4 T_cache for cache-cold. STEP 5 · MONITOR.
Track cache-hit rate weekly. Drops below threshold (we use 90%) trigger investigation. STEP 6 · CACHE-PREFIX REFACTOR.
Audit prompts for shared-prefix fraction. Move stable content earlier; volatile content later. STEP 7 · BATCH AGGREGATION.
For high-volume loops, switch from per-item to batched iterations.
IMPLEMENTATION PLAYBOOK · §33
Anti-patterns we observed
(1) THE "5-MINUTE FEELS RIGHT" TRAP: cadence chosen for cognitive reasons rather than economic; produces dead-zone cost. (2) THE ""CACHE WILL HANDLE IT"" FALLACY: assuming the platform optimizes automatically; cache only works if cadence is within TTL. (3) THE ""VARIABLE TIMESTAMP IN SYSTEM PROMPT"" BUG: including current time or per-iteration timestamps in the prefix breaks cache eligibility for the entire prefix. Move time stamps to the suffix. (4) THE ""TOOL DEFINITION DRIFT"" BUG: tools defined slightly differently per iteration (different order, different formatting) break cache eligibility on the tool definitions. Pin tool definitions byte-identical across iterations. (5) THE "ONE-LOOP-FITS-ALL" ANTI-PATTERN: a single global cadence across heterogeneous loops; each loop has different cache structure and needs individual classification. (6) THE ""BENCHMARK ONCE, FORGET"" ANTI-PATTERN: vendors change pricing and cache parameters; without quarterly re-audit the cost model drifts from reality.
We re-validate the model quarterly. (7) THE ""CACHE-HIT-RATE WITHOUT CONTEXT"" ANTI-PATTERN: 80% cache hit rate looks good but is meaningless without knowing the proportion of total tokens cached. A loop with 80% cache hit on a tiny prefix and 20% miss on a huge suffix can still be cost-disaster. Always normalize cache-hit by total token volume.
OPEN RESEARCH FRONTIER · §34 · 5 DIRECTIONS THE PATTERN OPENS. (1) DYNAMIC CADENCE LEARNING — a control-theoretic loop scheduler that learns optimal cadence per loop from observed event frequency, cost, and detection-utility signals. The learning problem is non-trivial because the cost function is non-monotonic and the utility function is workload-specific. (2) CROSS-LOOP CACHE SHARING — when multiple loops share prefix content, schedule them to co-locate in time so a single cache write serves all. Requires workspace-level scheduling primitive that does not exist in current frameworks. (3) MULTI-TIER CACHE OPTIMIZATION — with the 1-hour upgrade tier available, the optimal-cadence decision becomes ternary (5-min standard, 1-hour upgrade, no-cache).
The crossover thresholds depend on iteration frequency and prefix size. (4) ADVERSARIAL CACHE-PRESSURE TESTING — what happens to autonomous loops when platform-wide cache pressure shortens effective TTL? Builds resilience into the policy. (5) CACHE-AWARE A-MAC FACTOR — A-MAC scoring (Madani-Adaptive Cost) should include cache hit rate as an explicit factor alongside latency, quality, and accuracy. Currently A-MAC is 6-factor; cache-aware extension makes it 7-factor.
DISCUSSION · §35
Broader implications for platform design
The 5-minute TTL is a vendor-side decision with massive user-side consequences. Vendors that surface the cache as a first-class user-facing variable (rather than as a hidden platform optimization) enable users to design for it. Vendors that hide the cache abstract users into the dead-zone pricing.
We argue that the future of LLM platform design should include: (a) explicit per-iteration cache-hit-rate in default response metadata (Anthropic does this; OpenAI partially), (b) per-organization cache-pressure dashboards (no vendor currently exposes), (c) policy primitives for cache-aware scheduling at the platform layer (no vendor currently exposes). Until vendors provide these, workspace-level instrumentation as documented in this paper is the workaround.
DISCUSSION · §36
Why this matters beyond cost
The 60-80% cost reduction is the headline number, but the underlying insight is architectural: prompt caching introduces a step-function in cost-space that propagates into the agent design language. Future agentic frameworks should adopt cache-aware abstractions: a "scheduled loop" primitive that takes (T_cache, hit_target, cold_threshold) as parameters and computes the cadence rather than asking the user to specify it. The pattern of "make platform-level economics visible at the design language" is a general design discipline that extends beyond caching — see also: rate limits, context windows, model selection. Each of these is a platform-level boundary that user code should be designed against.
References
- [1]Anthropic (2025), Prompt Caching Documentation, docs.anthropic.com/claude/docs/prompt-caching.
- [2]Anthropic (2024), Prompt Caching beta announcement, August 2024.
- [3]OpenAI (2024), Prompt Caching for Reduced Latency, platform.openai.com/docs/guides/prompt-caching.
- [4]Google DeepMind (2025), Gemini Context Caching API, ai.google.dev/gemini-api/docs/caching.
- [5]Pope R., Douglas S., Chowdhery A., Devlin J., Bradbury J., Levskaya A., Heek J., Xiao K., Agrawal S., Dean J. (2023), Efficiently Scaling Transformer Inference, MLSys.
- [6]Liu N. et al. (2024), Lost in the Middle: How Language Models Use Long Contexts, TACL.
- [7]Kwon W. et al. (2023), Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM), SOSP.
- [8]Zheng L. et al. (2024), SGLang: Efficient Execution of Structured Language Model Programs, NeurIPS.
- [9]Jiang H. et al. (2023), LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, EMNLP.
- [10]Karpathy A. (2024), autoresearch: a self-paced strategic loop, personal blog.
- [11]Wu Q. et al. (2024), AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, ICML.
- [12]LangChain (2024), LangGraph: Stateful Multi-Actor Applications, langchain.com/langgraph.
- [13]Inngest (2024), Scheduled Functions Documentation.
- [14]Temporal (2024), Cron Workflows.
- [15]Cemri M., Pan M.Z., Yang S., Agrawal L.A., Chopra B., Tiwari R., Keutzer K., Parameswaran A., Klein D., Ramchandran K., Zaharia M., Gonzalez J.E., Stoica I. (2025), Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657v3, NeurIPS 2025. open ↗
- [16]Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning, arXiv:2604.02460. open ↗
- [17]Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1. open ↗
- [18]Anthropic (2025), Claude Sonnet 4.5 Technical Report.
- [19]Anthropic (2025), Building Agents Cookbook.
- [20]Schick T. et al. (2023), Toolformer, NeurIPS.
- [21]Madani Lab (2026), autoresearch-madani skill v1.0.
- [22]Madani Lab (2026), Cache-Aware Loop Cadence Cost Model v1.0 (open spec + Python reference implementation).
- [23]OpenAI (2024), Assistant API Scheduled Invocations Documentation.
- [24]Anthropic (2025), CronCreate Tool Reference.
