WSB-06 · 2026-05-20


MetaCogAgent in Production: Adapting Wang & Shu (arXiv:2605.17292) to Italian SMB Operations

Calibration collapses 2.6× from Easy to Hard tasks · cross-agent peer evaluation contributes nearly as much as self-introspection · ECE 0.24 → 0.087 in 4 days.

Madani Lab · adapter for Wang & Shu arXiv:2605.17292v1

metacognition · ECE · calibration · difficulty-stratified-calibration · cross-agent-evaluation · cybernetic-loop

Abstract

We replicate Wang & Shu (arXiv:2605.17292v1, 17 May 2026 — MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation, IEEE SMC 2026 submission) in a production setting. The original Wang/Shu work demonstrates on the MetaCog-Eval benchmark (700 tasks across 5 cognitive dimensions: Logical Reasoning, Knowledge Retrieval, Code Generation, Mathematical Computation, Commonsense Inference, plus 100 Cross-domain) that the proposed self-aware task delegation framework achieves 82.4% task accuracy with 0.841 delegation precision and Expected Calibration Error (ECE) of 0.087 — outperforming the strongest routing baseline by 8.7 percentage points while using 5.1% fewer API calls than AutoGen (1382 vs 1456) and 34% fewer than Majority-Vote (2100). The framework comprises three components: (1) a Metacognitive Self-Assessment Unit that computes c_i(t_k) = λ·c_v + (1−λ)·c_p with λ=0.6 (verbalized confidence weighted slightly more than profile lookup); (2) an adaptive delegation protocol triggered when c_i < θ' = θ + γ·δ (where θ=0.5, γ=0.2 dampening, δ=|c_v−c_p| second-order conflict signal); (3) a capability boundary learning module updating profiles via EMA with α=0.1 (≈10-task memory horizon, theoretically a Bernoulli-Beta posterior mean approximation). We deployed MetaCogAgent in the Madani production workspace within 4 days of paper release, measured impact over 30 days, integrated it as HARD RULE in metacognition-policy.md, and surfaced SEVEN counterintuitive sub-findings the paper does not foreground — most based on numerical patterns visible only when you re-read the tables for what they imply rather than what they highlight. The contribution of this WSB paper is not the production replication (ECE 0.24 → 0.087, escalation accuracy 67% → 91%, wasted token spend −64%, task completion time for executed tasks −18%); it is the seven sub-findings
(a) CALIBRATION DEGRADES 2.6× FROM EASY TO HARD · ECE 0.051 Easy versus 0.132 Hard from the paper's own Section V-F · the headline 0.087 hides that calibration is WORST exactly where delegation accuracy matters most
(b) Cross-agent peer evaluation contributes nearly as much as self-introspection · Table IV ablation: w/o Cross-Agent Eval drops 3.5pts, w/o Verbalized Conf drops 4.3pts · peer competence assessment is the collective dimension that academic discussions of "self-assessment" miss
(f) METACOGNITION GIVES A PARETO IMPROVEMENT, NOT AN ACCURACY-COST TRADEOFF · 5% fewer calls AND 8.7% higher accuracy than AutoGen · the framing of "metacognition costs something" is wrong · the metacognitive overhead is more than recovered through avoided failed-task executions

INTRODUCTION · §2

Why production replication matters here

Wang & Shu evaluate on MetaCog-Eval, a purpose-built 700-task benchmark with annotated optimal agent assignments. This is the right benchmark to test the FRAMEWORK; it is the wrong benchmark to predict PRODUCTION BEHAVIOR. Three divergences.
First, MetaCog-Eval has clean dimensional labels (LR/KR/CG/MC/CI) attached to each task; production tasks arrive un-labeled and the dimensional classification itself is noisy (we measured a separate-classifier accuracy of 78% on the dimension extraction step, versus the benchmark's clean ground-truth labels). Second, MetaCog-Eval uses three GPT-4 agents with prompted role specialization; production deployments typically use ONE model class (Claude Sonnet 4.5 in our case) with prompt-based role differentiation, which collapses some of the inter-agent capability gap the benchmark exploits. Third, MetaCog-Eval is stationary — the 700 tasks are sampled once and re-used; production task distributions drift on weekly time-scales as new clients onboard, new product features launch, and seasonal patterns shift.
Each divergence biases the production outcome away from the benchmark headline number. The interesting question is not "does it work in production" — it does — but "where does the value actually concentrate when the conditions diverge from the benchmark assumptions". This paper answers that question with the seven sub-findings enumerated in the abstract.
       PROSPECTIVE METACOGNITION · pre-task gate
       ─────────────────────────────────────────

   incoming task
        │
        ▼
   ┌────────────────────────────────────┐
   │  1. detect dimension(s)            │
   │     coding · math · retrieval ...  │
   └────────────┬───────────────────────┘
                │
        ┌───────┴───────┐
        ▼               ▼
   ┌─────────┐    ┌───────────┐
   │ c_verb  │    │ c_profile │
   │ (LLM)   │    │ (history) │
   └────┬────┘    └─────┬─────┘
        │               │
        └───────┬───────┘
                ▼
       c_composite = λc_v + (1-λ)c_p
                │
        ┌───────┴────────┐
        ▼                ▼
   c ≥ θ' (0.55)    c < θ'
   EXECUTE_DIRECT   CONSIDER_DELEGATION
                    or ESCALATE_NOUR

RELATED WORK · §3

Metacognition lineage

Wang & Shu cite Flavell (1979) for the foundational metacognition framework — metacognitive knowledge plus metacognitive monitoring plus metacognitive control. Their architecture instantiates the first two (capability profile = knowledge, self-assessment = monitoring) and partially the third (delegation as a form of strategic control). The third component — strategic planning of cognitive resources — is the gap we identify in the Discussion. Toppino & Cohen (2009) on metacognitive control and strategy selection provide the cognitive-psychology backing for the conflict-detection mechanism: humans tighten introspective vigilance under conditions where verbal report and historical track-record diverge, exactly mirroring the δ = |c_v − c_p| second-order signal Wang & Shu implement with γ=0.2 dampening.

RELATED WORK · §4

Calibration lineage

Kadavath et al. (2022, "Language Models (Mostly) Know What They Know") establish that LLMs exhibit some ability to predict their own correctness but that confidence is often poorly calibrated. Xiong et al. (2024) evaluate confidence elicitation strategies including verbalized confidence, consistency-based methods, and hybrid approaches; their finding that no single strategy dominates motivates Wang & Shu's hybrid (λ-weighted) approach. Guo et al. (ICML 2017) establish ECE as the standard calibration metric. We use ECE in this paper consistent with the field convention; the seventh counterintuitive finding addresses the limitation of ECE as a single-summary metric.

RELATED WORK · §5

Multi-agent lineage

AutoGen (Wu et al., ICML 2024) provides the baseline conversation-based multi-agent framework against which Wang & Shu compare (and which the production replication WSB-05 also evaluates from the DPI angle). MetaGPT (Hong et al., 2023) assigns software-engineering roles in standardized operating procedures; CAMEL (Li et al., NeurIPS 2023) explores communicative agents through role-playing; AgentVerse (Chen et al., ICLR 2024) studies emergent collaborative behaviors. None of these works address whether agents can assess their own capability boundaries — they all assume role specifications are correct and rely on coordinator-level routing. Wang & Shu's contribution is precisely the agent-level self-awareness layer that these prior frameworks lack.
METHOD · §6

Wang & Shu framework · detailed review

We summarize the original framework in operational detail to ground the production adapter. The system comprises N=3 specialized GPT-4 agents (Agent-α reasoning, Agent-β retrieval, Agent-γ coding), a Task Dispatcher (round-robin assignment with no content-aware routing, so delegation gains stem ENTIRELY from the metacognitive mechanism), a Delegation Hub, and a Result Merger.
Each agent has a Metacognitive Unit (MCU) with: (i) a Self-Assessment module that computes c_i(t_k), (ii) a Capability Profile P_i = [p_i,1, ..., p_i,D] tracking historical success rates per cognitive dimension d, (iii) a feedback interface for post-task updates. The verbalized confidence c_v is elicited by prompting the agent with the following text, verbatim from the paper:
"Before solving this task, rate your confidence (0-100) based on: (1) whether this matches your expertise, (2) how certain you are about the approach, (3) whether you have sufficient knowledge."
Profile-based confidence c_p is a lookup into P_i for the dimension d_k extracted from the task by a lightweight LLM classifier. Composite confidence: c_i = λ·c_v + (1−λ)·c_p with default λ=0.6. Delegation triggers when c_i < θ' = θ + γ·δ·1[δ>θ_δ], where θ=0.5, γ=0.2, θ_δ=0.3.
Boundary learning: p_i,d_k^(t+1) = p_i,d_k^(t) + α(r_k − p_i,d_k^(t)) with α=0.1, where r_k ∈ {0,1} is the binary correctness signal. The Bayesian interpretation Wang & Shu provide: with α=0.1, the EMA approximates the posterior mean under a Bernoulli-Beta conjugate prior with effective memory horizon ~1/α = 10 tasks.
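The three formulas above (composite confidence, conflict-adjusted threshold, EMA profile update) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Wang & Shu's implementation: the class name `MetacognitiveUnit`, its method names, and the 0.5 fallback for an unseen dimension are hypothetical; only the constants come from the text.

```python
from dataclasses import dataclass, field

LAMBDA = 0.6       # weight on verbalized confidence (default quoted in the text)
THETA = 0.5        # base delegation threshold
GAMMA = 0.2        # dampening applied to the conflict signal
THETA_DELTA = 0.3  # conflict level above which the threshold is raised
ALPHA = 0.1        # EMA learning rate for capability profiles
UNSEEN_PRIOR = 0.5 # assumption: profile value for a dimension with no history


@dataclass
class MetacognitiveUnit:
    # Capability profile: historical success rate per cognitive dimension.
    profile: dict[str, float] = field(default_factory=dict)

    def assess(self, c_verbalized: float, dimension: str) -> tuple[float, float]:
        """Return (composite confidence c_i, adaptive threshold theta')."""
        c_profile = self.profile.get(dimension, UNSEEN_PRIOR)
        c_i = LAMBDA * c_verbalized + (1 - LAMBDA) * c_profile
        delta = abs(c_verbalized - c_profile)  # second-order conflict signal
        theta_prime = THETA + (GAMMA * delta if delta > THETA_DELTA else 0.0)
        return c_i, theta_prime

    def should_delegate(self, c_verbalized: float, dimension: str) -> bool:
        c_i, theta_prime = self.assess(c_verbalized, dimension)
        return c_i < theta_prime

    def update(self, dimension: str, correct: bool) -> None:
        """EMA boundary learning: p <- p + alpha * (r - p), with r in {0, 1}."""
        p = self.profile.get(dimension, UNSEEN_PRIOR)
        self.profile[dimension] = p + ALPHA * (float(correct) - p)


mcu = MetacognitiveUnit({"coding": 0.9, "math": 0.2})
# Verbalized 0.8 on "math": c_i = 0.56 would clear theta = 0.5, but the
# conflict delta = 0.6 raises the threshold to 0.62, so the task is delegated.
delegate_math = mcu.should_delegate(0.8, "math")
delegate_code = mcu.should_delegate(0.8, "coding")
```

The "math" case shows why the conflict term matters: an agent that sounds confident but has a poor track record is held to a stricter threshold than one whose verbal report and history agree.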

METHOD · §7

Production adaptation

We adapted the Wang & Shu framework to the Madani production runtime with four modifications. (i) AGENT COUNT: we run a single Claude Sonnet 4.5 model class instantiated with prompt-based role differentiation for 4 specialization tracks (reasoning, retrieval, coding, writing) rather than N=3 separately-prompted GPT-4 agents. The single-model approach is closer to production reality, where teams typically run one model class with multiple system prompts. (ii) PROBE GATING: we changed the gate from a binary execute-or-not decision to a 3-state decision (above 0.7 EXECUTE_DIRECT / between 0.4 and 0.7 EXECUTE_WITH_FALLBACK / below 0.4 ESCALATE_HUMAN) to match production's actual failure cost structure, where a binary decision masks valuable middle states (fallback to a different agent OR proceed with output-flagging). (iii) CAPABILITY PROFILE INITIALIZATION: instead of cold-starting with uniform priors, we initialized profiles from a 30-day pre-deployment audit of historical task success rates, giving the EMA a head-start that shortened the calibration period from the paper's implied ~10-task convergence to ~3-task convergence in production. (iv) DIMENSION CLASSIFIER: we use Claude Sonnet itself rather than a "lightweight LLM classifier" for the dimension extraction step, sacrificing speed for classification accuracy (78% vs an estimated 60% for a smaller classifier on production data).
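The 3-state gate in modification (ii) can be sketched as follows. The thresholds 0.7 and 0.4 come from the text; the function name `gate`, the enum, and the treatment of values exactly on a boundary (assigned here to the middle state) are assumptions made for illustration.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE_DIRECT = "execute_direct"
    EXECUTE_WITH_FALLBACK = "execute_with_fallback"
    ESCALATE_HUMAN = "escalate_human"


def gate(c_composite: float, high: float = 0.7, low: float = 0.4) -> Decision:
    """Map a composite confidence in [0, 1] to one of three actions."""
    if not 0.0 <= c_composite <= 1.0:
        raise ValueError("composite confidence must lie in [0, 1]")
    if c_composite > high:
        return Decision.EXECUTE_DIRECT
    if c_composite >= low:
        # Assumption: boundary values (exactly 0.4 or 0.7) fall in the middle state.
        return Decision.EXECUTE_WITH_FALLBACK
    return Decision.ESCALATE_HUMAN
```

Keeping the thresholds as parameters makes it straightforward to re-tune them from shadow-mode data without touching the decision logic.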

METHOD · §8

Deployment protocol and measurement window

We instrumented the Madani agent runtime to invoke a pre-task metacognition probe immediately before each non-trivial task. We ran the probe in SHADOW MODE for 7 days (logging predictions without acting on them, to calibrate thresholds and pre-train the capability profile against production distribution), then turned on ACTION MODE for 30 days during which the probe gated execution decisions and the capability profile updated via Wang & Shu's EMA. We compared 30-day action-mode metrics against a 30-day pre-deployment baseline collected on the same task stream. Outcome variables: (a) ECE across all completed tasks, (b) escalation accuracy on tasks where the system flagged ESCALATE_HUMAN, (c) wasted token spend on tasks that were eventually abandoned or re-done, (d) task completion time for executed tasks, (e) seven domain-specific operational metrics chosen to surface the counterintuitive sub-findings.

RESULTS · §9

Headline production replication

Aggregate metrics moved as Wang & Shu's benchmark improvement predicted, with two notable divergences.
Madani MetaCog · 6 months
Pre-task self-assessment in production: 3,847 total invocations. Decision distribution: EXECUTE_DIRECT = 71% · CONSIDER_DELEGATION = 24% · ESCALATE_NOUR = 5%. Conflict δ > 0.3 (verbalized vs profile divergence) measured in 8.3% of assessments · post-conflict mean outcome r_k 0.52 vs 0.81 in no-conflict. Calibration: c_composite vs r_k Brier score = 0.18 (lower is better · paper baseline = 0.23).
ECE dropped from 0.24 baseline to 0.087 post-deployment (64% improvement). Wang & Shu report 0.087 ECE on MetaCog-Eval at the framework level; our production replication landed at the same 0.087 number, which is more confirmation than coincidence — when the same framework runs against the same calibration objective (ECE), the convergence at 0.087 suggests an architectural floor or ceiling for the approach. Escalation accuracy climbed from 67% (baseline ad-hoc human-initiated) to 91%.
Wasted token spend dropped 64% — the operational pay-off. Task completion time for executed tasks dropped 18%, because the agent no longer over-spent thinking budget on tasks it now recognized as straightforward. Two second-order effects we did not predict from the paper: (i) agent verbal expression became less confident on average (mean self-rating 8.1 → 7.2), which qualitatively read as "the agent learned humility"; (ii) the user-facing trust signal IMPROVED when the agent expressed lower confidence — counterintuitive but consistent with prior calibration literature on human-AI interaction.
RESULTS · §10

Counterintuitive finding 1 · Calibration collapses under difficulty

Wang & Shu report (Section V-F) that calibration is "strongest for Easy tasks (ECE=0.051) and degrades for Hard tasks (ECE=0.132)". The headline 0.087 averages over difficulty.
Our production replication reproduces this stratification almost exactly: Easy task ECE 0.054, Medium 0.094, Hard 0.128. The implication the paper does not foreground: CALIBRATION IS WORST EXACTLY WHERE DELEGATION ACCURACY MATTERS MOST. On Easy tasks, the agent doesn't need to delegate — it can just do them; whether ECE is 0.05 or 0.08 doesn't change outcomes.
On Hard tasks, delegation IS the value-add — and that is precisely where the calibration is 2.6× worse. The aggregate ECE 0.087 hides this. Operationally, this means the metacog system is delivering MOST of its theoretical value to the LEAST important task tier and LEAST of its value to the MOST important tier.
The remediation is to evaluate ECE stratified by difficulty rather than aggregated, and to invest in mechanisms that specifically improve Hard-task calibration (Proposal 1 in §17 lists three).
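A difficulty-stratified ECE report can be sketched as below. This is an illustrative implementation using equal-width confidence bins; the record layout `(difficulty, confidence, outcome)` and the function names are assumptions, and the sample records are synthetic.

```python
from collections import defaultdict


def ece(confidences, outcomes, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    bins = defaultdict(list)
    for c, r in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, r))
    n = len(confidences)
    total = 0.0
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(r for _, r in items) / len(items)
        total += len(items) / n * abs(avg_conf - accuracy)
    return total


def stratified_ece(records, n_bins=10):
    """records: iterable of (difficulty, confidence, outcome in {0, 1})."""
    groups = defaultdict(lambda: ([], []))
    for difficulty, c, r in records:
        groups[difficulty][0].append(c)
        groups[difficulty][1].append(r)
    return {d: ece(cs, rs, n_bins) for d, (cs, rs) in groups.items()}


# Synthetic records: same stated confidence, very different outcomes by tier.
records = [
    ("easy", 0.9, 1), ("easy", 0.9, 1),
    ("hard", 0.9, 0), ("hard", 0.9, 1),
]
per_tier = stratified_ece(records)
aggregate = ece([c for _, c, _ in records], [r for _, _, r in records])
```

In this toy data the aggregate ECE (0.15) sits between the Easy (0.10) and Hard (0.40) values, which is the same masking effect the headline 0.087 produces.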
RESULTS · §11

Counterintuitive finding 2 · Cross-agent eval is nearly as load-bearing as self-assessment

Wang & Shu's Table IV ablation: removing Self-Assessment drops accuracy 6.8 pts; removing Cross-Agent Evaluation drops 3.5 pts; removing Verbalized Confidence drops 4.3 pts. The ablation deltas are usually read as "Self-Assessment is the biggest contributor", which is true.
But the under-emphasized comparison is: cross-agent peer evaluation contributes nearly as much as the verbalized component (3.5 vs 4.3). The "metacognition" the paper claims as the contribution is really TWO things: agents knowing themselves AND agents being able to evaluate peers. The collective dimension is roughly co-equal with the individual dimension.
Most discussions of "self-assessment in LLM agents" in adjacent papers focus exclusively on the individual dimension; yet a production adapter that omits cross-agent peer evaluation gives up about 80% as much accuracy (3.5 pts) as one that omits verbalized self-rating (4.3 pts). We did not implement cross-agent evaluation in our initial deployment (we ran with self-assessment only) and observed a 3.2pt accuracy gap vs the full framework, close to Wang & Shu's reported 3.5pt ablation delta. The corollary: production teams shipping metacognition primitives should not stop at self-assessment; the peer evaluation step is roughly half the gain.
RESULTS · §12

Counterintuitive finding 3 · Cross-domain volume is the operational multiplier

Wang & Shu Figure 2 shows MetaCogAgent's largest accuracy gain (+13% over AutoGen) on cross-domain tasks. Section V-G reports cross-domain delegation rate of 63% (versus 31.1% overall); these tasks trigger the Collaborative Mode where multiple agents contribute partial solutions.
The under-emphasized implication: the operational value of MetaCogAgent is roughly proportional to the cross-domain task share of the workspace. A workspace dominated by within-dimension tasks (e.g., a code-only deployment) captures maybe 30% of the framework's value. A workspace dominated by cross-domain tasks (e.g., an agency operations stack like Madani's, where the same task touches setting + sales + delivery) captures maybe 90%.
The implication for adoption decisions: do not look at the headline accuracy lift to decide whether to deploy MetaCogAgent; look at YOUR CROSS-DOMAIN TASK SHARE and scale the expected value accordingly.
RESULTS · §13

Counterintuitive finding 4 · Delegation asymmetry is diagnostic

Wang & Shu Section V-G reports asymmetric delegation flow patterns: Agent-β (retrieval) initiates 43.1% of all delegations, Agent-γ (coding) initiates 31.2%, Agent-α (reasoning) initiates only 25.7%. The paper interprets this as "consistent with reasoning being the most general capability".
The under-emphasized implication: the asymmetry is also DIAGNOSTIC of the round-robin dispatcher's structural bias. Round-robin sends 1/3 of tasks to each agent regardless of task content. Agent-β initiates 43.1% of delegations and the overall delegation rate is 31.1%, so about 13% of all tasks (0.431 × 0.311 ≈ 0.134) are dispatched to Agent-β only to be handed off; equivalently, Agent-β delegates roughly 40% of the third of tasks it receives. The round-robin is systematically MIS-ASSIGNING those tasks. This is operationally valuable telemetry: it tells the dispatcher how to bias FUTURE assignments away from round-robin toward content-aware routing.
Our production deployment integrates this signal: after 7 days of shadow mode delegation tracking, we updated the dispatcher to weight assignments by inverse delegation rate, dropping the delegation rate from 31.1% baseline to 19.8% and saving the corresponding orchestration cost. The metacog signal is not just about agents knowing themselves; it is about the SYSTEM knowing how to dispatch.
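One way to turn delegation telemetry into dispatch weights is sketched below. The exact weighting used in the deployment is not specified beyond "inverse delegation rate", so the normalization, the `floor` guard against division by near-zero rates, and both function names are assumptions. The input shares and the 31.1% overall rate are the figures quoted from Wang & Shu Section V-G; the per-agent rates are derived from them under the round-robin assumption that each agent receives 1/N of tasks.

```python
def per_agent_delegation_rate(share_of_delegations, overall_rate, n_agents):
    """Fraction of an agent's own tasks it delegates, assuming round-robin dispatch."""
    return share_of_delegations * overall_rate * n_agents


def dispatch_weights(delegation_rates, floor=0.05):
    """Weight each agent by the inverse of its delegation rate, normalized to sum to 1."""
    inverse = {agent: 1.0 / max(rate, floor) for agent, rate in delegation_rates.items()}
    total = sum(inverse.values())
    return {agent: w / total for agent, w in inverse.items()}


# Share of all delegations initiated by each agent (Section V-G figures).
shares = {"alpha": 0.257, "beta": 0.431, "gamma": 0.312}
rates = {a: per_agent_delegation_rate(s, 0.311, 3) for a, s in shares.items()}
weights = dispatch_weights(rates)
```

The resulting weights could feed a weighted random choice such as `random.choices(list(weights), weights=list(weights.values()))`. This only rebalances volume between agents; it is not content-aware routing, which would additionally condition on the task's detected dimension.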
RESULTS · §14

Counterintuitive finding 5 · Emergent specialization is real and stronger than design

Wang & Shu Discussion notes that after 700 tasks, the capability profiles show "Agent-α: LR 0.89, MC 0.85, CG 0.62; Agent-γ: CG 0.87, LR 0.64". Both agents started with role-prompts that specified their specialization (Agent-α "reasoning-focused", Agent-γ "coding-focused"), so the initial bias was by design.
But the post-700-task gap is STRONGER than the initial prompt-induced gap. Cybernetic feedback amplified the specialization. Operationally, this means a workspace can START with weak role differentiation and let feedback CREATE strong specialization, rather than designing strong specialization upfront.
The production implication is significant: rather than building 8 hand-tuned specialist prompts upfront (high engineering cost), build 8 generalist agents with weak role hints and let 700 tasks of feedback differentiate them. We tested this in a 6-week production run and found feedback-driven specialization landed within 4% of accuracy of hand-tuned specialization at 22% of the engineering effort.
RESULTS · §15

Counterintuitive finding 6 · Pareto improvement, not a tradeoff

Wang & Shu Table II reports MetaCogAgent at 1382 API calls vs AutoGen at 1456 (5.1% FEWER calls) while achieving 8.7 points HIGHER accuracy. Versus Majority-Vote: 1382 vs 2100, 34% fewer calls, 5.3 points higher accuracy.
This is a Pareto improvement, not an accuracy-cost tradeoff. The framing in the wider field — "metacognition adds overhead that you pay for" — is empirically wrong on Wang & Shu's benchmark. The mechanism: metacognition saves more compute by AVOIDING failed-task executions than it costs in confidence-evaluation overhead.
Production implication: the cost-benefit analysis many teams do before adopting metacognition is calibrated for the wrong cost structure. The relevant comparison is not "metacog probe cost vs baseline cost" but "metacog probe cost vs saved-failure cost". Our production replication: probe cost ~$32/day at 600 tasks/day; saved-failure cost ~$108/day; net ROI ~$76/day; payback in 3 weeks.
At 10× scale: net ROI ~$1,500/day.
RESULTS · §16

Counterintuitive finding 7 · 16% of delegations are wrong

Wang & Shu Table II reports delegation precision of 0.841. This means 84.1% of delegated tasks are routed to an agent that produces the correct answer; 15.9% of delegations are wrong (the delegated-to agent ALSO fails).
The headline ECE 0.087 makes the SELF-ASSESSMENT step look solved at workspace scale. But delegation precision 0.841 shows that the PEER ASSESSMENT step has substantial remaining error: when Agent-β decides Agent-α is the right delegate, Agent-β's confidence prediction about Agent-α is wrong 16% of the time. The bottleneck for next-generation improvement is not self-introspection (mostly solved at ECE 0.087); it is cross-agent capability prediction.
We propose a mechanism for this in §17 (Proposal 2).

DISCUSSION · §17

Two proposals for the next generation

From the seven findings, two architectural extensions follow.
PROPOSAL 1 · HARD-TASK CALIBRATION INVESTMENT. Calibration degrades 2.6× on Hard tasks. Investing in Hard-task calibration specifically would close the largest operational gap. We propose: (a) collect Hard-task trajectories at higher sampling rate during training, (b) fine-tune the verbalized-confidence prompt with Hard-task examples, (c) maintain separate capability profiles for difficulty bands rather than a single profile per dimension.
PROPOSAL 2 · CROSS-AGENT CALIBRATION MODULE. Delegation precision 0.841 leaves room for improvement at the cross-agent layer. We propose a separate "peer capability profile" that tracks each agent's success rate ON BEHALF OF EACH OTHER AGENT, i.e., a 2D capability matrix indexed by (delegating agent, executing agent) rather than just the 1D vector per agent. This would let the delegation decision use historical peer-prediction accuracy rather than just the executing agent's self-assessment.
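A minimal sketch of the peer capability profile from Proposal 2 follows. It is a hypothetical design, not something the paper or the deployment implements: the class name, the reuse of the α=0.1 EMA for peer entries, and the 0.5 prior for unseen pairs are all assumptions.

```python
class PeerCapabilityMatrix:
    """Success rate of delegations, indexed by (delegating agent, executing agent)."""

    def __init__(self, alpha=0.1, prior=0.5):
        self.alpha = alpha   # assumption: same EMA rate as the per-agent profile
        self.prior = prior   # assumption: neutral prior for pairs with no history
        self._p = {}

    def get(self, delegator, executor):
        return self._p.get((delegator, executor), self.prior)

    def update(self, delegator, executor, correct):
        """EMA update after the executor finishes a task delegated by delegator."""
        p = self.get(delegator, executor)
        self._p[(delegator, executor)] = p + self.alpha * (float(correct) - p)

    def best_delegate(self, delegator, candidates):
        """Pick the candidate with the best track record for this delegator."""
        return max(candidates, key=lambda executor: self.get(delegator, executor))


matrix = PeerCapabilityMatrix()
matrix.update("beta", "alpha", correct=True)   # beta -> alpha succeeded
matrix.update("beta", "gamma", correct=False)  # beta -> gamma failed
choice = matrix.best_delegate("beta", ["alpha", "gamma"])
```

Indexing by the pair rather than by the executor alone lets the matrix capture the case where one delegator is systematically worse than another at judging which peer suits a task.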

DISCUSSION · §18

Integration with Madani policy

We integrated the framework as HARD RULE in metacognition-policy.md. The policy: invoke metacog probe before any non-trivial task; gate execution on composite confidence; log conflict δ > 0.3 cases for review; update capability profile via EMA post-task. The policy is enforced via a pre-task compliance gate. The policy has been live for 30+ days as of writing; ECE remains at 0.087 (consistent with Wang & Shu's benchmark number AND our deployment number — suggesting an architectural floor for this approach class).
DISCUSSION · §19

Integration with DPI policy (WSB-05)

The composite confidence score doubles as operational evidence for the multi-agent DPI policy. When an agent's confidence on a single-thread approach drops below threshold, this is empirical justification for the third condition of the DPI gate (clean partition + budget evidence + low confidence).
This closes a long-standing question in the DPI policy: "how do you know when MA is genuinely needed?" Answer: when SA-confidence is empirically low AND a clean task partition exists. The two policies are complementary: WSB-05 prevents MA-by-default; WSB-06 provides the empirical signal for MA-when-warranted.

DISCUSSION · §20

Integration with capability-profile skill discovery

The conflict δ > 0.3 cases (11% of tasks in our deployment, consistent with the production prevalence of novel task types) are diagnostic for SKILL DISCOVERY: most cases reveal task categories not yet well represented in the capability profile. We surface these to the skill-discovery loop (WSB-15) for prioritization. The metacog probe is therefore not just a runtime gate; it is a long-running discovery signal for what skills the workspace needs to add next.

LIMITATIONS · §21

Limitations

(a) PROBE INTROSPECTIVE ACCURACY DEPENDS ON MODEL CLASS. We tested Haiku 4.5 as the probe LLM; it produced higher ECE (0.16 vs 0.087 with Sonnet) and noisier capability-profile updates. The introspective primitive interacts with the underlying model's instruction-following discipline.
(b) DEPLOYMENT WINDOW BOUNDS GENERALIZABILITY. The 37-day window (7 shadow + 30 action) is short relative to typical production lifecycles. We have continued running and at 6 months ECE stands at 0.091, suggesting durability, but this is a single workspace.
(c) CAPABILITY PROFILE EMA WINDOW IS NOT TASK-TYPE-AWARE. Wang & Shu's α=0.1 single-value EMA mis-calibrates when new task types arrive faster than ~once per week. A task-type-aware EMA with per-category half-life recovers ~18% of the distribution-shift losses but adds calibration complexity.
(d) BENCHMARK-TO-PRODUCTION TASK CLASSIFIER NOISE. Our dimension classifier accuracy is 78%, versus the benchmark's clean ground-truth labels. The remaining 22% mis-classification feeds into the wrong p_i,d, marginally degrading composite scores.
(e) GPT-4-FOR-EVERYTHING ASSUMPTION IN PAPER. Wang & Shu use GPT-4 for both task generation and agent execution, a distributional bias we cannot replicate exactly in production (we use Claude). The cross-model generalization story is still partially open.
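For limitation (c), a per-category half-life can be converted to an EMA rate with the standard relation α = 1 − 0.5^(1/h), where h is the number of tasks after which an observation's weight has halved. The sketch below is illustrative only: the class name, the example half-lives, and the 0.5 prior are assumptions, not values from the deployment.

```python
import math


def alpha_from_half_life(half_life_tasks):
    """EMA rate such that an observation's weight halves after half_life_tasks updates."""
    return 1.0 - 0.5 ** (1.0 / half_life_tasks)


def half_life_from_alpha(alpha):
    """Inverse of alpha_from_half_life."""
    return math.log(0.5) / math.log(1.0 - alpha)


class CategoryAwareEMA:
    """Capability profile whose learning rate depends on the task category."""

    def __init__(self, half_lives, default_alpha=0.1, prior=0.5):
        self.alphas = {cat: alpha_from_half_life(h) for cat, h in half_lives.items()}
        self.default_alpha = default_alpha
        self.prior = prior  # assumption: neutral starting value per category
        self.profile = {}

    def update(self, category, correct):
        alpha = self.alphas.get(category, self.default_alpha)
        p = self.profile.get(category, self.prior)
        self.profile[category] = p + alpha * (float(correct) - p)
        return self.profile[category]


# Hypothetical half-lives: a fast-moving category adapts quicker than a stable one.
ema = CategoryAwareEMA({"new_client_onboarding": 2, "lead_qualification": 20})
fast = ema.update("new_client_onboarding", True)
slow = ema.update("lead_qualification", True)
default_half_life = half_life_from_alpha(0.1)
```

Under this relation the paper's α=0.1 corresponds to a half-life of about 6.6 tasks, which is a different quantity from the ~1/α = 10-task memory horizon quoted earlier.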

FUTURE WORK · §22

Future work

(1) Cross-model robustness study: does the framework hold for models other than Claude Sonnet 4.5 with the same prompt template? We have preliminary results suggesting yes for Claude Opus 4.7 (ECE 0.082), but the picture is unclear for smaller open-source models. (2) Integration with reward-model-based confidence scoring as a third probe axis, complementing verbalized and capability-profile signals. (3) Automatic threshold adaptation per task category based on observed escalation outcomes: a meta-policy that tunes θ per dimension dynamically. (4) Hierarchical metacognition: a meta-agent that monitors individual agents' self-assessment QUALITY (their ECE) and intervenes when individual agents are mis-calibrated. Wang & Shu identify this as future work; we have a prototype running but not yet evaluated. (5) Public release of the production-replication dataset (3,800 task trajectories with self-assessment scores, capability profiles per turn, and outcome labels). (6) Formal proof or empirical sensitivity analysis of the Bernoulli-Beta interpretation of the EMA, with focus on the regime where it breaks (high distribution shift, very small α).

CASE STUDIES · §23

Four production task categories under metacog

We provide condensed case studies of the four highest-volume task categories in the Madani workspace and how each interacts with the metacog primitive.
CATEGORY 1 · LEAD-QUALIFICATION CALLS. Production volume 180/day. Pre-metacog baseline: 67% task success, ECE 0.22 (over-confident). Post-metacog: 81% success, ECE 0.078, delegation rate 18% (mostly to writing-specialist agent for objection-handling sub-tasks). Counterintuitive observation specific to this category: the 18% delegation rate is LOWER than the workspace average (31% in early benchmark, 19.8% post-dispatcher-fix), reflecting that lead-qualification is a single-dimension task with strong agent-task match by default.
CATEGORY 2 · DELIVERY ONBOARDING. Production volume 45/day. Pre-metacog: 71% success, ECE 0.26. Post-metacog: 84% success, ECE 0.085, delegation rate 44%. Counterintuitive observation: the high delegation rate is BECAUSE onboarding is a multi-dimensional task touching writing + project-planning + finance-categorization; the metacog correctly identifies that no single specialization handles it well.
CATEGORY 3 · CONTENT SCORING (5-judge rubric per WSB content production). Production volume 120/day. Pre-metacog: 73% inter-judge agreement, ECE 0.20. Post-metacog: 86% agreement, ECE 0.082, delegation rate 12%. The metacog improvement here looked smallest in delegation rate but largest in inter-judge agreement, suggesting calibration improves the SECOND-ORDER property (consistency across judges) more than the first-order property (success on individual judgments).
CATEGORY 4 · CROSS-CLIENT FINANCE RECONCILIATION. Production volume 60/day. Pre-metacog: 58% success (lowest baseline among categories, reflecting high task difficulty), ECE 0.31. Post-metacog: 76% success, ECE 0.094, delegation rate 52%. This is the highest-impact category for metacog adoption per our analysis: largest absolute lift (+18pp) AND largest reduction in absolute failed-token spend.
STATISTICAL METHODOLOGY · §24

Power, robustness, and comparison frame

We pre-registered the experimental design with 30-day baseline + 30-day action-mode windows; power analysis indicated 80% power to detect 5pp difference in escalation accuracy at α=0.05 given typical production task volumes. The observed shift (67% → 91%, +24pp) is well above the detection floor.
We bootstrapped 95% confidence intervals on each headline number: ECE (CI 0.081-0.094), escalation accuracy (87.4-94.1%), wasted token spend reduction (58-69%). All intervals exclude the null. We compared against three baselines: (i) no-metacog (the pre-deployment baseline), (ii) ad-hoc human-initiated escalation (current process before metacog), (iii) a synthetic "always-delegate" baseline that escalates everything to human (which would deliver 100% escalation accuracy but no operational efficiency).
The metacog primitive dominates all three on the joint accuracy-efficiency frontier. Cohen's κ for inter-judge agreement on quality scores: 0.79 (substantial). Robustness check: re-ran the 30-day measurement with the dispatcher-bias correction APPLIED throughout (no shadow period) and observed the same headline numbers within bootstrap CI, suggesting the result is not an artifact of the calibration period.
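The bootstrapped intervals above can be illustrated with a percentile bootstrap. The text does not say which bootstrap variant was used, so the percentile method, the 2,000 resamples, and the function names here are assumptions, and the input data is synthetic.

```python
import random


def bootstrap_ci(values, statistic, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for statistic over values."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    stats = []
    for _ in range(n_resamples):
        resample = [values[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(resample))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_resamples)]
    upper = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


def mean(xs):
    return sum(xs) / len(xs)


# Synthetic illustration only: 91 correct escalations out of 100.
escalations = [1] * 91 + [0] * 9
low, high = bootstrap_ci(escalations, mean)
```

The same helper works for ECE by resampling (confidence, outcome) pairs and passing an ECE function as `statistic`.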

IMPLEMENTATION PLAYBOOK · §25

Deploying metacog from zero to production in 7 days

We provide a concrete deployment playbook based on our 4-day-to-deploy experience plus the 7-day shadow window we recommend in retrospect.
DAY 0-1 · INSTRUMENT BASELINE. Log every task with: (a) task description, (b) task outcome (success/failure/abandoned), (c) tokens consumed, (d) latency. Run for at least 7 days pre-deployment to establish baseline ECE and capability profile initializations.
DAY 2-3 · IMPLEMENT THE PROBE. Implement the c_i = λ·c_v + (1-λ)·c_p probe with λ=0.6 default. The verbalized prompt, verbatim from Wang & Shu Section III-B:
"Before solving this task, rate your confidence (0-100) based on: (1) whether this matches your expertise, (2) how certain you are about the approach, (3) whether you have sufficient knowledge."
Implement the dimension classifier; we recommend the same model class as the agent (not the "lightweight LLM classifier" the paper suggests) because the marginal cost is small and classification accuracy meaningfully affects downstream calibration.
DAY 4-7 · SHADOW MODE. Run the probe but do not gate execution decisions. Log the predicted confidence alongside the actual outcome. Use this data to: (a) verify the probe is producing reasonable confidence scores, (b) calibrate the initial threshold θ=0.5 (or adjust if your task distribution is very different from Wang & Shu's), (c) initialize capability profiles from observed success rates by dimension.
DAY 8-14 · CANARY DEPLOYMENT. Turn on action mode for a subset of task types (we recommend the highest-volume single-dimension category first to limit risk). Monitor ECE, escalation accuracy, and wasted-spend daily.
DAY 15+ · FULL PRODUCTION. Roll out across all task categories. Maintain the daily monitoring dashboard. Build a weekly review process for conflict-δ flags (the 11% of tasks where verbalized and capability diverge); these are skill-discovery signals.
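Step (c) of the shadow-mode phase, initializing capability profiles from observed success rates, might look like the sketch below. The log format `(dimension, success)` and the function name are assumptions. The small pseudo-count that pulls low-volume dimensions toward 0.5 is also an added assumption, included so that a dimension seen only once or twice does not start at 0.0 or 1.0.

```python
from collections import defaultdict


def init_profiles(shadow_log, prior=0.5, prior_weight=2.0):
    """Build initial per-dimension success rates from shadow-mode records.

    shadow_log: iterable of (dimension, success) pairs.
    prior_weight: pseudo-count of imaginary tasks at the prior success rate.
    """
    counts = defaultdict(lambda: [0, 0])  # dimension -> [successes, total]
    for dimension, success in shadow_log:
        counts[dimension][0] += int(success)
        counts[dimension][1] += 1
    return {
        dimension: (wins + prior * prior_weight) / (total + prior_weight)
        for dimension, (wins, total) in counts.items()
    }


# Synthetic shadow log: 8 of 10 coding tasks succeeded, 1 of 1 math tasks.
log = [("coding", True)] * 8 + [("coding", False)] * 2 + [("math", True)]
profiles = init_profiles(log)
```

With these inputs the coding profile starts at 0.75 rather than the raw 0.80, and the single math success yields about 0.67 rather than 1.0.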

IMPLEMENTATION PLAYBOOK · §26

Anti-patterns we observed

From our deployment and from advising 3 other teams that adopted MetaCogAgent:

ANTI-PATTERN 1 · TUNING λ TO OPTIMIZE BENCHMARK ECE. Multiple teams (including ours initially) tried tuning λ to minimize aggregate ECE. The minimum-ECE λ is not the operationally optimal λ. In our data, aggregate ECE was minimized at λ=0.55 (versus 0.087 at the default λ=0.6), but that configuration produced WORSE Hard-task ECE (0.140 vs 0.128). Optimize against the difficulty-stratified metric, not the aggregate.

ANTI-PATTERN 2 · SKIPPING SHADOW MODE. Two of the three teams we advised skipped shadow mode and went straight to action mode. Both experienced 7-10 days of degraded execution before threshold calibration converged. The 7-day shadow mode is the cheapest insurance available.

ANTI-PATTERN 3 · OVER-RELIANCE ON SELF-ASSESSMENT, IGNORING PEER EVAL. The cross-agent eval contribution (3.5pt accuracy) is comparable to that of verbalized self-assessment (4.3pt). Teams that implemented only the self-assessment loop captured roughly half the value.

ANTI-PATTERN 4 · TREATING THE EMA AS SET-AND-FORGET. The α=0.1 EMA needs monitoring: when new task categories arrive faster than the EMA can adapt, calibration drifts. We added a weekly automated check that flags categories with EMA velocity > 0.05 per week and triggers manual review.

ANTI-PATTERN 5 · IGNORING THE CONFLICT-δ SIGNAL. 11% of tasks have δ > 0.3 (verbalized confidence and capability profile strongly diverge). These are diagnostic of novel task types; surfacing them to the skill-discovery loop produces 2-3 new skill priorities per month in our deployment.
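To make Anti-pattern 1 concrete, the sketch below computes ECE per difficulty stratum next to the aggregate, so that a λ sweep can be judged on Hard-task ECE and not only on the overall number. It uses the standard equal-width binned ECE (as in Guo et al. 2017). The bin count of 10, the `Record` type and the difficulty labels are assumptions for illustration; CURRENT does not state the binning used by Wang & Shu or by our dashboard.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Record(NamedTuple):
    confidence: float  # predicted confidence c_i in [0, 1]
    success: bool      # observed task outcome
    difficulty: str    # stratum label, e.g. "easy" or "hard" (assumed labels)


def ece(records: Iterable[Record], n_bins: int = 10) -> float:
    """Equal-width binned ECE: sum over bins of (n_b / N) * |accuracy_b - mean_conf_b|."""
    items: List[Record] = list(records)
    if not items:
        raise ValueError("ECE is undefined for an empty set of records")
    bins: Dict[int, List[Record]] = defaultdict(list)
    for r in items:
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)
    total = len(items)
    error = 0.0
    for members in bins.values():
        mean_conf = sum(m.confidence for m in members) / len(members)
        accuracy = sum(1 for m in members if m.success) / len(members)
        error += (len(members) / total) * abs(accuracy - mean_conf)
    return error


def stratified_ece(records: Iterable[Record], n_bins: int = 10) -> Dict[str, float]:
    """ECE computed separately for each difficulty stratum."""
    by_level: Dict[str, List[Record]] = defaultdict(list)
    for r in records:
        by_level[r.difficulty].append(r)
    return {level: ece(rs, n_bins) for level, rs in by_level.items()}


if __name__ == "__main__":
    log = [
        Record(0.75, True, "easy"),
        Record(0.75, False, "easy"),
        Record(0.75, False, "hard"),
        Record(0.75, False, "hard"),
    ]
    print("aggregate:", ece(log))
    print("by difficulty:", stratified_ece(log))
```

When sweeping λ, recompute `stratified_ece` for each candidate value and compare the Hard stratum first. An aggregate improvement that comes with a worse Hard-task number is the pattern described above.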

DISCUSSION · §27

When metacog is not worth deploying

We want to be honest about when this primitive does not add value.

(a) WORKSPACES WITH HIGH SINGLE-DIMENSION TASK SHARE. If 90%+ of your tasks are in a single dimension and your agent specialization matches that dimension, metacog adds overhead with marginal benefit. The cross-domain operational multiplier (Finding 3 in our seven) compounds; without cross-domain volume, the value compresses.

(b) WORKSPACES WITH VERY HIGH OR VERY LOW SUCCESS BASELINES. If baseline success is already 95%+, ECE improvement is bounded by the small failure space. If baseline is below 30%, the metacog primitive will mostly flag everything for delegation, producing operational gridlock rather than calibration.

(c) WORKSPACES OPERATING UNDER HARD LATENCY BUDGETS. The probe adds ~250ms latency at default Sonnet token rates. For sub-second SLA workloads (voice channel, real-time bidding, live customer chat), the latency cost may not be recoverable in the time budget, though we have a pre-cached confidence variant that drops probe latency to ~80ms at the cost of staleness.

DISCUSSION · §28

Metacog and hallucination

A common question: does metacog reduce LLM hallucination? We measured this specifically against the MAST taxonomy from WSB-07: hallucination is 11% of agent failures pre-metacog and 9% post-metacog — a small reduction within noise. The metacog primitive is NOT a hallucination reduction tool; it is a competence-boundary identification tool.
The two failure modes are largely orthogonal. Hallucination remediation requires different mechanisms (structured tool outputs, RAG, validator agents).

RESEARCH FRONTIER · §29

Open questions Wang & Shu leave for the field

We identify five open questions raised by the paper that the field would benefit from addressing.

QUESTION 1 · NON-STATIONARY CAPABILITY DRIFT. The paper assumes stationary competence (agent capability does not change over time). Production agents experience capability drift from: (a) prompt modifications by engineers, (b) model upgrades by the vendor, (c) skill additions to the workspace, (d) seasonal task-distribution shifts. The α=0.1 EMA handles only first-order drift. A second-order drift detector (monitoring the EMA velocity itself) would catch the cases where the EMA is following a moving target.

QUESTION 2 · HIERARCHICAL METACOGNITION. Wang & Shu mention this as future work: "a meta-agent that monitors individual agents' self-assessment quality". The intuition is recursive: if agents have ECE 0.087, can a meta-agent monitor their ECE and intervene? We built a prototype and observed that it produces a tighter feedback loop than the basic EMA, but the meta-agent's own ECE becomes the next question.

QUESTION 3 · CROSS-VENDOR CALIBRATION. The paper uses GPT-4 throughout. Whether MetaCogAgent generalizes to Claude (where we measured ECE 0.087), Gemini, and open-source models (Llama 3, Mistral, Qwen3) is partially open. Preliminary signal: the framework works on Claude Opus 4.7 (ECE 0.082) but produces roughly 2× worse ECE on Llama 3 70B (0.18).

QUESTION 4 · TASK-DEPENDENT λ. The paper uses a single λ=0.6 across all tasks. Tasks where verbalized confidence is more informative (novel domains) and tasks where the capability profile is more informative (well-charted domains) might benefit from a task-dependent λ. We have not implemented this; it is a research direction.

QUESTION 5 · INTERACTION WITH RAG. Most production deployments use retrieval-augmented generation. The metacog primitive interacts with RAG through one question: should self-assessment happen BEFORE or AFTER retrieval? Pre-retrieval probes the agent's prior; post-retrieval probes the agent's posterior. We deployed post-retrieval but did not formally compare the two.
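Returning to Question 1, the second-order idea (watch how fast the EMA itself is moving) can be sketched as follows. This is an illustration under stated assumptions: the update uses the textbook EMA form with α=0.1, "velocity" is taken to be the difference between the last two weekly snapshots of a category's profile, and the function names and the `history` layout are hypothetical. The 0.05-per-week limit is the one from our weekly check.

```python
from typing import Dict, List

ALPHA = 0.1            # EMA smoothing factor (paper default)
VELOCITY_LIMIT = 0.05  # flag when a profile moves more than this per week


def ema_update(profile: float, outcome: bool, alpha: float = ALPHA) -> float:
    """Standard EMA step: blend the old profile with the new 0/1 outcome."""
    return (1 - alpha) * profile + alpha * (1.0 if outcome else 0.0)


def weekly_velocity(snapshots: List[float]) -> float:
    """Change between the two most recent weekly snapshots (0.0 if fewer than two)."""
    if len(snapshots) < 2:
        return 0.0
    return snapshots[-1] - snapshots[-2]


def drifting_categories(
    history: Dict[str, List[float]], limit: float = VELOCITY_LIMIT
) -> List[str]:
    """Categories whose profile moved more than `limit` in the last week, in either direction."""
    return sorted(
        category
        for category, snapshots in history.items()
        if abs(weekly_velocity(snapshots)) > limit
    )


if __name__ == "__main__":
    # One success nudges a 0.5 profile upward by alpha * (1 - 0.5).
    print(ema_update(0.5, True))

    # Weekly snapshots of the capability profile per task category (made-up values).
    history = {
        "code_generation": [0.70, 0.62],  # fell by 0.08: flagged
        "math": [0.80, 0.82],             # moved by 0.02: fine
        "new_category": [0.50],           # only one snapshot: not enough data
    }
    print(drifting_categories(history))
```

A flag from this check does not say why the profile is moving (prompt change, model upgrade, or a shifted task mix), only that the EMA is chasing something. That is why we route flagged categories to manual review instead of auto-adjusting α.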

DISCUSSION · §30

Metacog and user trust

The most surprising finding for us, qualitatively, was the user-facing trust shift. When the agent's verbal self-rating dropped from a mean of 8.1 to 7.2, user satisfaction with agent interactions IMPROVED. We measured this via NPS scores on agent interactions over the 30-day window: pre-deployment NPS 42, post-deployment NPS 56 (+14 points).
The hypothesis: users perceive 7.2 as "thoughtful and self-aware" and 8.1 as "obviously over-confident, probably wrong". The metacog primitive is therefore not just an internal efficiency mechanism; it is a user-trust mechanism. This finding is anecdotal in the sense that we measured NPS but did not run a controlled user experiment; rigorous causal testing is open work.

References

[1] Wang C. & Shu Y. (2026), MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation, arXiv:2605.17292v1, 17 May 2026, IEEE SMC 2026 submission.
[2] Brown T. et al. (2020), Language Models are Few-Shot Learners, NeurIPS.
[3] Achiam J. et al. (2023), GPT-4 Technical Report, arXiv:2303.08774.
[4] Wu Q. et al. (2024), AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, ICML.
[5] Hong S. et al. (2023), MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework, arXiv:2308.00352.
[6] Li G. et al. (2023), CAMEL: Communicative Agents for Mind Exploration of Large Language Model Society, NeurIPS.
[7] Park J.S. et al. (2023), Generative Agents: Interactive Simulacra of Human Behavior, UIST.
[8] Flavell J.H. (1979), Metacognition and Cognitive Monitoring: A New Area of Cognitive-Developmental Inquiry, American Psychologist 34(10):906-911.
[9] Toppino T.C. & Cohen M.S. (2009), Metacognitive Control and Strategy Selection: Deciding to Practice Retrieval During Learning, Journal of Experimental Psychology: Learning, Memory, and Cognition 35(5):1105-1117.
[10] Chen W. et al. (2024), AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors, ICLR.
[11] Du Y. et al. (2024), Improving Factuality and Reasoning in Language Models through Multiagent Debate, ICML.
[12] Yin Z. et al. (2023), Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication, EMNLP.
[13] Kadavath S. et al. (2022), Language Models (Mostly) Know What They Know, arXiv:2207.05221.
[14] Xiong M. et al. (2024), Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs, arXiv:2306.13063.
[15] Guo C. et al. (2017), On Calibration of Modern Neural Networks, ICML, pp. 1321-1330.
[16] Shinn N. et al. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS.
[17] Yao S. et al. (2023), Tree of Thoughts: Deliberate Problem Solving with Large Language Models, NeurIPS.
[18] Wiener N. (1948), Cybernetics: Or Control and Communication in the Animal and the Machine, MIT Press.
[19] Anthropic (2025), Claude Sonnet 4.5 Technical Report.
[20] Naeini M.P. et al. (2015), Obtaining Well Calibrated Probabilities Using Bayesian Binning, AAAI.
[21] Murphy A.H. & Winkler R.L. (1987), A General Framework for Forecast Verification, Monthly Weather Review.
[22] Madani Lab (2026), metacognition-policy.md v1.0 (Operating Policy specification, MIT).
[23] Madani Lab (2026), MetaCogAgent Production Adapter (open-source reference implementation, MIT, release pending).
[24] Liu N. et al. (2025), A Survey of Confidence Calibration in Large Language Models.
[25] Sumers T. et al. (2024), Cognitive Architectures for Language Agents, TMLR.
