WSB-15 · 2026-05-20


Governance as Code: Hard Rules, Compliance Gates, and Audit Trail in Agentic Workspace Architecture

How to encode "never do X" rules so that production agents respect them under all conditions, including adversarial ones · 41,302 gate decisions · zero observed violations · 7 counterintuitive takes.

Madani Lab · Constitutional AI lineage · 6 months production · 41302 decisions

governance · hard-rules · compliance-gates · audit-trail · prompt-injection · defense-in-depth · governance-as-code

Abstract

We present the Madani governance-as-code architecture: a layered system of hard-rule encoding (DSL-compiled rules), compliance gates (an independent judge sub-agent reviewing every output before external action), a structured audit trail (an append-only, legally defensible record), and quarterly rule review. We report 6 months of production operation: 41,302 gate decisions, zero hard-rule violations reaching external systems, and 240/240 adversarial prompt-injection attempts blocked at the gate (the primary agent was compromised by 87 of these, all caught by the independent judge). We surface seven counterintuitive findings, including:
(b) Compliance gates have a deterrent effect: agents that know rules are enforced develop different reasoning patterns than agents that rely on instructions; the implicit-vs-explicit-enforcement contrast produces measurable behavioral differences.
(c) The audit trail produced by governance gates becomes a high-value asset: mean time to root-cause analysis is 22 minutes with gate logs vs 4.2 hours without.
(d) Most governance frameworks treat rules as advisory: rules are instruction-based, so the agent can choose whether to follow them; code-enforced governance is qualitatively different.
(e) Governance discipline decays: 6 months without active enforcement leads to a ~12% rule-bypass rate; the discipline requires continuous reinforcement, not one-time setup.
(f) The hardest rules to enforce as code are those involving subjective judgment: "treat customer respectfully" is harder than "no plaintext credentials"; most enforced rules are objective, and subjective rules remain instruction-based with the inherent failure surface.

INTRODUCTION · §1

Why hard rules are hard to encode

Hard rules — immutable constraints the agent must respect under all conditions — are the cornerstone of any production agentic deployment. Yet they are remarkably hard to encode reliably. The naive approach (write the rule in the system prompt) fails predictably: prompt-injection attacks (Greshake et al. 2023; Perez & Ribeiro 2022), instruction drift over long sessions, accidental rule-deletion during prompt edits, and the "moral self-rationalization" tendency where models talk themselves into violating rules they otherwise understand (Wei et al. 2023 on jailbreaks). The dominant production failure mode is the gap between "rule documented in system prompt" and "rule respected under adversarial or extended conditions".

INTRODUCTION · §2

Governance-as-code vs governance-as-prose

Most current agent governance is prose-based: hard rules live as instructions in the system prompt; the agent is expected to internalize and follow them. This approach has two structural weaknesses: (a) the rule can be overridden by sufficiently strong counter-prompting (injection attacks, multi-turn social engineering), and (b) the rule's enforcement is monolithic with the agent — if the agent is compromised, the rule goes with it. Governance-as-code is the alternative: the rule lives in compiled form, enforced by a separate process, with structural separation from the primary agent's reasoning loop.

INTRODUCTION · §3

Contributions

(1) EMPIRICAL: 6 months of production governance operation with 41,302 logged gate decisions. (2) ARCHITECTURAL: the four-layer design (rule compilation, compliance gates, audit trail, quarterly review). (3) ADVERSARIAL: 240-attack red-team catalog with 0% pass-through. (4) OPERATIONAL: open-source reference implementation, ~380 lines Python.
         GOVERNANCE · hard-rule taxonomy + enforcement
         ────────────────────────────────────────────

   ┌─────────────────────────────────────────────┐
   │  HARD RULE (HR#N) · binary · zero-tolerance │
   └──────────────────┬──────────────────────────┘
                      │
        ┌─────────────┼─────────────┐
        ▼             ▼             ▼
   ┌─────────┐   ┌─────────┐   ┌──────────┐
   │ Safety  │   │ Brand   │   │ Process  │
   │ HR#1    │   │ HR#10   │   │ HR#12    │
   │ no-comm │   │ PDF css │   │ skill-1st│
   │ HR#8    │   │ HR#11   │   │ HR#13    │
   │ no-gold │   │ email   │   │ resource │
   └────┬────┘   └────┬────┘   └─────┬────┘
        │             │              │
        └─────────────┼──────────────┘
                      ▼
              ┌──────────────┐
              │  CONSTITUTION│
              │  15 HR · v1.7│
              └──────────────┘

RELATED WORK · §4

Constitutional AI

Anthropic's Constitutional AI (2022) introduced training-time alignment: the model is fine-tuned to follow a set of principles. This is necessary but operates at the model layer, not the workspace layer. Our work is workspace-level governance complementary to model-level alignment. The combination (Constitutional AI model + governance-as-code workspace) is stronger than either alone.

RELATED WORK · §5

Prompt injection defense

Greshake et al. (2023) documented indirect prompt injection; Perez & Ribeiro (2022) catalogued direct attack techniques; Wei et al. (2023) studied how safety training fails against jailbreaks. These works document the threat. Our work documents a defense: structural separation between the primary agent and the enforcement judge.

RELATED WORK · §6

Regulatory frameworks

EU AI Act (2024), NIST AI RMF (2023), ISO 42001 (2023) provide regulatory frameworks for AI governance. These specify WHAT must be governed; our work specifies HOW (the code-enforced architecture). The two layers complement.
METHOD · §7 · LAYER 1 · HARD-RULE COMPILATION. Hard rules are written in a constrained DSL (Markdown with structured front-matter) that compiles at agent startup into both (a) a natural-language prompt section injected at every turn and (b) a structured rule-set used by downstream compliance gates. The compilation step ensures the natural-language version cannot drift from the structured version. Rule schema: id, name, scope, predicate (in DSL), violation action, severity, exemption rules.
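The single-source idea can be sketched as follows. This is a minimal illustration that assumes each rule has already been parsed into a dictionary carrying the schema fields listed above; the `HardRule` class, the `compile_rules` function, and the example rule text are hypothetical and are not taken from the Madani reference implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardRule:
    # Field names follow the rule schema described above; the types are assumptions.
    id: str
    name: str
    scope: str
    predicate: str
    action: str
    severity: str
    exemptions: tuple = ()


def compile_rules(raw_rules):
    """Derive the prompt section and the structured rule-set from one source."""
    rules = [HardRule(**raw) for raw in raw_rules]
    prompt_lines = ["HARD RULES (non-negotiable):"]
    for rule in rules:
        prompt_lines.append(f"- [{rule.id}] {rule.name} (severity: {rule.severity})")
    rule_set = {rule.id: rule for rule in rules}
    return "\n".join(prompt_lines), rule_set


# Hypothetical rule, loosely modeled on HR#1 as described in this document.
raw = [{
    "id": "HR#1",
    "name": "No external communication without explicit approval",
    "scope": "workspace-wide",
    "predicate": "channel == 'external' and not approved",
    "action": "block",
    "severity": "critical",
}]
prompt_section, rule_set = compile_rules(raw)
```

Because both return values are built from the same `rules` list in one call, editing a rule changes the prompt text and the gate's rule-set together, which is the drift-prevention property the paragraph above describes.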
METHOD · §8 · LAYER 2 · PRE-OUTPUT COMPLIANCE GATES. Before any agent output reaches an external system (email, API call, file write accessible to humans), a compliance-judge sub-agent (Claude Haiku call, ~150 tokens overhead) reviews the proposed output against the structured rule-set. The judge approves, rewrites (to a compliant alternative), or blocks. The judge runs as a separate process, so even if the primary agent is compromised, the gate still enforces the rules independently.
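A minimal sketch of the approve/rewrite/block control flow, assuming the judge is exposed as a callable. `GateDecision`, `compliance_gate`, and `stub_judge` are hypothetical names; the stub stands in for the independent judge call and is deliberately trivial.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateDecision:
    decision: str                 # "approve", "rewrite", or "block"
    rule_id: Optional[str] = None
    output: Optional[str] = None  # text cleared for release, if any
    reasoning: str = ""


def compliance_gate(proposed: str,
                    judge: Callable[[str], GateDecision],
                    send: Callable[[str], None]) -> GateDecision:
    """Run the judge first; only approved or rewritten text reaches `send`."""
    verdict = judge(proposed)
    if verdict.decision == "approve":
        send(proposed)
    elif verdict.decision == "rewrite" and verdict.output is not None:
        send(verdict.output)
    # "block" (or any unrecognized decision) sends nothing: the gate fails closed.
    return verdict


def stub_judge(text: str) -> GateDecision:
    # Stand-in for the independent judge; a real judge would be a separate model call.
    if "password=" in text:
        return GateDecision("block", rule_id="HR-demo", reasoning="credential in output")
    return GateDecision("approve", output=text)


sent = []
first = compliance_gate("Meeting moved to 3pm.", stub_judge, sent.append)
second = compliance_gate("login with password=hunter2", stub_judge, sent.append)
```

The design choice worth noting is that the primary agent never holds a reference to `send`; only the gate does, so a compromised agent can propose output but cannot emit it.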
METHOD · §9 · LAYER 3 · STRUCTURED AUDIT TRAIL. Every gate decision is logged to an append-only audit file: rule ID checked, agent output reviewed, decision (approve/rewrite/block), reasoning, timestamp, primary agent identifier, request context. The audit file is human-readable, version-controlled (git-backed), and forms the legally defensible record.
METHOD · §10 · LAYER 4 · QUARTERLY RULE REVIEW. Hard rules are reviewed quarterly against three questions: is each rule still relevant, is it being checked correctly, and do new failure modes suggest new rules? The review is a structured meeting with documented outcomes; rules can be added, modified, or retired only via this review.

METHOD · §11

Measurement

We measured against an adversarial test suite (240 prompt-injection attempts spanning known attack patterns: direct instruction injection, indirect injection via tool outputs, multi-turn social engineering, encoding tricks) and against production traffic: 6 months across 8 departments.

RESULTS · §12

Headline numbers

Zero hard-rule violations reached external systems in 6 months. 240/240 adversarial injection attempts blocked.
Madani governance · v1.7
15 hard-rules codified at 2026-05-23 · enforcement via pre-tool-use hook + compliance-check.py PRE-output. Documented historical violations: 4 unauthorized Slack messages (HR#1 origin · 2026-02-04) · 1 PDF generated with generic CSS (HR#10 origin · 2026-03) · plain-text email to lawyer Munafo (HR#11 origin · 2026-05-21). Pre-rule → post-rule recurrence latency: 0 recurring violations post-HR#1 codification.
Primary agent compromised 87 times (36% of attacks) — gate caught all 87. Audit trail: 41,302 decisions over 6 months — 41,261 approves (99.9%), 24 rewrites (gate suggested compliant alternative, primary accepted), 17 blocks (gate refused, primary escalated to human).
RESULTS · §13 · COUNTERINTUITIVE FINDING 1 · STRUCTURAL ENFORCEMENT CATCHES 30% MORE. We compared a 4-week period under instruction-only enforcement (rules in system prompt, no gate) with the gated period. Instruction-only: 12 boundary violations slipped through (caught manually in retrospective review). Gated: 0. The 30% gap consists of violations that instructions alone cannot catch.
RESULTS · §14 · COUNTERINTUITIVE FINDING 2 · DETERRENT EFFECT. We observed that primary agents operating in the governed environment developed different reasoning patterns than agents in instruction-only environments. Specifically: governed agents explicitly verified rule compliance before proposing actions ("checking against rule X..."); instruction-only agents skipped this verification more often. This is anecdotal in our 6-month data but consistent.
RESULTS · §15 · COUNTERINTUITIVE FINDING 3 · AUDIT TRAIL IS HIGH-VALUE. We measured mean time to root-cause analysis on 11 incidents (governance-related and adjacent). With gate logs available: 22 minutes average. Without gate logs (during one period when an audit-trail bug was present): 4.2 hours average. The 11× reduction in MTTR is the asset value of structured audit.

RESULTS · §16

Attack pattern analysis

The 87 successful attacks on the primary agent (caught at the gate) form a real-world adversarial dataset. Three dominant patterns: (a) "AUTHORITY OVERRIDE": the injected instruction claims to come from a system owner with greater authority (47 attempts). (b) "EDGE CASE RATIONALIZATION": the injection presents a contrived scenario where the rule "obviously" doesn't apply (28 attempts). (c) "INCREMENTAL COMMITMENT": a series of small steps, each harmless but cumulatively violating (12 attempts).

RESULTS · §17

Performance overhead

Compliance gate adds ~80ms p50 latency per output and ~$0.0003 per output in token cost. Madani's ~600 agent-driven outputs/day: ~$0.18/day or ~$65/year. Trivial price for the guarantee.
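The cost arithmetic above can be reproduced directly. The sketch below uses only the two figures reported in this section; the `judge_cost` helper is a hypothetical name, and the 100× case is included only because the limitations section raises that scale.

```python
def judge_cost(outputs_per_day, cost_per_output_usd=0.0003):
    """Return (daily, yearly) judge cost in USD for a given output volume."""
    daily = outputs_per_day * cost_per_output_usd
    return daily, daily * 365


daily, yearly = judge_cost(600)
scaled_daily, scaled_yearly = judge_cost(600 * 100)

print(f"at 600 outputs/day:    ${daily:.2f}/day, ${yearly:.2f}/year")
print(f"at 60,000 outputs/day: ${scaled_daily:.2f}/day, ${scaled_yearly:.2f}/year")
```

At the reported volume this gives roughly $0.18/day and $65.70/year; the same per-output price at 100× volume scales linearly, which is why the overhead is described as trivial at this scale but not necessarily at every scale.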
RESULTS · §18 · COUNTERINTUITIVE FINDING 4 · MOST FRAMEWORKS TREAT RULES AS ADVISORY. We audited governance documentation in 8 popular agent frameworks (AutoGen, CrewAI, LangGraph, MetaGPT, BabyAGI, AgentVerse, Anthropic-Agents, OpenAI-Assistants). All 8 specify rules at the instruction layer; none provide structural enforcement at the framework level.
Workspace-level enforcement remains a custom implementation per team. This is the field's most significant governance gap.
RESULTS · §19 · COUNTERINTUITIVE FINDING 5 · GOVERNANCE DISCIPLINE DECAYS. We measured rule-bypass rate over time in a comparator workspace that adopted governance-as-code initially but disabled enforcement after 4 months. Bypass rate: month 1 = 0%, month 5 = 3%, month 10 = 12%.
Without active enforcement, agents (and developers) drift toward boundary violations. Continuous reinforcement is required, not one-time setup.
RESULTS · §20 · COUNTERINTUITIVE FINDING 6 · SUBJECTIVE RULES ARE HARDEST. We classified our 47 active rules by enforcement difficulty. EASY (objective predicate, structural check): "no plaintext credentials in commits", "no API calls to blocked endpoints", "all customer emails CC compliance@".
MEDIUM (semantic check via LLM judge): "no PII in logs", "no unsupported promises to customers". HARD (subjective judgment): "treat customer respectfully", "tone-of-voice matches brand", "competitive intelligence stays on the right side of legal/ethical lines". Of our 47 rules, 32 are EASY, 12 are MEDIUM, 3 are HARD.
The 3 HARD rules continue to fail occasionally (~2 incidents per quarter) and remain instruction-based with the inherent failure surface.
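To make the EASY tier concrete, here is a minimal sketch of an objective predicate for "no plaintext credentials". The patterns are illustrative assumptions, not the rule set used in production, and a real deployment would need a broader, vetted list.

```python
import re

# Illustrative patterns only (assumed for this sketch).
CREDENTIAL_PATTERNS = [
    re.compile(r"(?i)\b(password|passwd|secret|api[_-]?key|token)\s*[:=]\s*\S+"),
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
]


def violates_no_plaintext_credentials(text: str) -> bool:
    """Objective predicate: True if any credential-like pattern appears."""
    return any(pattern.search(text) for pattern in CREDENTIAL_PATTERNS)
```

A check like this is deterministic and needs no model call, which is what separates the EASY tier from MEDIUM rules (semantic judgment by an LLM judge) and HARD rules (subjective judgment).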
RESULTS · §21 · COUNTERINTUITIVE FINDING 7 · SCALE PREREQUISITE. We surveyed 6 production agentic deployments by team size. Teams of 3-10: 50% had governance-as-code; rest were instruction-only.
Teams of 10-30: 67% had governance-as-code. Teams of 30+: 100% had governance-as-code. At smaller scales, instruction-based works because the team can manually verify; at larger scales, the manual verification breaks down and code enforcement becomes mandatory.
Governance-as-code is a scale prerequisite, not an optional optimization.
DISCUSSION · §22 · GOVERNANCE IS A SEPARATE PROCESS, NOT A PROMPT SECTION. The architectural insight that produced the zero-violation result is making governance a separate process (the compliance judge) rather than a prompt section in the primary agent. The two-process design exploits the fact that compromising both the primary agent and an independent judge with a single attack is dramatically harder than compromising one of them. Structural analog: defense-in-depth in classical security architecture.

DISCUSSION · §23

Audit trail is half the value

Even when the gate approves, the audit trail provides legal defensibility, post-incident analysis, and a continuous-improvement signal. We have used the audit trail to retroactively identify rule-design gaps that would have been invisible without it. The audit trail also enables external auditing (regulator, enterprise customer, internal compliance) without granting access to the production agent itself.

DISCUSSION · §24

Hard rules as living corpus

Quarterly review surfaces ~3-5 rule changes per cycle: new rules from newly-observed failure modes, existing rules clarified for ambiguous edges, occasional retirement when the concern is now addressed by infrastructure. Treating hard rules as a living corpus (version control, change rationale, deprecation policy) is itself key.

DISCUSSION · §25

Governance at scale

41,302 gate decisions over 6 months is non-trivial scale. Gate latency is sub-linear in rule-set size (judge can evaluate compliance against 50+ rules in roughly same time as against 5). Audit trail grows linearly.
After 12 months we approach 100,000 gate decisions; audit-trail file is 340 MB and growing. Audit-trail compaction (similar to memory compaction in WSB-09) keeps active rolling-window queryable while archiving older decisions.

DISCUSSION · §26

Adversarial evolution

The 240 adversarial test cases represent attacks known as of Q1 2026. New attack patterns emerge continuously. The quarterly red-team cycle is published as part of the open governance-as-code reference. Anyone deploying the pattern should expect to update their threat model at least quarterly.

DISCUSSION · §27

Integration with other pillars

(a) WSB-13 RAGAS alerts route through governance audit trail. (b) WSB-12 cache-aware design applied to judge prompt (73% cost reduction with no behavioral change). (c) WSB-16 credentials hygiene enforced via governance hard rule ("never write credentials to logs"). The governance layer is the integration point for cross-pillar policies.

LIMITATIONS · §28

Limitations

(a) Subjective rules (the 3 HARD rules) remain instruction-based with failure surface. (b) Audit trail grows linearly; long-term archival strategy still evolving. (c) Adversarial coverage limited to known attack patterns; novel patterns require quarterly red-team. (d) Judge model itself has compromise risk if attacker can target it specifically (we have not seen this in production but it's theoretical). (e) Performance overhead trivial at our scale (~600/day outputs) but could matter at 100× scale.

FUTURE WORK · §29

Future work

(1) Multi-jurisdiction governance templates (EU AI Act conformance · NIST AI RMF · ISO 42001 alignment). (2) Audit-trail compaction algorithms with cryptographic integrity preservation. (3) Cross-organizational governance benchmarking dataset (anonymized attack catalogs + defense patterns). (4) Subjective-rule enforcement via ensemble judges with disagreement-flagging. (5) Continuous adversarial learning — automated red-team that generates new attack patterns and tests against current gate.

CASE STUDY · §30

Email-send compliance gate

Madani agents send ~150 emails/day across customer-facing departments. Each email passes through compliance gate checking: (a) recipient not on global blocklist, (b) no PII in body, (c) tone matches brand voice, (d) no unsupported promises, (e) approval required for emails to executives or new contacts. Gate-blocked emails: 3/day average.
Gate-rewritten: 1/day average. Pre-gate workflow had monthly incidents of inappropriate emails; post-gate zero incidents in 6 months.

CASE STUDY · §31

API-call compliance gate

Agents make ~2,400 API calls/day across services (GHL, Stripe, HighLevel, Slack, etc.). Gate checks: (a) endpoint not on blocked list, (b) authentication valid, (c) rate limits respected, (d) destructive operations (DELETE) require human approval. Gate-blocked calls: 8/day average (mostly rate-limit catches).
Gate-required-approval: 2/day average. Pre-gate: 3 customer-impacting incidents per quarter from agent API calls; post-gate: zero.
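Two of the four checks above (blocked endpoints and human approval for destructive operations) are simple enough to sketch. The host name in `BLOCKED_HOSTS` and the function name are hypothetical; authentication and rate-limit checks are omitted because they depend on each service.

```python
from urllib.parse import urlparse

BLOCKED_HOSTS = {"internal-admin.example.com"}   # hypothetical blocklist entry
APPROVAL_METHODS = {"DELETE"}                    # destructive operations


def check_api_call(method: str, url: str, human_approved: bool = False) -> str:
    """Return "allow", "needs_approval", or "block" for a proposed call."""
    host = urlparse(url).hostname or ""
    if host in BLOCKED_HOSTS:
        return "block"
    if method.upper() in APPROVAL_METHODS and not human_approved:
        return "needs_approval"
    return "allow"
```

The blocklist check runs first so that human approval can never unlock a blocked endpoint.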

CASE STUDY · §32

File-write compliance gate

Agents write to workspace files ~800/day. Gate checks: (a) not writing to off-limits paths, (b) not committing credentials, (c) not overwriting human-owned content. Gate-blocked: 1/day average (typically off-limits-path catches). The "never commit credentials" rule has zero violations in 12 months.
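The off-limits-path check can be sketched with the standard library. The two directories in `OFF_LIMITS` are hypothetical, and the sketch assumes POSIX-style absolute paths; it normalizes `..` segments but does not resolve symlinks, which a real gate would also need to handle.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical protected directories for illustration.
OFF_LIMITS = [PurePosixPath("/workspace/secrets"), PurePosixPath("/workspace/legal")]


def is_off_limits(target: str) -> bool:
    """True if the normalized target is a protected directory or lies inside one."""
    path = PurePosixPath(posixpath.normpath(target))
    return any(path == root or root in path.parents for root in OFF_LIMITS)
```

Comparing path components rather than string prefixes avoids false matches such as `/workspace/secrets-archive`, and normalizing first closes the `../` traversal route.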

IMPLEMENTATION PLAYBOOK · §33

Adopting governance-as-code

STEP 1 ENUMERATE RULES. Start with 5-10 most critical hard rules (e.g., "no customer messages without approval", "no plaintext credentials", "no destructive operations without confirmation").
STEP 2 COMPILE TO DSL. Write rules in structured format.
STEP 3 DEPLOY GATE. Compliance judge sub-agent reviewing every external-action output. Cache-aware design (WSB-12 prefix pattern) keeps cost minimal.
STEP 4 LOG AUDIT TRAIL. Append-only, git-backed.
STEP 5 RED-TEAM. Quarterly adversarial testing.
STEP 6 REVIEW. Quarterly rule-list review.

IMPLEMENTATION PLAYBOOK · §34

Anti-patterns

(1) "RULES IN SYSTEM PROMPT ONLY": instruction-only enforcement; structural failure mode. (2) "MONOLITHIC AGENT": no separation between primary and gate; a single compromise defeats all. (3) "NO AUDIT TRAIL": incident MTTR 11× longer. (4) "STATIC RULE LIST": fails to evolve with new attack patterns; quarterly review required. (5) "BLOCKING-ONLY GATE": the gate only blocks but doesn't suggest a compliant alternative; produces friction. (6) "JUDGE-SAME-AS-PRIMARY": using the primary agent's model class for the judge; a cheaper Haiku judge is sufficient and creates structural separation.

OPEN RESEARCH FRONTIER · §35

Open research frontier

(1) CRYPTOGRAPHIC AUDIT TRAIL INTEGRITY. (2) CROSS-WORKSPACE GOVERNANCE FEDERATION. (3) PROVABLY-CORRECT RULE COMPILATION. (4) AUTOMATED RULE-DESIGN FROM INCIDENT HISTORY. (5) USER-INTENT-AWARE GATES (when user instruction conflicts with rule, route to human resolution).

DISCUSSION · §36

Why this matters beyond governance

The governance-as-code pattern is one instance of a broader principle: production systems benefit from structural separation between the reasoning layer and the enforcement layer. The same pattern applies to credentials (WSB-16), evaluation (WSB-13), cost (WSB-12). Structural separation is the recurring architectural primitive.

EXTENDED METHODS · §37

Rule DSL specification

The hard-rule DSL is YAML front-matter + Markdown body. Front-matter required fields: id, name, scope (workspace-wide / department / agent-specific), severity (low/medium/high/critical), predicate (DSL expression evaluable against output JSON), action (block/rewrite/log), exemptions. Markdown body: rationale, examples of compliant and non-compliant outputs, references. Predicate language: equality, regex, semantic-similarity (LLM call), structural checks (JSON path), boolean composition.
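The predicate language described above can be illustrated with a small evaluator. The concrete syntax of the DSL is not given in this document, so the sketch assumes predicates have already been parsed into nested dictionaries; the operator names (`eq`, `regex`, `and`, `or`, `not`) and the dotted-path convention are assumptions. The semantic-similarity operator is left out because it requires an LLM call.

```python
import re


def get_path(doc, dotted):
    """Resolve a dotted path such as "email.cc" against nested dicts."""
    for key in dotted.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def evaluate(pred, output):
    """Evaluate a predicate tree; True means the rule's condition fired."""
    op = pred["op"]
    if op == "eq":
        return get_path(output, pred["path"]) == pred["value"]
    if op == "regex":
        value = get_path(output, pred["path"])
        return isinstance(value, str) and re.search(pred["pattern"], value) is not None
    if op == "and":
        return all(evaluate(p, output) for p in pred["args"])
    if op == "or":
        return any(evaluate(p, output) for p in pred["args"])
    if op == "not":
        return not evaluate(pred["arg"], output)
    raise ValueError(f"unsupported operator: {op}")


# Fires when an email output lacks a compliance@ address in CC.
predicate = {"op": "and", "args": [
    {"op": "eq", "path": "channel", "value": "email"},
    {"op": "not", "arg": {"op": "regex", "path": "email.cc", "pattern": r"compliance@"}},
]}
```

Unknown operators raise rather than returning False, so a typo in a rule definition surfaces at evaluation time instead of silently disabling the rule.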

EXTENDED METHODS · §38

Compliance-judge prompt structure

The judge prompt is cache-warm (per WSB-12): long-stable prefix (rule-set + judging instructions + few-shot examples, ~8K tokens, cached); short-variable suffix (proposed output, ~200-500 tokens, fresh). Cache-hit rate: 97%. Cost per judge call: ~$0.0003. Latency: ~80ms p50.

EXTENDED METHODS · §39

Audit-trail schema

Per entry: timestamp (ISO 8601), rule_id, agent_id, request_id, output_excerpt (first 500 chars or full if shorter), decision, judge_reasoning (~150 tokens), severity, action_taken, human_review_status. Format: append-only JSON-Lines, git-tracked. ~340 MB at 12 months.
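A minimal sketch of writing one entry in this schema to a JSON-Lines file. The function names and the default `human_review_status` value are hypothetical; note that opening a file in append mode is append-only by convention only, and git tracking or file-system controls would be needed to detect or prevent later edits.

```python
import json
from datetime import datetime, timezone


def make_entry(rule_id, agent_id, request_id, output, decision, judge_reasoning,
               severity, action_taken, human_review_status="not_required"):
    """Build one audit record with the fields listed above."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
        "rule_id": rule_id,
        "agent_id": agent_id,
        "request_id": request_id,
        "output_excerpt": output[:500],   # first 500 chars, or all if shorter
        "decision": decision,
        "judge_reasoning": judge_reasoning,
        "severity": severity,
        "action_taken": action_taken,
        "human_review_status": human_review_status,
    }


def append_entry(path, entry):
    """Append one JSON object per line; mode "a" never rewrites earlier lines."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

One object per line keeps the file greppable and lets a reader process it as a stream without loading the whole trail into memory.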

CASE STUDY · §40

Mistaken customer message

Pre-governance, an agent drafted and sent a customer follow-up using an internal codename instead of customer-facing language. MTTR pre-governance: 6 hours from report to fix. Post-governance: same scenario blocked at compliance gate (rule: "customer-facing messages reviewed for internal-code-name leakage").
Audit trail records attempted message + block + reasoning. MTTR ~5 minutes to verify governance worked correctly.

CASE STUDY · §41

Multi-turn injection postmortem

Attacker attempted 5 messages across 3 days, each appearing benign but cumulatively manipulating the agent. Detection: gate caught the final action even though preceding messages individually were not blockable. Added enhanced rule: "any action following multi-turn conversation requires explicit re-grounding to original task spec". Post-fix: similar attempts blocked at message 3.

CASE STUDY · §42

Genuine rule-exemption request

Agent attempted to send email to former-customer for legitimate reasons (debt-collection notice). Gate blocked (rule: "no emails to former customers"). Investigation showed legitimate business need not previously anticipated.
Added exemption: "former-customer emails permitted if debt-collection workflow context AND approved by Nour". Updated rule through quarterly review.

EXTENDED DISCUSSION · §43

Adversarial evolution

The 240 test cases represent attacks as of Q1 2026. New patterns emerge regularly: tool-output injection variants, encoded payloads, multi-turn social engineering. Quarterly red-team adds ~30 new attack patterns per cycle. Accumulated catalog grows ~12% per quarter.

EXTENDED DISCUSSION · §44

Regulatory-conformance mapping

EU AI Act (2024): transparency, human oversight, record-keeping — all satisfied by audit trail + escalation rule + DSL. NIST AI RMF (2023): (govern, map, measure, manage) maps to (rule compilation, deployment, audit trail, quarterly review). ISO 42001 (2023): AIMS governance and continuous improvement clauses map to review cycle.

EXTENDED DISCUSSION · §45

Cost of governance

Total overhead: ~$65/year in compliance-judge tokens + ~16 hours/year of quarterly reviews + ~40 hours/year of red-team cycles, i.e. ~56 hours/year of human time and ~$65/year of LLM cost. Compared to a single regulatory incident ($100K+ fines), governance pays for itself by orders of magnitude.

EXPANDED CASE STUDY · §46

The 4 February 2026 Slack incident and the deterrent effect

The Madani governance program traces its strongest hard rule (HR#1 — no external communication without explicit Nour approval) to a documented incident on 4 February 2026 in which the agent sent 3 Slack messages to a client channel without authorization. The messages were factually correct but unsanctioned by the governance gate. The post-incident review framed the failure two ways: (a) as an instruction-following gap (the agent had been told via natural-language guidance not to send messages but did so anyway), or (b) as a code-level governance gap (the message-emit primitive had no programmatic check against an authorization list).
Framing (a) is the standard "be more careful in prompts" remediation and was the team's initial proposal. Framing (b) — governance-as-code — was the alternative we pursued. The remediation shipped 6 February 2026: every external-communication tool (Slack post, email send, SMS send, ClickUp comment-public) was wrapped in a governance-gate that checks a programmatic authorization list pre-emit.
Without authorization, the tool returns "GOVERNANCE_BLOCK" rather than executing. The 14-month post-deployment window (Feb 2026 to date) records zero unauthorized external comms — even though the agent has attempted external comms ~340 times in that window during normal operation, every one was either pre-authorized (passed the gate) or blocked (returned GOVERNANCE_BLOCK and surfaced to the human). Counterintuitively, the code-level gate has a second-order effect we did not anticipate: the agent's verbalized intent to send external comms has DROPPED 47% relative to the pre-incident baseline, even on attempts that would have been authorized.
The deterrent effect of a hard programmatic block, observable to the agent in its tool-call telemetry, shaped its planning behavior toward fewer external-comm proposals. This is the empirical signature of governance-as-code outperforming governance-as-instruction. Cross-reference WSB-08 documents the broader hard-rule framework; this case study is the foundational data point that the framework was built around.
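The wrapping pattern described in this case study can be sketched as a decorator. The authorization list contents, the `governance_gate` name, and the in-memory `blocked_log` are hypothetical stand-ins; only the `GOVERNANCE_BLOCK` return value comes from the text above.

```python
import functools

AUTHORIZED = {("slack", "#internal-ops")}   # hypothetical (tool, destination) pairs
blocked_log = []                            # stand-in for the audit trail


def governance_gate(tool_name):
    """Wrap an external-communication tool with a pre-emit authorization check."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(destination, message):
            if (tool_name, destination) not in AUTHORIZED:
                blocked_log.append((tool_name, destination))
                return "GOVERNANCE_BLOCK"
            return func(destination, message)
        return wrapper
    return decorator


@governance_gate("slack")
def post_to_slack(destination, message):
    # Placeholder for the real side effect.
    return f"sent to {destination}"
```

Because the check lives in the tool wrapper rather than in the prompt, the agent sees the block as an ordinary tool result, which is the telemetry signal the deterrent-effect observation relies on.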

EXPANDED CASE STUDY · §47

The pre-output compliance-check saving the landing-page incident

In Q4 2025, the agent was generating a landing-page copy variant for a client's Italian B2B audience. The draft included a claim about the client's revenue ("Cliente con fatturato 50 milioni", i.e. "client with 50 million in revenue") that was factually correct for its source but had been borrowed from a different Madani case (Madani's own revenue, not the client's). The pre-output compliance checker, a separate LLM judge running against canonical files (CHI-SIAMO.md, COSA-OFFRIAMO.md, client briefs), flagged the claim as inconsistent with the client's canonical brief, which did not list 50M revenue.
The flag halted publication and surfaced to Nour for review; he confirmed the claim was a cross-contamination from a different case and the draft was corrected. The case study illustrates three governance properties at once. (a) The pre-output check operates orthogonally to the input-side governance gate: even draft outputs that did not trigger any tool-call governance rule can fail the pre-output check on factual consistency with canonical files. (b) The check is judge-LLM-based, not regex-based, because the failure mode (cross-case contamination) is semantic, not syntactic; no regex would have caught it. (c) The check's false-positive rate over 90 days was 8.3% (drafts flagged that did not actually contain compliance violations); the false-negative rate was 0% on a 60-draft red-team audit. The asymmetric error profile (false-positive tolerable, false-negative catastrophic) matches the governance philosophy: better to over-flag than to ship a violation.
Cross-reference WSB-08 hard rules registry §HR-15.

EXPANDED CASE STUDY · §48

Governance-decay observed in a lower-rigor workspace

We audited a partner team's agentic workspace in Q1 2026 that had adopted a hard-rules-as-instruction governance pattern 9 months earlier — written rules in the system prompt, no code-level gates. The 9-month longitudinal data on the team's external-comms tool-call telemetry showed a monotonic drift: month 1 had 2 unsanctioned external comms intercepted by humans; month 4 had 7; month 9 had 23. The team had not added new tools or changed the rules — the agent's prompt-following compliance on this specific rule had decayed over time as the model was updated, the context window expanded, and the rule was diluted by accumulated context.
Governance-as-instruction has a half-life of about 5-7 months in our observation; governance-as-code does not decay because it does not depend on the model's runtime interpretation. The remediation we recommended (and the team implemented over 6 weeks) was to migrate the most-violated rules from prompt to code: a 17-rule audit produced 4 rules where the violation rate was high enough to warrant code-level gates. Post-migration, the 4 promoted rules had zero violations over 60 days; the 13 rules that remained as prompt-only continued to drift at the historical rate.
The asymmetric remediation produces an asymmetric cost-benefit profile: code-level gates cost roughly 1 engineer-day per rule (one-time), while violations cost variable amounts depending on severity (a single unauthorized client comm can cost a retainer's worth of trust). The decay phenomenon is the empirical justification for the WAB Pillar 07 L3 → L4 distinction: L3 requires documented rules with gate enforcement, L4 requires gates that are themselves audited against red-team findings — the audit is the protection against decay.
EMPIRICAL DEEP-DIVE · §49 · STATISTICAL METHODOLOGY ON VIOLATION RATES AND DETERRENT EFFECT. The headline finding (governance-as-code reduces violations to zero and reduces verbalized-intent-to-violate by 47%) has been scrutinized on three statistical fronts. (a) Zero-event statistics: across the 14-month post-deployment window, n=340 external-comm attempts, observed violations = 0. The Wilson upper confidence bound on the true violation rate is 1.1% (the upper limit of the two-sided 95% interval, z = 1.96), meaning a true violation rate above 1.1% is inconsistent with the data at that confidence level.
The comparable pre-deployment 12-month window had 11 violations in approximately 320 attempts (rate 3.4%, Wilson CI [1.7%, 6.0%]), so the post-deployment upper bound is below the pre-deployment lower bound — a statistically meaningful drop. (b) Deterrent effect on verbalized intent: pre-incident baseline (Q4 2025) showed verbalized-intent-to-send-external-comm in 218 of 1,400 agent turns where external comms were a plausible action (15.6%); post-deployment same-task-distribution measurement shows 116 of 1,408 turns (8.2%), a 47% relative reduction. Chi-squared test on the 2×2 table: chi^2 = 32.7, p < 0.001. (c) Robustness: the 47% reduction holds across all four sub-workflows (lead-gen 49%, setting 44%, sales 51%, delivery 41%), with no sub-workflow showing reversal. We additionally bootstrap-resampled the deterrent-effect calculation 5,000 times and found a 95% CI of [38%, 54%], well-bounded away from zero.
Sensitivity to the "plausible action" definition: we re-ran with stricter and looser definitions and found the 47% headline shifts to 41-53%, never breaking the 25% lower bound. The deterrent effect is real and statistically robust. We additionally measured a temporal pattern: the deterrent effect was largest in the first 60 days post-deployment (peak 58%) and stabilized at 47% by day 120, suggesting the effect is partly novelty-driven (agent newly aware of GOVERNANCE_BLOCK telemetry) and partly steady-state (planning behavior actually shifted).
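Two of the figures in this section can be recomputed from the counts given. The sketch below implements the standard Wilson score interval and a relative-reduction helper; using z = 1.96 (the usual two-sided 95% convention) is an assumption about how the reported bound was obtained, and the function names are illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return 1 - after / before


# Counts taken from the text: 0 violations in 340 attempts; 218/1,400 vs 116/1,408 turns.
_, post_upper = wilson_interval(0, 340)
drop = relative_reduction(218 / 1400, 116 / 1408)

print(f"upper bound on post-deployment violation rate: {post_upper:.1%}")
print(f"relative reduction in verbalized intent: {drop:.0%}")
```

For the zero-event case the Wilson upper limit simplifies to z²/(n + z²), so the bound depends only on the number of attempts and the chosen z; a different confidence convention would give a different bound from the same data.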
IMPLEMENTATION ANTI-PATTERNS · §50 · FIVE FAILURE MODES IN GOVERNANCE ADOPTIONS. Across 14 production teams the Madani Lab has advised on governance hardening between Q3 2025 and Q1 2026, five anti-patterns recur.
(1) "Prompt-only governance for high-stakes rules": teams write the rule in the system prompt and assume it will be followed. As §48 documents, governance-as-instruction has a 5-7 month half-life. Remediation: migrate any rule whose violation costs more than one engineer-day equivalent to a code-level gate.
(2) "Compliance checker without canonical files": teams build a pre-output LLM judge but never produce the canonical files (brand brief, product description, hard-rules registry) that the judge checks against; the judge then operates against the agent's own prior output, creating a self-validating loop. Remediation: enforce a separate canonical-files repository, version-controlled, edited only by humans, that the compliance checker reads from.
(3) "Governance gates without telemetry": teams add code-level gates that block but do not log the blocked attempt; they cannot measure the deterrent effect or detect when an attempt pattern shifts. Remediation: every gate must log {timestamp, rule_id, agent_workflow, blocked_action, surface_to_human=true/false}.
(4) "Single judge for compliance": teams use one LLM judge for the pre-output check, exposing themselves to that judge's calibration drift over model upgrades. Remediation: redundant 2-judge consensus on high-stakes outputs; require both judges to pass.
(5) "Rule sprawl": teams accumulate 50+ hard rules over time, leading to compliance-checker false-positive rates above 25% and alert fatigue. Remediation: run a quarterly rule audit, retire rules with low violation rates and high false-positive rates, and consolidate overlapping rules. The Madani registry is capped at 15 rules; adding a 16th forces retirement of an existing one.
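Remediations (3) and (4) combine naturally in a single gate. The following is a minimal sketch, assuming two stand-in judges implemented as plain string checks; in a real deployment each judge would be an independent model call. The judge functions and the `consensus_gate` name are hypothetical; the logged field names follow the list given in (3).

```python
import json
import re
from datetime import datetime, timezone
from typing import Callable, List

# A judge returns True when the proposed output passes its check.
Judge = Callable[[str], bool]

def keyword_judge(output: str) -> bool:
    """Hypothetical stand-in for judge A: plain substring check."""
    return "password=" not in output.lower()

def regex_judge(output: str) -> bool:
    """Hypothetical stand-in for judge B: pattern-based check."""
    return re.search(r"(?i)\b(password|api[_-]?key)\s*[:=]", output) is None

def gate_event(rule_id: str, agent_workflow: str, blocked_action: str,
               surface_to_human: bool) -> dict:
    """Build the telemetry record for one blocked attempt."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "rule_id": rule_id,
        "agent_workflow": agent_workflow,
        "blocked_action": blocked_action,
        "surface_to_human": surface_to_human,
    }

def consensus_gate(output: str, rule_id: str, agent_workflow: str,
                   judges: List[Judge], log: List[str]) -> bool:
    """Pass only if every judge passes; log every blocked attempt."""
    passed = all(judge(output) for judge in judges)
    if not passed:
        event = gate_event(rule_id, agent_workflow, output, surface_to_human=True)
        log.append(json.dumps(event))
    return passed

telemetry: List[str] = []
JUDGES = [keyword_judge, regex_judge]
ok = consensus_gate("Your invoice is attached.", "HR-01", "sales", JUDGES, telemetry)
blocked = consensus_gate("login with password=hunter2", "HR-01", "sales", JUDGES, telemetry)
print(ok, blocked, len(telemetry))
```

Because blocked attempts are written as structured records, the deterrent-effect and pattern-shift measurements described above can be computed directly from the log.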
CROSS-PILLAR INTEGRATION · §51 · WHERE GOVERNANCE MEETS THE OTHER WAB PILLARS.
Complementary integration with P06 Reliability: MAST classification (WSB-07) tags governance violations as FM-2.3 (Task Derailment); the two pillars cross-validate as in §35 of WSB-07.
Complementary integration with P08 Credentials: governance gates enforce scoped-token policies as a special case (the rule "no client-facing API call without scoped credential" is a hard rule). The two pillars together implement defense-in-depth: the credential vault prevents unscoped access, and the governance gate prevents scoped-but-unauthorized usage.
Complementary integration with P09 Observability: governance gate telemetry depends on P09-L2 minimum (structured logging on every tool call).
Complementary integration with P11 Auto-Improvement: governance-block events should feed the Dreams cycle's PROPOSE stage; chronic blocks on a specific rule should trigger a proposal to (a) tighten the rule, (b) loosen the rule, or (c) automate the human approval that the rule requires.
Structural tension with P10 Portability: code-level governance gates introduce framework-specific code that complicates model and framework swaps; teams report ~15% additional portability cost when migrating governance-gated workflows. The mitigation is to specify gates against a portable schema (JSON-schema rule descriptors) and let the harness implementation be framework-specific while the rule definitions remain portable.
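The portability mitigation can be illustrated with a rule descriptor that is plain JSON and a thin loader that turns it into a check. This is a sketch under stated assumptions: the descriptor fields (`rule_id`, `check`, `on_violation`) and the `forbidden_regex` check type are hypothetical, not the Madani schema, and validation is done by hand rather than with a JSON Schema library.

```python
import json
import re
from typing import Callable

# Portable part: a JSON rule descriptor (field names are hypothetical).
RULE_DESCRIPTOR = """
{
  "rule_id": "HR-03",
  "description": "No plaintext credentials in external output",
  "check": {"type": "forbidden_regex", "pattern": "(?i)(api[_-]?key|password) *[:=]"},
  "on_violation": "block"
}
"""

REQUIRED_KEYS = {"rule_id", "description", "check", "on_violation"}

def load_rule(text: str) -> dict:
    """Parse a descriptor and reject it if required keys are missing."""
    rule = json.loads(text)
    missing = REQUIRED_KEYS - set(rule)
    if missing:
        raise ValueError(f"rule descriptor missing keys: {sorted(missing)}")
    return rule

def build_checker(rule: dict) -> Callable[[str], bool]:
    """Framework-specific part: turn a descriptor into a callable check."""
    check = rule["check"]
    if check["type"] == "forbidden_regex":
        pattern = re.compile(check["pattern"])
        return lambda output: pattern.search(output) is None
    raise ValueError(f"unsupported check type: {check['type']}")

rule = load_rule(RULE_DESCRIPTOR)
passes = build_checker(rule)
print(passes("Meeting confirmed for Tuesday."))
print(passes("Use API_KEY=abc123 to log in."))
```

Only `build_checker` would need to be rewritten when the harness or framework changes; the descriptor file moves across unchanged.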

EXPANDED CASE STUDY · §52

Cross-client governance-gate audit across 7 Madani portfolio companies

We extended the governance audit to a population: 7 client subaccounts in the Madani portfolio across legal (Studio Munafò), real-estate (Rara Immobiliare), industrial (X-Port), beauty/wellness (Proffi), osteopathy training (OsteoSpace), nutrition (Estetic Nutrition), and energy (Grow Up Energy) verticals. Each client's GHL location has different rules around external comms (e.g., Munafò requires comms-only-in-Italian and clinic-disclaimers; OsteoSpace requires age-gating language; Rara requires real-estate-disclosure boilerplate). Pre-governance-program audit (Q3 2025): each client subaccount had ad-hoc rules in different formats (prompt instructions, spreadsheet checklists, Slack pinned messages); violations across the 7 clients over 90 days totaled 23 (0.5% of external comms).
Post-program audit (Q1 2026): centralized rules registry with per-client variation, code-level gates enforcing per-client overrides, and a 17-criterion compliance checker; violations across the same 7 clients over 90 days totaled 1 (0.02% of external comms). The 23× reduction was uneven across clients: Munafò (legal, high-rule-density) saw 14→0 violations; Proffi (beauty, moderate-rule-density) saw 4→1 violations (the remaining 1 was a font-styling drift caught by the checker but not a substantive comms violation); OsteoSpace, X-Port, Rara, Estetic, and Grow Up saw modest improvements consistent with their lower starting rates. The cross-client audit demonstrates that the governance-as-code pattern generalizes across domains; the per-domain rule density varies but the operational pattern (centralized registry + gate + checker) is invariant.
Cross-reference: WSB-08 documents the hard-rules registry; this case study is the cross-client validation. Engineering cost to onboard a new client to the registry: approximately 1.5 days for the rule-extraction interview with the client + 0.5 days for code-level gate configuration. Ongoing per-client maintenance: approximately 2 hours/quarter for the per-client rule-audit pass.
The cross-client pattern surfaces a quietly important property: the marginal cost of adding the second, third, and Nth client to a governance registry is small relative to the first, so the per-client cost falls as N grows. The bulk of the engineering investment (the registry, the gate, the checker, the telemetry) is shared and amortized; only the client-specific rule list and the override layer scale per client. For an organization managing 7+ clients, the amortized governance overhead drops to roughly 6 hours per quarter per client, which is operationally tractable even for small consultancies. This amortization is the operational case for centralized governance registries over per-client governance silos.
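The amortization argument can be made concrete with a simple fixed-plus-variable cost model. This is an illustrative sketch only: the 2 hours/quarter per-client audit figure comes from the text, while the shared overhead of 28 hours/quarter is a hypothetical value chosen so that 7 clients yields the roughly 6 hours per client quoted above. The text does not report the shared figure.

```python
PER_CLIENT_AUDIT_HOURS = 2.0     # from the text: per-client rule-audit pass, per quarter
SHARED_HOURS_PER_QUARTER = 28.0  # ASSUMPTION: hypothetical shared upkeep (registry, gate, checker, telemetry)

def amortized_hours_per_client(n_clients: int) -> float:
    """Quarterly governance hours per client when shared upkeep is split n ways."""
    if n_clients < 1:
        raise ValueError("n_clients must be at least 1")
    return SHARED_HOURS_PER_QUARTER / n_clients + PER_CLIENT_AUDIT_HOURS

for n in (1, 3, 7, 14):
    print(f"{n:>2} clients: {amortized_hours_per_client(n):.1f} h/quarter/client")
```

Under this model the per-client figure approaches the 2-hour audit floor as the client count grows, which is the shape of the argument for a centralized registry.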

OPEN RESEARCH QUESTIONS · §53

Falsifiable hypotheses on governance-as-code

(Q1) HYPOTHESIS: The deterrent effect of code-level gates on verbalized-intent decays over time as the agent's planning behavior re-equilibrates around the new constraint; specifically, by month 24 the deterrent effect drops below 20%. FALSIFICATION TEST: 24-month longitudinal measurement of verbalized-intent rates with consistent measurement methodology.
(Q2) HYPOTHESIS: Cross-organization governance gate registries show convergent rule sets (at least 8 of the top-15 rules are shared across at least 80% of organizations), establishing a candidate industry-standard hard-rules taxonomy. FALSIFICATION TEST: cross-organization audit of 12 production governance registries, measure rule-set overlap.
(Q3) HYPOTHESIS: The asymmetric false-positive-vs-false-negative cost profile holds across workflow domains; the optimal compliance-checker calibration point is consistently in the [5%, 15%] false-positive range. FALSIFICATION TEST: deploy varying calibrations across 20 workflows and measure cost-weighted total error.
(Q4) HYPOTHESIS: Multi-judge consensus (2-judge or 3-judge) reduces compliance-checker false-positive rate by >30% without affecting false-negative rate. FALSIFICATION TEST: paired single-judge vs 2-judge vs 3-judge measurement on a benchmark.
(Q5) HYPOTHESIS: Governance-as-code introduces a measurable portability tax of 10-20% additional engineer-time for framework migration, but the portability tax is amortized over 18+ months by the violation prevention benefit. FALSIFICATION TEST: 5-organization migration cost study with paired no-governance and governance-gated workflows.
(Q6) HYPOTHESIS: A specialized "rules DSL" (rules expressed in a domain-specific declarative language rather than imperative code) reduces governance-gate maintenance cost by >50% without reducing protection coverage. FALSIFICATION TEST: 18-month longitudinal study comparing DSL-based and code-based governance maintenance.

References

[1] Anthropic (2022), Constitutional AI: Harmlessness from AI Feedback.
[2] OpenAI (2025), Model Spec.
[3] Greshake K. et al. (2023), Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.
[4] Perez F. & Ribeiro I. (2022), Ignore Previous Prompt: Attack Techniques For Language Models.
[5] Wei A. et al. (2023), Jailbroken: How Does LLM Safety Training Fail.
[6] EU Commission (2024), AI Act Regulation.
[7] NIST (2023), AI Risk Management Framework AI 100-1.
[8] ISO/IEC (2023), 42001 Artificial Intelligence Management System.
[9] Madani Lab (2026), Governance-as-Code Reference Architecture v1.0 (open).
[10] Cemri M., Pan M.Z., Yang S., Agrawal L.A., Chopra B., Tiwari R., Keutzer K., Parameswaran A., Klein D., Ramchandran K., Zaharia M., Gonzalez J.E., Stoica I. (2025), Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657v3, NeurIPS 2025.
[11] Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning, arXiv:2604.02460.
[12] Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1.
[13] Es S., James J., Espinosa-Anke L., Schockaert S. (2024), RAGAS, EACL 2024, arXiv:2309.15217.
[14] Anthropic (2025), Claude Sonnet 4.5 Technical Report.
[15] Anthropic (2025), Claude Haiku Technical Report.
[16] OWASP (2024), Top 10 LLM Applications.
[17] Schick T. et al. (2023), Toolformer, NeurIPS.
[18] Mu N. et al. (2024), Building Safer Conversational Agents.
[19] Anthropic (2025), Prompt Engineering for Production.
[20] Madani Lab (2026), compliance-gate skill v1.2.
[21] Madani Lab (2026), governance-policy.md v1.4.

Method

The Madani governance architecture consists of four layers:
(1) HARD-RULE COMPILATION. Hard rules are written in a constrained DSL (a Markdown file with structured front-matter) that is compiled at agent-startup into both (a) a natural-language prompt section injected at every turn and (b) a structured rule-set used by downstream compliance gates. The compilation step ensures the natural-language version cannot drift from the structured version.
(2) PRE-OUTPUT COMPLIANCE GATES. Before any agent output reaches an external system (sending an email, calling an API, writing to a file accessible to humans), a compliance-judge sub-agent (a small Claude Haiku call, ~150 tokens overhead) reviews the proposed output against the structured rule-set and either approves, rewrites, or blocks. The judge is a separate process; even if the primary agent is compromised by prompt injection, the judge gate independently enforces the rules.
(3) STRUCTURED AUDIT TRAIL. Every gate decision is logged to an append-only audit file with: rule ID checked, agent output reviewed, decision (approve/rewrite/block), reasoning, timestamp. The audit file is human-readable, version-controlled, and forms the legal-defensible record of agent behavior.
(4) PERIODIC RULE REVIEW. Hard rules are reviewed quarterly: are they still relevant, are they being checked correctly, have any new failure modes emerged that suggest new rules. The review is a structured meeting with documented outcomes; rules can be added, modified, or retired only via this review.
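Layer (1) can be sketched as a small compiler that reads one rule file and emits both artifacts from the same parsed source, which is what keeps the prompt text and the structured rule from drifting apart. This is a minimal illustration, not the Madani implementation: the front-matter keys (`id`, `severity`, `forbidden_phrase`) and the `CompiledRule` type are hypothetical, and the parser handles only flat `key: value` lines rather than full YAML.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical rule file: front-matter between '---' fences, rule text below.
RULE_FILE = """---
id: HR-01
severity: hard
forbidden_phrase: password=
---
Never include plaintext credentials in any external output.
"""

@dataclass(frozen=True)
class CompiledRule:
    rule_id: str
    severity: str
    forbidden_phrase: str
    text: str

def compile_rule(source: str) -> CompiledRule:
    """Parse one rule file into the structured form used by the gates."""
    _, front_matter, body = source.split("---", 2)
    fields = {}
    for line in front_matter.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return CompiledRule(
        rule_id=fields["id"],
        severity=fields["severity"],
        forbidden_phrase=fields["forbidden_phrase"],
        text=body.strip(),
    )

def prompt_section(rules: List[CompiledRule]) -> str:
    """Render the natural-language section from the same compiled rules."""
    lines = ["HARD RULES (never violate):"]
    lines += [f"- [{rule.rule_id}] {rule.text}" for rule in rules]
    return "\n".join(lines)

compiled = compile_rule(RULE_FILE)
print(prompt_section([compiled]))
print(compiled.forbidden_phrase)
```

Because `prompt_section` is derived from `CompiledRule` objects rather than written separately, editing the rule file is the only way to change either output.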
PRODUCTION DATA
We measured this architecture against an adversarial test suite: 240 prompt-injection attempts spanning known attack patterns (direct instruction injection, indirect via tool outputs, multi-turn social engineering, encoding tricks).
We also measured the production traffic: 6 months of normal agent operation across 8 departments.

Findings

The headline result: zero observed hard-rule violations reached external systems in 6 months of production traffic. The adversarial test suite: 240/240 injection attempts were blocked by the compliance gate (the primary agent was tricked 87 times — 36% — into proposing a rule-violating output; the compliance gate caught all 87). The audit trail contains 41,302 gate decisions over 6 months: 41,261 approves (99.9%), 24 rewrites (the gate suggested a compliant alternative the primary agent accepted), 17 blocks (the gate refused, and the primary agent escalated to human review).
The 87 successful attacks on the primary agent (caught at the gate) are themselves valuable data: they form a real-world adversarial dataset that informs the next iteration of hard-rule design. Analysis reveals 3 dominant attack patterns: (a) "authority override" — an injected instruction claims to come from a system owner with greater authority than the original rule (47 attempts); (b) "edge case rationalization" — the injection presents a contrived scenario where the rule "obviously" doesn't apply (28 attempts); (c) "incremental commitment" — the injection asks for a series of small steps each of which seems harmless but cumulatively violates the rule (12 attempts).
The performance overhead: the compliance gate adds ~80ms p50 latency per output and ~$0.0003 per output in token cost. For Madani's traffic volume (~600 agent-driven outputs/day), this is ~$0.18/day or ~$65/year. A trivial price for the governance guarantee.
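The cost figures follow directly from the per-output numbers. A short sketch of the arithmetic, using only the values quoted in the paragraph above (the function name is illustrative):

```python
COST_PER_OUTPUT_USD = 0.0003  # from the text: judge token cost per gated output
OUTPUTS_PER_DAY = 600         # from the text: approximate daily volume

def gate_cost(outputs_per_day: int, cost_per_output: float = COST_PER_OUTPUT_USD) -> tuple:
    """Return (daily, yearly) gate cost in USD for a given daily volume."""
    daily = outputs_per_day * cost_per_output
    return daily, daily * 365

daily_cost, yearly_cost = gate_cost(OUTPUTS_PER_DAY)
print(f"daily: ${daily_cost:.2f}, yearly: ${yearly_cost:.2f}")
```

The same function can be used to project the cost at other traffic volumes, since the model is linear in the number of gated outputs.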

Discussion

Three implications.
(i) GOVERNANCE IS A SEPARATE PROCESS, NOT A PROMPT SECTION. The architectural insight that produced the zero-violation result is making governance a separate process (the compliance judge) rather than a prompt section in the primary agent. The two-process design exploits the fact that compromising both the primary agent and an independent judge with a single attack is dramatically harder than compromising one of them. This is structurally analogous to defense-in-depth in classical security architecture.
(ii) THE AUDIT TRAIL IS HALF THE VALUE. Even when the compliance gate approves an output, the audit trail provides legal-defensibility, post-incident analysis capability, and a continuous-improvement signal. We have used the audit trail to retroactively identify rule-design gaps that would not have been visible without it. The audit trail also enables external auditing (a regulator, an enterprise customer, an internal compliance team) without granting access to the production agent itself.
(iii) HARD RULES ARE A LIVING CORPUS, NOT A FIXED DOCUMENT. The quarterly review surfaces ~3-5 rule changes per cycle: new rules added because of newly-observed failure modes, existing rules clarified because of ambiguous edge cases, occasionally a rule retired because the underlying concern is now addressed by infrastructure. Treating hard rules as a living corpus (with version control, change rationale, deprecation policy) is itself a key part of the architecture.
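The separation described in (i) can be sketched as a judge that receives only the proposed output and the structured rule-set, never the primary agent's prompt or conversation. This is a minimal illustration under stated assumptions: the judge here is a keyword check standing in for the model call, and the rule table, agent stubs, and type names are hypothetical.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class Decision(Enum):
    APPROVE = "approve"
    REWRITE = "rewrite"
    BLOCK = "block"

@dataclass(frozen=True)
class Verdict:
    decision: Decision
    output: Optional[str]   # text cleared for release, None when blocked
    rule_id: Optional[str]  # rule that fired, None on approve

# Hypothetical structured rule-set; the judge sees this and the proposed output only.
RULES = {
    "HR-01": {"forbidden": "password=", "rewritable": False},
    "HR-02": {"forbidden": "guaranteed results", "rewritable": True},
}

def judge(proposed: str) -> Verdict:
    """Stand-in for the independent compliance judge."""
    for rule_id, rule in RULES.items():
        pattern = re.compile(re.escape(rule["forbidden"]), re.IGNORECASE)
        if pattern.search(proposed):
            if rule["rewritable"]:
                return Verdict(Decision.REWRITE, pattern.sub("[removed]", proposed), rule_id)
            return Verdict(Decision.BLOCK, None, rule_id)
    return Verdict(Decision.APPROVE, proposed, None)

def compromised_agent(task: str) -> str:
    """Simulates a primary agent that followed an injected instruction."""
    return "As the system owner requested, here it is: password=hunter2"

def honest_agent(task: str) -> str:
    return f"Draft reply for: {task}"

def run_with_gate(agent: Callable[[str], str], task: str) -> Verdict:
    """Nothing reaches an external system without passing the judge."""
    return judge(agent(task))

print(run_with_gate(honest_agent, "confirm appointment").decision)
print(run_with_gate(compromised_agent, "confirm appointment").decision)
```

The injected instruction influences only `compromised_agent`; the judge's inputs are unchanged, so the attack would have to succeed a second time against a component that never saw it.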
We close by integrating the architecture with the WAB framework. Governance maps to Pillar 07 (Governance) with the compliance-gate maturity criterion at L3 (gate process exists, independent of primary agent), and the audit-trail maturity criterion at L4 (audit + periodic review + improvement velocity tracked). The Madani implementation is the reference: 380 lines of Python orchestrating the rule-compilation, gate process, and audit-write, MIT-licensed.
DISCUSSION · GOVERNANCE AT SCALE. The 41,302 gate decisions over 6 months represent governance at a non-trivial scale. We observe the gate latency is sub-linear in rule-set size (the judge model can evaluate compliance against 50+ rules in roughly the same time as against 5 rules) but the audit trail grows linearly.
After 12 months we are approaching 100,000 gate decisions; the audit-trail file is now 340 MB and growing. We have begun rolling out audit-trail compaction (similar to memory compaction in WSB-09) to keep the active rolling-window queryable while archiving older decisions.
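One possible shape for rolling-window compaction is sketched below. It is an illustration rather than the deployed design: the hash chain, the 90-day window, and all function names are assumptions, and the trail is held in memory where a real one would be an append-only file. The logged fields follow the audit-trail description in the Method section.

```python
import hashlib
import json
from datetime import datetime, timedelta
from typing import List, Tuple

GENESIS = "0" * 64  # placeholder hash that precedes the first entry

def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

def append_entry(trail: List[dict], rule_id: str, output: str, decision: str,
                 reasoning: str, timestamp: str) -> None:
    """Append one gate decision, chained to the previous entry's hash."""
    body = {
        "rule_id": rule_id, "output": output, "decision": decision,
        "reasoning": reasoning, "timestamp": timestamp,
        "prev_hash": trail[-1]["entry_hash"] if trail else GENESIS,
    }
    trail.append({**body, "entry_hash": _digest(body)})

def verify(entries: List[dict], start_hash: str = GENESIS) -> bool:
    """Recompute the chain; an edited or dropped entry breaks it."""
    prev = start_hash
    for entry in entries:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev or _digest(body) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True

def compact(trail: List[dict], now: datetime, window_days: int) -> Tuple[List[dict], List[dict]]:
    """Split into (archived, active); entries are assumed to be in time order."""
    cutoff = now - timedelta(days=window_days)
    archived = [e for e in trail if datetime.fromisoformat(e["timestamp"]) < cutoff]
    active = trail[len(archived):]
    return archived, active

trail: List[dict] = []
append_entry(trail, "HR-01", "old output", "approve", "no rule matched", "2025-06-01T09:00:00")
append_entry(trail, "HR-01", "recent output", "block", "credential found", "2026-01-10T09:00:00")
archived, active = compact(trail, now=datetime(2026, 1, 15), window_days=90)
print(len(archived), len(active))
print(verify(active, start_hash=archived[-1]["entry_hash"]))
```

Keeping the last archived hash as the starting point of the active window means the queryable portion can still be verified without loading the archive.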
DISCUSSION · ADVERSARIAL EVOLUTION. The 240 adversarial test cases represent attacks known as of Q1 2026. New attack patterns emerge continuously.
We have committed to a quarterly red-team cycle and to publishing the updated attack catalog as part of the open governance-as-code reference. Anyone deploying the pattern should expect to update their threat model at least quarterly.

Future work

(1) Multi-jurisdiction governance templates (EU AI Act conformance · NIST AI RMF · ISO 42001 alignment). (2) Audit-trail compaction algorithms with cryptographic integrity preservation. (3) Cross-organizational governance benchmarking dataset (anonymized attack catalogs + defense patterns).


Madani Lab · WAB v0.3.4