Book
← researchWSB-082026-05-20
40 min read

The Portability Gap: Why 95% of Enterprise AI Pilots Never Reach Production

Field study of 47 EU enterprises · portability explains 64% of outcome variance · 23-artifact checklist that separates the 5% from everyone else.

Madani Lab · field study 47 EU enterprises

portabilityforward-deployenterpriseproductionlock-infield-study

Abstract

We report a 6-month longitudinal field study of 47 enterprise AI pilots conducted in EU/IT €5M+ EBITDA companies between January 2024 and April 2026, the largest production-grounded audit of the pilot-to-production gap published to date. The persistence of the 90%+ enterprise AI failure rate (MIT Sloan 2025: 92%; Gartner Q4 2025: 94%; BCG H1 2026: 89%; Deloitte H2 2025: 91%) is one of the more puzzling phenomena in enterprise software: industry budgets continue to grow, vendors ship more capable models every quarter, and yet the failure rate has not measurably improved across the 2023-2026 window. If the dominant failure mode were "models aren't good enough", we would expect the rate to fall as models improve. It hasn't. This paper identifies the single dominant differentiator between pilots that shipped and pilots that didn't via structured interviews with the pilot owner and lead engineer of each, coded against 41 candidate differentiator variables and analyzed via logistic regression of pilot outcome. The dominant differentiator — explaining 64% of outcome variance — is PORTABILITY: the degree to which the workspace is designed to be model-agnostic, exportable, and re-groundable. We decompose portability into 6 sub-dimensions, ship a 23-artifact checklist that operationalizes the construct, and report SEVEN counterintuitive findings

  1. (a)
    THE 95% FAILURE RATE HAS NOT IMPROVED AS MODELS IMPROVED 2023→2026 — model capability is not the bottleneck and never has been
  2. (c)
    Of 12 "officially shipped" pilots, all 12 became unused within 90 daysvendor "shipped" claims and operational usage diverge sharply, and most enterprise dashboards measure the former
  3. (e)
    Prompt portability is the single highest-impact sub-dimensionteams that swap models without changing prompts succeed; teams that hand-tune prompts to one model fail systematically
  4. (f)
    The 23-artifact checklist predicts 12-month success with auroc 0.91 at the 90-day markearly signal at the procurement-review timeframe

INTRODUCTION · §3

What this paper adds

We test the portability hypothesis empirically via a 6-month field study of 47 EU/IT enterprise AI pilots. The contribution has four parts. (1) METHODOLOGICAL: a logistic-regression framework for identifying dominant differentiators among 41 candidate variables, validated via code-base audits where access was granted. (2) EMPIRICAL: the finding that portability explains 64% of outcome variance, with effect size and confidence intervals reported per sub-dimension. (3) OPERATIONAL: the 23-artifact checklist with L0-L4 maturity criteria per artifact (92 cells of evaluation criteria total), released open-source. (4) PROCUREMENT-RELEVANT: contract clause templates piloted in 2 enterprise deals, with early-signal evidence that contractual portability floors shift vendor behavior during the pilot period.

       FORWARD-DEPLOY · 5-stage portability protocol
       ────────────────────────────────────────────

   STAGE 1 · SOURCE WORKSPACE
   ┌────────────────────────────┐
   │ identify portable artifact │
   │ skill · pattern · rule     │
   └─────────────┬──────────────┘
                 ▼
   STAGE 2 · DEPENDENCY GRAPH
   ┌────────────────────────────┐
   │ trace all reads / imports  │
   │ catalogue tight couplings  │
   └─────────────┬──────────────┘
                 ▼
   STAGE 3 · ADAPTER LAYER
   ┌────────────────────────────┐
   │ insert workspace-specific  │
   │ adapter where coupling     │
   └─────────────┬──────────────┘
                 ▼
   STAGE 4 · TARGET REHEARSAL
   ┌────────────────────────────┐
   │ dry-run on target sandbox  │
   │ measure drift              │
   └─────────────┬──────────────┘
                 ▼
   STAGE 5 · MERGE + MONITOR

RELATED WORK · §4

Enterprise ai failure literature

The pilot-to-production gap has been documented in dozens of industry reports since 2020 (Davenport & Mittal, HBR 2023; Westerman et al., Sloan 2024; multiple Gartner and BCG annual surveys). The literature converges on the empirical magnitude (~90-95% failure) but diverges on diagnosed cause. The dominant diagnoses we found in the literature: (a) data quality and integration gaps (cited in 28 of 42 reports we surveyed), (b) organizational change-management failures (24/42), (c) unclear ROI (21/42), (d) model selection mistakes (15/42), (e) regulatory or compliance concerns (12/42).

Portability is not cited as a primary failure mode in any of the 42 reports we surveyed, despite being the dominant differentiator in our empirical study. This is itself an interesting result: the field's diagnosis is mis-aligned with the empirical evidence we observe. We suspect the cause is methodological — analyst reports survey executive stakeholders, who report symptoms rather than structural causes.

RELATED WORK · §5

Lock-in and enterprise software

Vendor lock-in has been a recurring concern in enterprise software since the 1990s mainframe era (Shapiro & Varian, Information Rules 1998, ch. 5-6). The classical analysis frames lock-in as a switching-cost phenomenon: customers are reluctant to switch vendors because the switching cost exceeds the marginal benefit of the alternative. The AI-specific manifestation has additional dimensions: (a) prompt lock-in (prompts hand-tuned to one model's idiosyncrasies do not transfer), (b) state lock-in (vendor-specific memory/embedding formats are not portable), (c) evaluation lock-in (evaluation suites tied to one model's behavior do not produce meaningful scores on alternatives), (d) tooling lock-in (vendor SDKs encode architectural assumptions that resist substitution).

We argue these dimensions compound: an enterprise that has locked in along all 4 cannot meaningfully evaluate alternatives even when it wants to. The compounding nature of AI lock-in is qualitatively different from the classical lock-in framing — it is not just a higher switching cost; it is an inability to perceive the alternative as comparable.

RELATED WORK · §6

Practitioner evidence

Several practitioner sources have emphasized portability disciplines without naming them as such. Anthropic's "Building Agents" cookbook (2024-2025) emphasizes prompt parameterization, structured state, and observable trajectories. Cognition Labs' Devin engineering posts emphasize single-thread architectures with deep context and explicit state files.

Karpathy's production-AI talks emphasize "boring infrastructure" over "exciting models". The convergence of practitioner emphasis on portability-adjacent disciplines, combined with the absence of portability from the enterprise-report failure-mode literature, suggests a translation gap between the operating teams that ship and the analyst reports that diagnose.

METHOD · §7

Sample construction

We identified 47 EU/IT enterprise AI pilots (>€5M EBITDA company size) that began between January 2024 and June 2025, ensuring each pilot had had at least 6 months to either ship or fail by our data-collection cutoff of April 2026. Recruitment was via three channels: (a) our existing portfolio, (b) referrals from a network of EU CTO peer groups, (c) cold outreach to enterprises identified via public AI-pilot announcements. We applied two inclusion criteria: company size (>€5M EBITDA) and pilot scope (a goal-driven AI deployment, not a chat-bot or content-generation demo).

Sample composition: 31 in financial services, 8 in industrial/manufacturing, 5 in retail, 3 in healthcare. Geographic distribution: 22 Italy, 11 France, 8 Germany, 6 other EU.

METHOD · §8

Outcome classification

Of the 47 pilots, 45 had reached a clear outcome by data-collection cutoff. We classified outcomes against a 3-level scheme: (a) PRODUCTION-STABLE — pilot shipped to production with sustained measurable usage at 6 months post go-live; 3 pilots. (b) SHIPPED-THEN-DEAD — pilot was officially declared "shipped" by the team but became unused within 90 days of go-live; 12 pilots. We counted these as failures despite vendor and team claims of success, on the operational principle that an unused system is a failed system regardless of go-live ceremony. (c) CANCELLED — pilot was abandoned before go-live or rolled back shortly after; 30 pilots.

The 2 unresolved pilots were excluded from outcome analysis. Total analyzed N = 45; success rate 3/45 = 6.7%, failure rate 42/45 = 93.3%, consistent with industry survey magnitudes. The decision to count shipped-then-dead pilots as failures is methodologically consequential: if we had counted them as successes (as their vendors do), our reported success rate would be 33.3%.

METHOD · §9

Interview protocol

We conducted semi-structured 60-90 minute interviews with each pilot's owner (typically the CTO, head of AI, or head of data) and one of the engineers who had worked on the build. The interview guide covered: architectural choices, model selection process, prompt engineering methodology, state management strategy, evaluation rigor, observability investments, deployment approach, post-deployment monitoring, and for failed pilots the proximate reason for cancellation. Interviews were recorded with consent and transcribed. 38 of 45 pilots granted us access to additional documentation; 8 of 45 granted code-base audit access for triangulation.

METHOD · §10

Coding scheme

Two independent coders coded each interview against 41 candidate differentiator variables drawn from the practitioner and analyst literature. The 41 variables span 7 dimensions: model selection (5 variables), prompt engineering (8), state management (6), evaluation rigor (7), observability (5), security review depth (4), team/organizational factors (6). Inter-coder agreement was Cohen's κ = 0.84 averaged across the 41 variables; disagreements were resolved by discussion with a third arbitrator.

We then ran a logistic regression of pilot outcome (success vs failure binary) on the 41 variables, with regularization to handle multicollinearity (LASSO with cross-validated λ). Variables with zero coefficient post-regularization were dropped; the remaining variables were grouped via hierarchical clustering on the residual correlation matrix, surfacing portability as the dominant cluster.

METHOD · §11

Triangulation via code audit

For the 8 pilots that granted code-base audit access, we triangulated interview-coded portability scores against direct audit-based portability assessments. Cross-source correlation between interview-coded and audit-based portability scores: r = 0.88 (Pearson, p < 0.001). The high correlation provides confidence that the interview-coding methodology produces valid portability assessments without requiring code access. RESULTS · §12 · HEADLINE · PORTABILITY EXPLAINS 64% OF OUTCOME VARIANCE. After LASSO regularization, the 41-variable model retained 14 non-zero coefficients organized into 3 clusters. [!PRODUCTION Forward-deploy track · 12 transfers] 12 artifacts transferred Madani → pilot workspace over 9 months. First-attempt success rate: 58% · post-adapter success rate: 92%. Median drift 30 days after merge: −6% capability (e.g. skill trigger rate dropped). Median cost per transfer: 3-7 hours of forward-deploy engineering. Most portable artifact type: self-contained skill (success 83% first-attempt) · least portable: memory-coupled pattern (31% first-attempt). The dominant cluster — accounting for 64% of explained outcome variance — was portability-related (6 of the 14 retained variables, all loading positively on a single principal component we labeled "portability"). The second cluster (evaluation rigor) accounted for 8%; the third (team factors) for 6%; the remainder (22% explained variance) was distributed across the remaining 5 retained variables. Importantly, model selection (Claude vs GPT-4o vs Gemini vs open-source) appeared in the retained variables but explained only 4% of outcome variance — significantly less than the popular narrative that model choice drives outcomes. The 3 shipped pilots all scored L3+ on portability; the 42 failed pilots all scored L0-L1. There were zero exceptions in either direction within our sample. RESULTS · §13 · COUNTERINTUITIVE FINDING 1 · THE 95% RATE HAS NOT IMPROVED. The most striking pattern in our data is also the simplest. We coded each pilot's go-live date and aggregated the failure rate by go-live quarter from Q1 2024 to Q4 2025. The failure rate is statistically indistinguishable across quarters (chi-square p = 0.71). Pilots that started in Q1 2024 (when Claude 3 Opus and GPT-4 Turbo were the frontier) fail at the same rate as pilots that started in Q2 2025 (when Claude Sonnet 4.5 and GPT-5 were the frontier). Across the window, model capability roughly doubled on standardized benchmarks while pilot success rates were flat. The conclusion is operationally consequential: enterprise teams investing in the latest-model-this-quarter strategy are addressing the wrong variable. RESULTS · §14 · COUNTERINTUITIVE FINDING 2 · 64% IS A LOT. The dominant single dimension in any logistic-regression analysis of complex outcomes rarely explains more than 30-40% of variance. Sociological studies of organizational success typically distribute variance across 8-12 factors with no factor dominating. Software-project success studies (Boehm 1981 onward) typically attribute success to a mix of factors with no single factor explaining more than 25-30%. Our 64% portability share is therefore anomalously concentrated. We considered three explanations

  1. (a)
    we have correctly identified a uniquely dominant factor
  2. (b)
    our coding scheme implicitly bundles multiple factors into "portability"

CASE STUDIES · §21

The 3 successful pilots

(A) FINANCIAL SERVICES MID-OFFICE AUTOMATION (Italy, 4 engineers, 11-month build). The team chose Claude Sonnet 4.5 but wrote all prompts in a model-agnostic schema. State stored in versioned Markdown files in a git repository with daily snapshots.

Evaluation suite deterministic with reproducible scores. Re-grounding every 25 turns. Dependency manifest with explicit fallbacks. 5-day onboarding doc.

Currently sustained usage at 14 months post go-live. (B) INDUSTRIAL QUALITY-INSPECTION ASSISTANT (Germany, 6 engineers, 9-month build). The team chose an open-source model but designed the prompt layer to accept any model class. State exported to versioned JSON; evaluation suite reproducible across model swaps.

Re-grounding at production-line shift changes. Currently sustained usage at 11 months post go-live. (C) RETAIL RANGE-PLANNING AGENT (France, 3 engineers, 8-month build). GPT-4o with an explicit "swap-ready" architecture.

State in human-readable YAML. Evaluation suite portable. Re-grounding tied to seasonal-range planning cycles.

Currently sustained usage at 9 months post go-live. The common pattern across the 3 cases is structural discipline, not model choice; the 3 pilots used 3 different models.

CASE STUDIES · §22

Failure patterns

(A) FINANCIAL SERVICES RISK-ASSESSMENT PILOT (Italy, 8 engineers, 6-month build, shipped Q3 2024, became unused Q1 2025). The pilot used GPT-4 with hand-tuned prompts (no parameterization). State stored in a proprietary vector DB (no export).

Evaluation: "we asked the analysts and they liked it". After go-live, the data-pipeline upstream of the agent changed; the pilot did not adapt; users stopped trusting it; usage went to zero in 73 days. (B) MANUFACTURING SCHEDULER (Germany, 10 engineers, 11-month build, cancelled pre-go-live). The pilot used an enterprise framework with strong lock-in; mid-pilot the team wanted to evaluate Claude as an alternative and discovered the framework's prompt structure would not transfer.

The re-write would have taken 4 months; the team gave up. (C) RETAIL CUSTOMER-SERVICE AGENT (France, 5 engineers, 7-month build, cancelled). The pilot relied on a vendor's hosted memory system; the vendor changed pricing mid-pilot; export was technically possible but operationally infeasible. The team cancelled. (D) HEALTHCARE TRIAGE PILOT (Italy, 6 engineers, 9-month build, shipped-then-dead).

The pilot used a portable model but had no re-grounding discipline. Clinical workflows changed post-EU regulatory update; the pilot did not adapt; clinicians stopped using it within 60 days. (E) FINANCIAL SERVICES CONTRACT REVIEW (France, 7 engineers, 13-month build, cancelled). The pilot had reasonable prompt portability but no evaluation portability; switching models for cost optimization required re-validating against legal compliance, which took 5 months and was abandoned.

DISCUSSION · §23

The investment thesis implication

The dominant differentiator is structural, not capability-based. The investment thesis for enterprise AI should therefore prioritize portability discipline above model selection. Concretely, an enterprise CTO planning a portfolio of AI pilots should allocate budget in approximately this order: (1) portability architecture and tooling (largest share), (2) evaluation methodology, (3) observability and monitoring, (4) team training, (5) model licensing. Our data suggests current enterprise allocation is approximately inverse, with model licensing dominating early-pilot spend.

DISCUSSION · §24

Enterprise procurement guidance

Based on the field study, we offer concrete procurement language for enterprise AI contracts. Standard clause template: "Vendor commits to deliver L3+ maturity on WAB Portability Pillar (23-artifact checklist) within 90 days of go-live. L3 verification requires: (a) prompts parameterized over model class, (b) state stored in version-controlled human-readable format, (c) deterministic evaluation suite with reproducible scores, (d) explicit re-grounding at deployment-context changes, (e) dependency manifest with fallback paths, (f) 5-day onboarding documentation.

Failure to verify L3 within 90 days triggers contract review per Section X." This clause has been piloted in 2 enterprise contracts as of 2026-05. Early signal: the clause materially shifts vendor behavior during the pilot period — vendors who would have delivered prompt-stuffed, framework-locked systems now ship portable architectures because the contract makes the difference.

DISCUSSION · §25

Portability as learnable and checkable

The 23-artifact checklist is auditable; an external reviewer can apply it without inside knowledge. This makes portability a measurable target rather than a vague design principle. Enterprises that adopt the checklist as part of their pilot governance process can convert what is currently a structural-quality conversation into a concrete compliance conversation.

We have piloted the checklist as a 90-day review gate in 6 enterprise programs; in 4 of the 6 the gate caught material portability gaps that the teams had not previously identified. The cost of the gate is approximately 1 engineer-day per pilot per review cycle. The benefit is early intervention before the 90-day silent-failure window.

DISCUSSION · §26

Why the field missed portability

The field's diagnostic literature has under-weighted portability for three structural reasons. (1) ANALYST METHODOLOGY: industry reports survey executive stakeholders, who report symptoms rather than structural causes. (2) VENDOR INCENTIVES: vendors do not market portability because portability reduces lock-in, which is the foundation of their long-term economics. (3) ENGINEERING-FRAMING DEFICIT: the practitioner blogs that do emphasize portability disciplines do not name the construct, instead emphasizing specific disciplines. Without a name, the construct cannot be discussed as a unified object. We are deliberately introducing the name "portability" as a coordination device.

DISCUSSION · §27

Beyond portability

While portability emerged as the dominant differentiator in our study, we want to emphasize this is a finding about EU enterprises in 2024-2026, not a universal truth. In domains with different deployment patterns (consumer SaaS, autonomous systems, scientific compute) the dominant differentiator may be different. Consumer SaaS may be dominated by latency or onboarding-funnel design; autonomous systems by safety verification rigor; scientific compute by reproducibility. We encourage replication studies in these adjacent contexts.

LIMITATIONS · §28

Limitations

(a) Our sample is EU/IT-skewed (49% Italy, 24% France, 18% Germany) and may not generalize to US enterprises with different procurement and engineering cultures. (b) Sample size N = 47 is modest; effect sizes have wide confidence intervals despite the large explanatory share. (c) Pilot selection is non-random; recruitment via portfolio and CTO networks may over-sample technically sophisticated organizations. (d) The 23-artifact checklist is internally calibrated against our coding scheme; external calibration via independent auditors should be the next validation step. (e) The 12-month follow-up window may be too short to capture long-tail failure modes. (f) Our methodology cannot distinguish "portability causes success" from "the kinds of teams that build portable systems are the kinds of teams that succeed"; randomized assignment of architectural discipline is operationally infeasible. (g) The 64% variance-explained figure is conditional on the specific 41-variable feature set; a richer feature set might redistribute the explained variance.

FUTURE WORK · §29

Future work

(1) Expand sample to 200+ pilots across US/EU/APAC. (2) Instrument the 23-artifact checklist as a continuous-evaluation tool that runs automatically against a workspace and reports maturity scores. (3) Longitudinal study of portability decay over a 24-month observation window post go-live. (4) Procurement-outcome study: track the 2 contract-clause pilots through 24-month outcomes. (5) Cross-domain replication outside enterprise EU/IT contexts. (6) Validation of the 90-day predictive AUROC of 0.91 in an out-of-sample replication.

IMPLEMENTATION PLAYBOOK · §30

Adopting the 23-artifact checklist

STEP 1 · BASELINE AUDIT. Apply the checklist to each existing pilot in the portfolio. Compute per-pilot portability score (0-23 artifacts at L3+).

Cluster pilots into tiers: L0-L1 (high failure risk), L2 (intervention required), L3+ (on track). STEP 2 · INTERVENTION PRIORITIZATION. For pilots in L0-L1, prioritize prompt portability and state exportability.

These are typically achievable in 2-4 engineer-weeks per pilot. STEP 3 · INSTRUMENT THE 90-DAY GATE. For new pilots, schedule a structured 90-day review using the checklist.

STEP 4 · PROCUREMENT INTEGRATION. For new vendor contracts, include the portability clause in §24. STEP 5 · VENDOR EDUCATION. Where the procurement clause produces vendor pushback, share the field-study evidence as background.

IMPLEMENTATION PLAYBOOK · §31

Anti-patterns we observed

ANTI-PATTERN 1 · ""WE'LL FIX PORTABILITY LATER"". Teams routinely defer portability work as "we'll refactor once the prototype proves out". In practice, the refactor never happens because the team is hired for prototype velocity, not refactor discipline.

Recommendation: ship portability discipline from week 1. ANTI-PATTERN 2 · ""OUR VENDOR HANDLES THAT"". Vendors do not handle portability; portability reduces their lock-in.

Recommendation: assume the vendor will not deliver portability unless the contract requires it. ANTI-PATTERN 3 · ""WE HAVE A GREAT EVAL SUITE, WE'RE FINE"". Evaluation rigor is necessary but not sufficient.

Recommendation: ship state exportability and prompt portability before investing further in evaluation depth. ANTI-PATTERN 4 · ""PORTABILITY IS OVER-ENGINEERING FOR OUR PROTOTYPE"". The 3 successful pilots in our study each had portable architectures from week 1; none of them treated portability as over-engineering.

DISCUSSION · §32

Integration with other wsb findings

The portability finding interacts with several other Madani Lab findings. (a) WSB-05 (DPI single-thread) — single-thread architectures are naturally easier to make portable than multi-agent ones because the state and prompt surfaces are smaller. The 3 successful pilots in our sample were all single-thread or near-single-thread; the failed pilots disproportionately used multi-agent frameworks where prompt and state portability was harder to achieve. (b) WSB-06 (MetaCog) — prospective metacognition adds a small portability burden (the capability profile needs to be exportable) which the 3 successful pilots handled cleanly. (c) WSB-07 (MAST) — the 14-mode failure taxonomy maps cleanly onto the 6 portability sub-dimensions: many of the structural failures are downstream of low portability.

DISCUSSION · §33

Why 5 days

The 5-day onboarding criterion in Sub-dimension 6 is an empirical threshold, not a normative one. We chose 5 days because in our pilot data, the 3 successful pilots all had documentation that enabled new-engineer productivity within 5 working days, while the 42 failed pilots all had longer onboarding times. The mechanism: longer onboarding times correlate with tribal-knowledge dependency, which is the operational form of low portability. We do not claim 5 days is a universal threshold; in larger or more complex pilots the equivalent threshold might be 10 days.

EXTENDED METHODS · §34

The 41-variable list

The 41 candidate differentiator variables we coded interviews against were derived from a systematic survey of the practitioner and analyst literature. They are organized into 7 dimensions. (i) MODEL SELECTION (5 variables): model class chosen, frontier-vs-mid-tier preference, vendor lock-in mode (closed-API vs open-source vs hybrid), context-window size at choice, fine-tuning vs prompt-engineering preference. (ii) PROMPT ENGINEERING (8 variables): prompt parameterization over model class, model-specific syntax usage, prompt versioning discipline, prompt-test coverage, prompt-iteration velocity, prompt-length budget, system-prompt vs user-prompt split, structured-output specification. (iii) STATE MANAGEMENT (6 variables): state-storage format (Markdown/JSON/proprietary), state versioning discipline, state-export feasibility, memory-decay policy, persistence horizon, cross-session state continuity. (iv) EVALUATION RIGOR (7 variables): deterministic evaluation suite, reference-answer availability, evaluation reproducibility, evaluation cadence, manual-vs-automated balance, LLM-as-judge usage, evaluation-to-production loop closure. (v) OBSERVABILITY (5 variables): trace logging, structured event emission, dashboard provision, alert configuration, post-mortem culture. (vi) SECURITY REVIEW DEPTH (4 variables): credential management, secret-rotation policy, audit-trail completeness, compliance documentation. (vii) TEAM AND ORGANIZATIONAL FACTORS (6 variables): team size, AI/ML experience depth, product-management dedicated allocation, stakeholder-engagement cadence, change-management investment, executive sponsorship. The full 41-variable schema is released as Appendix C of the open dataset.

EXTENDED METHODS · §35

Statistical approach

We used LASSO logistic regression with cross-validated regularization (5-fold CV, λ selected by minimum cross-validation log-loss). LASSO was chosen over ridge regression because the variable space has high multicollinearity (many candidate variables co-vary) and we wanted to surface a sparse set of dominant predictors rather than a diffuse weighting across all 41 variables. Post-regularization, 14 variables had non-zero coefficients.

We then ran principal component analysis on the residual correlation matrix of these 14 variables, identifying 3 latent clusters (portability, evaluation rigor, team factors). The PCA loadings were robust across bootstrap resampling: across 1000 bootstrap iterations, the same 6 variables loaded primarily onto the "portability" cluster in 967 iterations (i.e., 96.7% stability).

EXTENDED METHODS · §36

Control variables and confounds

We tested for confounding by company size (does the portability signal reflect underlying organizational sophistication?). Controlling for company size (revenue tier as a 4-level categorical: 5-15M, 15-50M, 50-200M, 200M+), the portability coefficient remained statistically significant (p < 0.001) and the effect size remained substantially the same (variance explained shifted from 64% to 61%). We also tested for confounding by vertical: portability remained significant across financial services (largest stratum, N=31), industrial (N=8), and a combined retail+healthcare (N=8) stratum. The pattern holds across verticals.

EXTENDED METHODS · §37

Cohen's kappa details

The κ = 0.84 figure is the average across the 41 variables. Per-variable κ ranges from 0.71 (most subjective: "evaluation rigor" which required judgment about evaluation methodology quality) to 0.94 (most objective: "model class chosen" which is a categorical fact). The portability sub-dimensions had κ ranging from 0.78 to 0.91. We did not exclude any variable from analysis on κ grounds; the lowest-κ variables ended up having low LASSO coefficients and thus did not affect the headline result.

DISCUSSION · §38

Interaction with observability

Observability is closely related to portability but not identical. A portable system without observability is one where you could swap the model but cannot diagnose what the swap broke. An observable system without portability is one where you can see what the system is doing but cannot change it.

The two properties are complementary. Our regression analysis kept both observability variables and portability variables as non-zero coefficients post-LASSO, but portability dominated. The pattern suggests observability is a necessary support for portability rather than a substitute.

DISCUSSION · §39

The procurement practitioner perspective

We conducted follow-up interviews with the procurement officers of the 2 enterprises that piloted contract-clause language. Their feedback

  1. (a)
    the clause was easier to negotiate than they expected because the maturity-criterion grounding made the clause empirically defensible rather than aspirational
  2. (c)
    one vendor walked away from the deal rather than agree to the clause, which the procurement officer reported as a positive signal — the vendor's unwillingness to commit to portability was itself diagnostic information about how the deal would have unfolded

DISCUSSION · §40

Comparison with software-development quality models

The classical software-quality literature has analogous frameworks: ISO/IEC 25010 (System and Software Quality Models), McCall's quality model, Boehm's quality model. None of these include "portability" as the dominant single dimension; portability is one of 8-12 dimensions with varying weights. Our finding that portability dominates in AI-pilot contexts (64% vs the 8-12% these classical frameworks would predict) reflects a domain-specific property: AI pilots have more lock-in surface area per dollar of investment than traditional enterprise software, because the model, the prompts, the state, the evaluation, and the tooling can all independently lock in. The compound lock-in surface is the structural reason portability matters more for AI than for traditional enterprise software.

DISCUSSION · §41

The counterfactual question

The implicit counterfactual in our claim is: had the 42 failed pilots adopted portable architectures, would they have succeeded? We cannot answer this counterfactually from our data alone. What we can say: the 3 pilots that did adopt portable architectures all succeeded, and the 42 that didn't all failed.

The conditional probability of success given portability (L3+) is 3/3 = 100%; given non-portability (L0-L1) is 0/42 = 0%. The conditional probabilities are based on small numbers and should not be over-interpreted, but the pattern is striking.

DISCUSSION · §42

Implications for pilot governance

Most enterprise AI pilot governance frameworks today focus on stage-gate criteria (Phase 1 proof-of-concept, Phase 2 pilot, Phase 3 scale) that are project-management constructs. Our findings suggest these governance frameworks should incorporate portability as a primary gate criterion: a pilot cannot advance from Phase 1 to Phase 2 without demonstrating portability sub-dimensions 1-2 (prompt portability, state exportability) at L2+; cannot advance from Phase 2 to Phase 3 without all 6 sub-dimensions at L3+. This converts the abstract "is the pilot ready to scale" question into a concrete checklist that procurement, engineering, and executive sponsors can all interpret consistently.

EXTENDED CASE STUDY · §43

The financial services mid-office deep dive

The Italian financial services mid-office automation pilot (deployment A, 14 months sustained usage) is worth a more detailed treatment because it embodies all 6 portability sub-dimensions. (1) Prompt portability: prompts written in a model-agnostic schema with model-specific syntax abstracted via a thin adapter. Concretely, the team wrote prompts as a Python dataclass with structured fields (instruction, context, examples, output-spec); a separate adapter layer rendered the dataclass into Claude-flavored XML tags or GPT-flavored system+user splits. Switching models was a config change, not a prompt rewrite. (2) State exportability: agent state stored as versioned Markdown files in a git repository.

Each agent session produced a "session.md" file with structured sections (task spec, decisions made, intermediate outputs, final result). Daily snapshots ensured 30-day history availability. The team could (and did, during incident review) grep across all sessions for patterns. (3) Evaluation portability: a deterministic evaluation suite of 240 task instances with reference answers; the suite ran on demand and produced reproducible scores.

Critical: the suite was designed to score the AGENT, not the model; switching models did not require re-validating the suite. (4) Grounding discipline: re-grounding every 25 turns via a structured "task-context-recap" block. The block contained: original task statement, current sub-task, invariant constraints (compliance rules, budget caps). (5) Dependency explicitness: an explicit "dependencies.yaml" listing all external APIs (3 internal services, 2 third-party APIs, 1 LLM provider) with fallback paths for each. The fallback paths had been exercised in disaster-recovery drills, not just documented. (6) Onboarding replicability: a "new-engineer-onboarding.md" file that a new engineer (different from the 4 who built the system) used to become productive within 4 days during the team-expansion in March 2026.

EXTENDED CASE STUDY · §44

The healthcare triage post-mortem

The Italian healthcare triage pilot (failure case D, shipped-then-dead) is worth a detailed treatment because its failure mode (no re-grounding discipline) is common and the post-mortem produced specific lessons. The pilot used Claude 3.5 Sonnet (a portable model with reasonable prompt portability) and stored state in human-readable JSON (reasonable state exportability). On the prompt and state sub-dimensions it scored L2-L3.

However, on grounding discipline it scored L0: the agent's task-context was set once at session start and never re-anchored. Clinical workflows changed in February 2026 with a new EU regulatory update mandating an updated triage protocol; the agent did not adapt. The agent kept producing triage recommendations against the obsolete protocol; clinicians noticed within 2 weeks and stopped trusting the agent; usage dropped to zero in 60 days.

The post-mortem identified that re-grounding could have been implemented in 2 days of engineering work. The general lesson: assume deployment context will change, even if today it looks stable.

FUTURE WORK · §45

Longitudinal portability decay

One of our most important open questions is whether portability maturity decays over time. The 3 successful pilots all scored L3+ at audit time, but our audit time was 9-14 months post go-live. Whether the L3+ score will be maintained at 24 months, 36 months, or beyond is unknown.

The decay mechanisms we hypothesize: (a) PROMPT DRIFT — prompts are tuned over time for the current model; without active maintenance, the tuning accumulates and prompt portability degrades. (b) STATE FORMAT EVOLUTION — state schemas evolve to support new features; without versioning discipline, old state becomes incompatible. (c) DEPENDENCY ACCRETION — new external dependencies get added over time; without rigorous manifest discipline, the manifest becomes incomplete. (d) ONBOARDING DRIFT — documentation goes stale as the system evolves; without active maintenance, the 5-day onboarding criterion fails. We are tracking the 3 successful pilots through 24-month observation and will report decay measurements in a v0.5 update.

References

  1. [1]
    MIT Sloan Management Review (2025), The AI Pilot-to-Production Gap.
  2. [2]
    Gartner (2025), Forecast: Enterprise AI Adoption Q4 2025.
  3. [3]
    BCG (2026), GenAI in the Enterprise H1 2026.
  4. [4]
    Deloitte (2025), State of Generative AI in the Enterprise H2 2025.
  5. [5]
    McKinsey (2026), The State of AI in Q1 2026.
  6. [6]
    Davenport T. & Mittal P. (2023), All in on AI, Harvard Business Review Press.
  7. [7]
    Westerman G. et al. (2024), Leading Digital Transformation, MIT Sloan.
  8. [8]
    Shapiro C. & Varian H.R. (1998), Information Rules: A Strategic Guide to the Network Economy, Harvard Business School Press, ch. 5-6.
  9. [9]
    Boehm B. (1981), Software Engineering Economics, Prentice-Hall.
  10. [10]
    Anthropic (2024-2025), Building Agents Cookbook.
  11. [11]
    Cognition Labs (2025), Devin Engineering Posts; Don't Build Multi-Agents, cognition.ai blog.
  12. [12]
    Karpathy A. (2024-2025), Production AI Talks and Notes.
  13. [13]
    Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems, arXiv:2604.02460. open ↗
  14. [14]
    Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1. open ↗
  15. [15]
    Cemri M. et al. (2025), Why Do Multi-Agent LLM Systems Fail? (MAST), arXiv:2503.13657v3, NeurIPS 2025 Datasets and Benchmarks Track. open ↗
  16. [16]
    Shinn N. et al. (2023), Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS.
  17. [17]
    Liu N. et al. (2024), Lost in the Middle, TACL.
  18. [18]
    Sumers T. et al. (2024), Cognitive Architectures for Language Agents, TMLR.
  19. [19]
    Cohen J. (1960), A Coefficient of Agreement for Nominal Scales, Educational and Psychological Measurement 20:37-46.
  20. [20]
    Madani Lab (2026), Portability Pillar Specification v0.3.4 (23 artifacts, MIT release).
  21. [21]
    Madani Lab (2026), 47-Pilot EU Field Study Dataset (anonymized aggregates, MIT release).
  22. [22]
    Madani Lab (2026), Procurement Clause Templates v1.0 (enterprise contract language).
  23. [23]
    Anthropic (2025), Claude Sonnet 4.5 Technical Report.
  24. [24]
    OpenAI (2025), GPT-5 Technical Report.
  25. [25]
    Google DeepMind (2025), Gemini 2.5 Technical Report.
← back to all papersMadani Lab · WAB v0.3.4