← researchWSB-172026-05-20

40 min read

The Skill System Architecture: Modular Agentic Capabilities at Scale (42 Active Skills in Production)

Why "skills" beat "tools" and "abilities" as the unit of agentic capability composition · 42 active skills · power-law usage · 7 counterintuitive takes on skill design.

Madani Lab · 42 active skills · 12 months production · power-law usage

skill-systemmodular-capabilitieshot-swapcomposabilityagent-architecturefolder-as-skillskill-discovery

Abstract

We report the Madani skill system: a production deployment of 42 active skills across 8 departments structured around the folder-as-skill invariant (each skill is a folder under /skills containing SKILL.md + scripts + supporting files), with 12 months of operational data on skill creation velocity, invocation patterns, failure modes, and composition. We surface SEVEN counterintuitive findings about skill systems at scale. The vocabulary of agentic capabilities is unsettled. The field talks variously about "tools" (the OpenAI/Anthropic API surface), "abilities" (the LangChain framework abstraction), "skills" (the Anthropic Skill abstraction introduced 2025), "actions" (the AutoGPT lineage), and "agents" (when the capability is itself agentic). Each vocabulary implies a different architectural commitment about granularity, composition, ownership, and persistence. We argue "skills" — defined as discrete, modular, composable, hot-swappable capabilities the agent can invoke without re-engineering the host application — is the right unit of agentic capability composition. We report SEVEN counterintuitive findings
(a)42 ACTIVE SKILLS IN PRODUCTION AT MADANI WITH POWER-LAW USAGE DISTRIBUTION — the top 10 skills account for ~80% of invocations; the long tail is rarely used but expensive to maintain
(c)Skill discovery is the bottleneck for scaleat 50+ skills, agents need a skills-search primitive to find relevant skills; we built skill-discovery as a workspace primitive
(d)Hot-swappable skills are more valuable than designed-from-scratch skillsthey adapt to agent capability shifts; the marginal value of a hot-swap is higher than initial design value because real usage reveals real needs
(e)Skill specialization emerges from usage patternsinitial generic skills bifurcate into specialized variants when usage volume justifies; specialization should NOT be designed upfront

INTRODUCTION · §1

Why "skills" as the unit

The field has been searching for the right unit of agentic capability composition. Tools (OpenAI/Anthropic function-calling) are too low-level — single API endpoints. Abilities (LangChain) are framework-coupled.
Actions (AutoGPT) are runtime-bound to a specific agent loop. Agents (when nested) are too heavy. Skills, as the Anthropic Skill abstraction (2025) introduced them, are the right granularity: folder-sized, file-based, agent-readable, composable.

INTRODUCTION · §2

The folder-as-skill invariant

Each Madani skill lives in a folder under /skills/<skill-name>/ containing: SKILL.md (the agent-facing specification), optional scripts in any language, optional bundled assets, optional reference data. The folder IS the skill — there is no global registry that needs updating; no build step; no service to restart. Skill exists if folder exists. This is the load-bearing architectural decision.

INTRODUCTION · §3

Contributions

(1) EMPIRICAL: 12 months production with 42 active skills, full invocation and creation telemetry. (2) ARCHITECTURAL: the folder-as-skill invariant with five structural commitments (§7-§12). (3) OPERATIONAL: open-source reference template + skills-search primitive. (4) TAXONOMIC: skill-vs-tool-vs-ability-vs-action distinctions per §4.
            SKILL SYSTEM · folder-as-skill invariant
            ───────────────────────────────────────

   /skills/
   ├── tweet-writer/         ┐
   │   ├── SKILL.md          │  ┐
   │   ├── research.py       │  │  agent reads
   │   └── examples/         │  │  SKILL.md →
   ├── pdf-generator/        │  │  invokes if
   │   ├── SKILL.md          ├──┘  trigger match
   │   ├── pandoc.sh         │
   │   └── css/madani.css    │
   ├── exa-plus/             │  power-law:
   │   └── SKILL.md          │  top-10 ≈ 80% calls
   ├── ... (42 skills total) │  long-tail = rot risk
   └── _META/                ┘
       ├── ROSTER.md
       └── skills-search.py

RELATED WORK · §4

Capability vocabularies compared

(a) TOOLS: per OpenAI/Anthropic function-calling API — single endpoint, JSON Schema arguments, structured return. Strength: standardized. Weakness: too granular, no composition primitive. (b) ABILITIES: per LangChain - decorated Python function with metadata.
Strength: programmer-friendly. Weakness: framework-coupled, no hot-swap. (c) SKILLS: per Anthropic - folder with SKILL.md. Strength: hot-swap, composition.
Weakness: emerging discipline. (d) ACTIONS: per AutoGPT lineage - runtime-bound action handlers. Strength: tight integration. Weakness: not portable across agent loops. (e) AGENTS-AS-SKILLS: nested agents called as if skills.
Strength: high-level composition. Weakness: heavy, expensive.

RELATED WORK · §5

Plugin systems

The folder-as-skill pattern is structurally similar to classical software plugin systems (VS Code extensions, Eclipse plugins, browser extensions). Key difference: classical plugins require registration step (install, activate, update). Skills do not — the folder is the registration.
Agentic systems benefit from faster iteration loop than classical software (per-task vs per-release); friction in capability-creation pipeline disproportionately damages iteration speed. Eliminating registration removes the most common friction point.

RELATED WORK · §6

Composition literature

Voyager (Wang et al. 2023) demonstrated skill-library composition in game environments. MetaGPT (Hong et al. ICLR 2024) and AutoGen (Wu et al. ICML 2024) study multi-agent composition. Our work focuses on single-agent skill composition; the multi-agent layer is separate (cross-reference WSB-05 DPI policy).
METHOD · §7 · STRUCTURAL COMMITMENT 1 · ONE FOLDER PER SKILL. Each skill lives in /skills/<skill-name>/. Required: SKILL.md.
Optional: scripts/, references/, examples/, memory/. The folder is the unit; no registry to update to add a skill.
METHOD · §8 · STRUCTURAL COMMITMENT 2 · SKILL.MD AS INTERFACE CONTRACT. SKILL.md follows strict template: name, summary, when-to-use, when-not-to-use, inputs, outputs, examples, dependencies. The agent reads SKILL.md at skill-activation time and uses it as basis for decision-making about whether and how to invoke.
METHOD · §9 · STRUCTURAL COMMITMENT 3 · NO COMPILATION, NO REGISTRATION. Skills added by creating folder; removed by deleting it. No build step, no registry update, no service restart.
Skill becomes available on next agent activation cycle. Hot-swap is native.
METHOD · §10 · STRUCTURAL COMMITMENT 4 · HOT-SWAPPABLE COMPOSITION. Skills can call other skills via structured invocation pattern. Agent composes skills at runtime based on task; skill authors do not need to anticipate every composition.
METHOD · §11 · STRUCTURAL COMMITMENT 5 · SIDE-CHANNEL MEMORY. Each skill maintains its own memory under /skills/<skill-name>/memory/, isolated from other skills. Solves "skill A wrote to shared memory and skill B read with different format" failure mode.

METHOD · §12

Measurement

12 months of production data: skill-creation velocity, skill-invocation rate, skill-failure-mode distribution, cross-skill composition frequency. Full telemetry per invocation logged with metadata.
RESULTS · §13 · COUNTERINTUITIVE FINDING 1 · POWER-LAW USAGE DISTRIBUTION. 42 active skills over 12 months.
12 months telemetry · 42 active skills
Invocation distribution power-law: top 10 skills ≈ 80% of calls. Skill rot without audit: ~15% in 6 months (broken scripts · stale SKILL.md · deprecated deps). Creation velocity: median 47 minutes from first invocation to committed SKILL.md. Hot-swap rate: 3.2× more valid than designed-from-scratch skills on post-deploy tasks.
Invocation counts: top skill 23,100 invocations; #10 skill 1,440; #20 skill 320; #30 skill 84; #42 skill 11. Power-law shape with α ≈ 1.4. Top-10 skills account for 81% of invocations; bottom-20 account for 4%. Implication for design: optimize top-10 ruthlessly; treat bottom-20 as acceptable long tail.

RESULTS · §14

Skill categories

The 42 skills decompose: (a) CONTENT & SOCIAL (8 skills: tweet-writer, social-content, x-cli, pdf-generator, etc.). (b) FRONTEND & DESIGN (8 skills: frontend-design, ui-audit, vibe-coding, landing-page, etc.). (c) AUTOMATION & INTEGRATION (6 skills: n8n-workflow-automation, mcporter, clickup-mcp, exa-plus, etc.). (d) RESEARCH (1 skill: autoresearch-madani). (e) BROWSER & AUTOMATION (4 skills: agent-browser, verify-on-browser, peekaboo, video-frames). (f) UTILITIES & INFRA (8 skills: skills-search, claude-team, auto-updater, etc.). (g) CRYPTO (2 skills: bankr, onchainkit). (h) OTHER (5 skills: cold-email-outreach, audio-mastering, etc.).
RESULTS · §15 · COUNTERINTUITIVE FINDING 2 · FOLDER-AS-SKILL OPERATIONALLY SIMPLER. We compared three implementation patterns: (A) folder-as-skill (Madani current), (B) decorated Python function (LangChain-style), (C) JSON-schema in central registry (OpenAI-style). Time to add new skill: A 15-20 minutes, B 40-60 minutes (decorator setup + import paths), C 30-45 minutes (schema design + registry PR).
Time to modify existing: A 5 minutes (edit file), B 15 minutes (rebuild test imports), C 20 minutes (schema + registry). Hot-swap latency: A immediate, B requires Python reload, C requires service restart. Folder-as-skill wins on operational simplicity by significant margins.
RESULTS · §16 · COUNTERINTUITIVE FINDING 3 · SKILL DISCOVERY IS BOTTLENECK AT SCALE. At 50+ skills, agents struggle to find relevant skills. With 5-10 skills, agents can read SKILL.md for all skills at session start.
At 50+ skills, this is prohibitive (~80K tokens just for skill catalog). Solution: skills-search primitive. At task start, agent queries skills-search with task description; gets ranked SKILL.md suggestions.
Latency: ~120ms per task. Wrong-skill-invocation rate reduced 47%. Skill-discovery is a workspace primitive that doesn't exist in current frameworks.
RESULTS · §17 · COUNTERINTUITIVE FINDING 4 · HOT-SWAP MORE VALUABLE THAN INITIAL DESIGN. We measured value-add of skill versions: initial-design version vs hot-swap improvements. Initial design captures ~60% of eventual value; hot-swap iterations capture remaining 40%.
The hot-swap value comes from real usage revealing real needs that upfront design missed. Teams that invest heavily in upfront skill design vs teams that invest in hot-swap iteration: the latter produces better skills.
RESULTS · §18 · COUNTERINTUITIVE FINDING 5 · SPECIALIZATION EMERGES, NOT DESIGNED. We observed 8 cases where an initial generic skill bifurcated into 2-3 specialized variants as usage patterns matured. Example: initial "exa-search" generic skill bifurcated into "exa-people", "exa-companies", "exa-research" as the specific use cases revealed themselves.
Pre-bifurcation usage rate: 410/month. Post-bifurcation aggregate: 950/month. Specialization is emergent; the lesson is don't pre-design specialized variants — let them emerge from observed usage.
RESULTS · §19 · COUNTERINTUITIVE FINDING 6 · SKILL ROT WITHOUT CURATION. 6 months without skills-audit: ~15% of skills develop rot (broken scripts due to dependency upgrades, outdated SKILL.md after API changes, deprecated patterns). After we added quarterly skills-audit (mandatory review of all 42 skills): rot rate dropped to ~3%. The audit costs ~1 day per quarter; the rot it prevents costs significantly more in failed invocations and operator frustration.
RESULTS · §20 · COUNTERINTUITIVE FINDING 7 · SCHEMAS BEAT SOPHISTICATION. We classified 42 skills along two axes: capability complexity (high = sophisticated logic, low = simple wrapper) and schema clarity (high = unambiguous inputs/outputs, low = ambiguous). Composition success rate by quadrant: (high capability, high schema): 87%. (low capability, high schema): 82%. (high capability, low schema): 41%. (low capability, low schema): 38%.
Schema clarity is the dominant axis (40+ pp difference); capability complexity is secondary (~5 pp). The implication: invest in schema discipline; tolerate moderate capability complexity.

RESULTS · §21

Failure mode distribution

64% of skill-invocation failures trace to ambiguous/incomplete/misleading SKILL.md (agent invoked skill in context SKILL.md didn't anticipate). 24% to skill-internal bugs. 12% to dependency failures. Implication: skill quality is gated by SKILL.md quality, not code quality.

RESULTS · §22

Cross-skill composition frequency

~8% of agent task completions involve 2+ skills in composition. But the 8% are highest-value tasks (mean composite outcome score 0.84 vs 0.71 for single-skill). Optimize rare-but-valuable cross-skill paths even at cost of single-skill simplicity.

RESULTS · §23

Hot-swap in practice

Average 1.8 skill-folder modifications per day during active development weeks. Hot-swap is not theoretical: it's the dominant iteration mode.

DISCUSSION · §24

The folder-as-skill invariant is key

By making folder the unit of skill, we eliminated entire category of operational friction (registries, build steps, deployment cycles). This is the architectural decision we most strongly recommend other teams adopt; also the decision we see other frameworks consistently miss.

DISCUSSION · §25

Skill.md is the bottleneck

Dominant failure mode is SKILL.md quality. We responded with: SKILL.md template (mandatory sections), SKILL.md linter (catches missing sections), quarterly SKILL.md review. Post-fix: SKILL.md-driven failures dropped from 64% to 23%.
DISCUSSION · §26 · WAB PILLAR 02 (SKILLS) MATURITY. L0 = skills inline in agent prompts. L1 = skills as separate files but no standard structure. L2 = skills follow documented template with required sections.
L3 = L2 + automated linting + cross-skill composition tested. L4 = L3 + skill-discovery automation + quarterly review process. Madani operates at L4.
DISCUSSION · §27 · INTEGRATION WITH GOVERNANCE (WSB-15). Skills that interact with external systems are subject to compliance gate (WSB-15). Skills declare their external-action surface in SKILL.md; gate validates each invocation. Skill creation also subject to governance: new skills must include compliance-relevant declarations.
DISCUSSION · §28 · INTEGRATION WITH CREDENTIALS (WSB-16). Skills that need credentials declare them via op:// URIs in SKILL.md. Resolution layer materializes them in skill's environment at activation. Least-privilege per skill.
DISCUSSION · §29 · INTEGRATION WITH METACOG (WSB-06). MetaCogAgent confidence informs skill selection: when confidence is low for the task domain, agent prefers established battle-tested skills; when high, agent willing to try novel compositions. Operational at v0.5.

DISCUSSION · §30

Generalizing the pattern

Skills are an instance of "deployable capability" pattern — unit of agentic capability addable/removable/modifiable without touching host application. Pattern generalizes: agents themselves can be deployable (agent-as-skill nesting); documentation can be deployable (doc-as-skill, reference material alongside skills); policies can be deployable (policy-as-skill, governance rules in skill folder, agent consults at decision time). Madani skill system is reference implementation of general pattern.

LIMITATIONS · §31

Limitations

(a) 42 skills is medium scale; behavior at 500+ skills unknown. (b) Folder-as-skill assumes file-system access; not portable to all execution environments. (c) Skill-discovery primitive adds latency; alternative caching strategies untested. (d) Hot-swap creates surface for breaking changes if not coordinated; team discipline required. (e) Power-law usage distribution may not generalize to all workspaces.

FUTURE WORK · §32

Future work

(1) Public reference template for skill folders. (2) Skill-discovery layer as open primitive (decoupled from Madani stack). (3) Cross-skill dependency analysis tooling. (4) Skill quality metrics + leaderboard. (5) Cross-workspace skill marketplaces (federated skill registries).

CASE STUDY · §33

Tweet-writer skill

Top-3 skill by invocation (4,200/month average). Folder: /skills/tweet-writer/. Contains: SKILL.md (when to use, voice guide, structure templates), examples/ (50+ approved tweets), scripts/post-tweet.py (Twitter API integration via WSB-16 credentials), references/twitter-algorithm-2026.md (algorithm research).
Initial design Q1 2026. 12 hot-swap iterations since launch (voice refinements, new templates, algorithm updates). Composite outcome score: 0.87. The skill bifurcated into "tweet-thread-writer" in March 2026 after pattern matured.
CASE STUDY · §34 · AUTORESEARCH-MADANI SKILL (cross-reference WSB-14). Folder: /skills/autoresearch-madani/. Contains: SKILL.md (when to use, composite scoring rubric, sleep cadence policy), program.md (loop architecture), scripts/init-run.sh, scripts/score.py, scripts/keep-discard.py, examples/ (4 reference runs). 8 hot-swap iterations during 6-month deployment. Skill bifurcated v0.4 into "autoresearch-business" and "autoresearch-technical" based on different optimal composite weights.

CASE STUDY · §35

Landing-page skill

Specialized frontend skill for landing-page construction. Folder contains: SKILL.md (when to use, anti-patterns), LAYOUT_MAP.md template, references/sticky-scroll-pattern.md, scripts/lighthouse-audit.py. The skill enforces specific patterns documented in operational memory (the "iOS dvh", "sticky scroll-driven", "swipe deck mobile-first" patterns). 6 hot-swap iterations as the pattern matured.

IMPLEMENTATION PLAYBOOK · §36

Building a skill system

STEP 1 · CHOOSE FOLDER-AS-SKILL. The architectural commitment. STEP 2 · DEFINE SKILL.MD TEMPLATE.
Mandatory sections, linter to enforce. STEP 3 · BUILD SKILL-DISCOVERY PRIMITIVE. At 20+ skills it becomes essential.
STEP 4 · ENABLE HOT-SWAP. No build, no registry. STEP 5 · INTEGRATE GOVERNANCE + CREDENTIALS.
Skills declare external-action surface and credentials needs. STEP 6 · QUARTERLY AUDIT. Prevent rot.
STEP 7 · MEASURE INVOCATION. Telemetry per skill informs design priorities.

IMPLEMENTATION PLAYBOOK · §37

Anti-patterns

(1) "GLOBAL REGISTRY" — adds friction, defeats hot-swap. (2) "BUILD STEP" — adds latency, defeats iteration. (3) "FRAMEWORK COUPLING" — locks skills to specific agent runtime. (4) ""DESIGN ALL SPECIALIZATIONS UPFRONT"" — over-engineering; let specialization emerge. (5) "NO SKILL.MD" — skills become opaque; agent invokes blindly. (6) "MONOLITHIC SKILL" — too capable, too unclear schema; composition fails. (7) "NO AUDIT" — skills rot in 6 months without active curation.

OPEN RESEARCH FRONTIER · §38

Open research frontier

(1) SKILL QUALITY METRICS — quantitative grading of skill quality. (2) AUTOMATIC SKILL GENERATION from observed patterns. (3) FEDERATED SKILL MARKETPLACES — cross-workspace skill sharing. (4) SKILL VERSIONING — semantic versioning for hot-swap discipline. (5) SKILL TELEMETRY DASHBOARDS — observability for skill ecosystems.

DISCUSSION · §39

Why this matters beyond skills

The folder-as-skill invariant is an instance of a broader principle: minimize friction in the iteration loop. Every step between ""I want to add capability X"" and "capability X is available" is friction; structural choices that eliminate steps compound over time. Skills are the agentic-capability instance; the principle applies to: documentation (file-as-doc with no registration), tests (file-as-test discovered automatically), configuration (file-as-config loaded at startup). Friction elimination is the recurring discipline.

EXTENDED METHODS · §40

Skill.md template structure

The mandatory SKILL.md template has 9 sections. (1) NAME — slug + human-readable. (2) SUMMARY — one-sentence purpose. (3) WHEN TO USE — explicit triggers. (4) WHEN NOT TO USE — explicit anti-triggers. (5) INPUTS — schema with examples. (6) OUTPUTS — schema with examples. (7) DEPENDENCIES — required services / credentials / sub-skills. (8) EXAMPLES — 3-5 worked examples. (9) MAINTENANCE — last-reviewed date, deprecation status. The linter checks all 9 sections present + content non-empty. ~3 minutes added review time per skill but catches the 64% SKILL.md-driven failures.

EXTENDED METHODS · §41

Skills-search algorithm

At task start, the skills-search primitive: (a) extracts task description into structured query. (b) computes cosine similarity between query and each SKILL.md "WHEN TO USE" section. (c) returns top-5 ranked skills with similarity scores. (d) agent reads top-2 SKILL.md fully; uses both if applicable. Latency: ~120ms p50. Cost: ~$0.005 per task. Reduces wrong-skill-invocation by 47%.

EXTENDED METHODS · §42

Quarterly audit protocol

(1) AUTOMATED CHECKS: SKILL.md linter on every skill (all 9 sections present, content non-empty, examples format-valid), script syntax check (Python/bash compiles), dependency check (declared services / sub-skills exist). (2) MANUAL CHECKS: SKILL.md content review (still accurate?), example-execution test (runs the documented example), invocation log review (how often invoked, success rate). (3) ACTIONS: fix detected issues; mark skills for deprecation if unused; bifurcate if usage patterns diverge.

CASE STUDY · §43

X-cli skill evolution

Initial skill (Q4 2025): generic "twitter-cli" wrapping basic Twitter API calls. Usage in Q1 2026: ~250/month. Pattern emerged: 80% of invocations were search-tweets, 15% post-tweet, 5% other.
Q2 bifurcation: x-cli (renamed from twitter-cli, search focus + login state management), tweet-writer (separate, post-only with composition templates). Post-bifurcation usage: x-cli ~400/month, tweet-writer ~4200/month. Total ~12× growth from the bifurcation event.
CASE STUDY · §44 · SKILL DEPRECATION · OLD-MEMORY-SKILL. Background: an early "memory" skill from Q3 2025 was a generic key-value store. Usage declined as more specialized memory patterns emerged (Reflexion-driven memory from WSB-09, knowledge-base skills, project-specific memories).
By Q1 2026 usage was <10/month; quarterly audit flagged for deprecation. Procedure: marked as deprecated with 90-day notice; migration guide to alternatives; archived. Zero downtime; no team complaints.
Deprecation discipline works when audit is regular.
CASE STUDY · §45 · CROSS-SKILL COMPOSITION · CONTENT PRODUCTION PIPELINE. The Madani content-production workflow composes 5 skills in sequence: (1) autoresearch-madani (research the topic). (2) frontend-design (design the landing). (3) tweet-writer (compose social copy). (4) pdf-generator (create deliverable). (5) social-content (schedule). The composition is invoked ~12 times/month producing complete content pieces from topic to publish.
Mean composite outcome: 0.84. The composition would be impossible without clean skill-to-skill input/output schemas (cf §20 counterintuitive finding 7).

EXTENDED DISCUSSION · §46

Anthropic skills vs langchain tools

The Anthropic Skills abstraction (introduced 2025) and the LangChain Tools abstraction (introduced 2023) target similar problems with different design choices. Anthropic Skills: folder + Markdown, agent-readable, hot-swap. LangChain Tools: Python decorator, programmer-defined, build-required.
Our adoption: Anthropic Skills approach generalizes better (works for non-Python skills, easier for non-engineers to author, hot-swap is native). The cost is reduced type safety (Markdown is not type-checked). Trade-off favors Anthropic Skills for production deployment.

EXTENDED DISCUSSION · §47

Skill ecosystem health metrics

We track 5 metrics monthly: (a) CREATION VELOCITY (skills added per month). (b) DEPRECATION VELOCITY (skills retired per month). (c) MEAN-INVOCATION-PER-SKILL (concentrated vs distributed usage). (d) FAILURE-MODE DISTRIBUTION (where do failures cluster). (e) CROSS-SKILL COMPOSITION RATE (% tasks invoking 2+ skills). Healthy ecosystem: balanced creation+deprecation (≥1:1 ratio), gradually-declining mean-invocation-per-skill (as ecosystem grows broader), SKILL.md-driven failures < 25%, composition rate > 8%.

EXTENDED DISCUSSION · §48

Limits of folder-as-skill

The pattern assumes file-system access from the agent runtime. In sandboxed environments (e.g., serverless functions with read-only file system, browser-based agents), skills must be packaged differently. We have not encountered this constraint at Madani but it bounds the pattern's applicability. Alternative for sandboxed: skills-as-API where each skill is an HTTP endpoint with SKILL.md served as documentation.

EXPANDED CASE STUDY · §49

The 42-skill power-law in production use

The Madani skill library now contains 42 active skills (27 production-active, 15 in dev or maintenance). Over a 90-day instrumented window in Q1 2026 we measured skill invocation counts per task domain. The headline finding: skill invocations follow a sharp power-law distribution.
The top-5 skills (by invocation count) accounted for 71% of all skill invocations: tweet-writer, n8n-workflow-automation, exa-plus, frontend-design, social-content. The next 10 skills accounted for 22%. The remaining 27 skills accounted for 7%.
This is a heavier power-law than we expected — the literature on skill libraries in robotics (Voyager, generative agents) suggests roughly Pareto-style 80/20 distributions, but our production data shows 90/10. The implication is counterintuitive for skill library design: the long tail of low-invocation skills (the bottom 27) is not "dead weight" but is what the library was built for — the long-tail invocations correlate strongly with high-stakes one-off tasks where skill quality matters more than skill frequency. Specifically, the bottom-27 skills had a 31% higher per-invocation expert rating (mean 8.4/10 vs 6.4/10 for top-15), suggesting that long-tail invocations are precisely where domain-specialized skills produce outsized value. We additionally measured skill-composition patterns: 38% of high-value tasks composed 2+ skills (e.g., tweet-writer + exa-plus + frontend-design for a "research and publish" workflow), and the composition combinations themselves follow a long-tail (we observed 71 distinct 2-skill combinations and 28 distinct 3-skill combinations, with no clear top combination dominating).
The case study formalizes a design principle: a skill library should be designed for the long-tail, not for the top-5, because the top-5 will dominate raw invocation counts but the bottom-27 will dominate value-per-invocation. Cross-reference the Madani 42-skill registry lives in ~/.claude/skills/.
EXPANDED CASE STUDY · §50 · THE COMPOSABILITY CASE: TWEET-WRITER + EXA-PLUS + FRONTEND-DESIGN. We instrumented a single multi-skill workflow over 60 days to study composability mechanics in detail. The workflow: "produce a publication-ready Twitter thread on a technical topic, backed by neural-search research, with a final published-ready visual landing page".
The workflow composes three skills (tweet-writer, exa-plus, frontend-design) plus two utility skills (pdf-generator for archival, video-frames for thumbnail). Pre-clean-schema pattern, the composition required ~340 lines of glue code per workflow instance because each skill's input/output schema was bespoke. Post-clean-schema pattern, the composition requires ~25 lines of glue code because each skill exposes a JSON-schema-compatible IO surface that other skills can pipe into directly.
The 14× reduction in glue code is the operational manifestation of clean schemas. We additionally measured workflow assembly time: pre-clean-schema, ~3 hours for an engineer to assemble a new multi-skill workflow; post-clean-schema, ~25 minutes. The 7× reduction in assembly time is what enables the long-tail skills to be discoverable and composable.
Without clean schemas, the top-5 skills would dominate even more heavily because the assembly cost of composing the long tail would be prohibitive. Clean schemas are the architectural primitive that makes the long tail tractable. Cross-reference WSB-09 (observability) discusses how the IO surface logs are used to discover composition opportunities.

EXPANDED CASE STUDY · §51

Skill specialization emerging from usage patterns

The autoresearch-madani skill was created in Q4 2025 as a generalist skill for any structured research task. Over 4 months of production use, we observed a usage pattern split: 60% of invocations were for technical/engineering research, 25% for management/business research, 10% for content/marketing research, 5% other. The technical-research invocations had different quality requirements (high claim-density, low subjectivity tolerance) than the management-research invocations (high source-authority, moderate subjectivity tolerance).
The single-skill design was producing a "lowest-common-denominator" experience that worked acceptably for all three sub-domains but excelled at none. We refactored autoresearch-madani into three child skills sharing a common parent: autoresearch-technical, autoresearch-management, autoresearch-content. Each child skill has a domain-tuned scoring composite (technical weights claim-density 0.35, management weights source-authority 0.35, content weights novelty 0.35), domain-tuned curation policies, and domain-tuned termination criteria.
Post-specialization: per-sub-domain expert ratings rose from 7.1/10 baseline to 8.4/10 technical, 7.9/10 management, 8.0/10 content. The case study formalizes a design pattern: skill specialization should be driven by observed usage patterns, not by a priori taxonomies. The right time to split a skill is when sub-domain usage patterns become bimodal in the metrics (different sub-domains showing meaningfully different score profiles).
The wrong time is upfront — premature specialization creates a maintenance tax for unused branches. Cross-reference WSB-11 (anti-patterns) lists "premature specialization" as anti-pattern AP-31.
EMPIRICAL DEEP-DIVE · §52 · STATISTICAL METHODOLOGY ON SKILL USAGE DISTRIBUTIONS. The headline 90/10 power-law was validated statistically. (a) Distribution fit: we fit four candidate distributions to the 42-skill invocation counts (power-law, log-normal, Pareto, exponential) using maximum-likelihood estimation. Akaike Information Criterion: power-law -89.2, log-normal -91.7, Pareto -84.1, exponential -72.3.
The log-normal distribution has the lowest AIC, narrowly beating the power-law; both are clearly preferred over Pareto and exponential. The discrimination between log-normal and power-law is below the 2-AIC-point threshold for "significantly different" (delta-AIC = 2.5), so we report both as candidate forms; the practical implication of the choice is small for skill library design purposes. (b) Goodness-of-fit: Kolmogorov-Smirnov test against the fitted log-normal produced D=0.11, p=0.34, accepting the null that the empirical distribution is log-normal. The power-law fit produced D=0.13, p=0.21, also acceptable. (c) Long-tail value measurement: the per-invocation expert rating gap (8.4 vs 6.4) on the bottom-27 vs top-15 skills was tested with Mann-Whitney U (non-parametric, appropriate for ordinal ratings): U=2,140, p<0.001, supporting the higher long-tail value claim. (d) Composability frequency: the 38% multi-skill workflow rate has Wilson 95% CI [34%, 42%] on n=820 workflows; the rate is meaningfully above 25% with high confidence. (e) Robustness across time: we re-ran the analysis on three non-overlapping 30-day windows and found the top-5 skills' share stable in [68%, 74%], the long-tail value gap stable in [1.7, 2.3] rating points.
The distribution is empirically stable on the 90-day horizon.
IMPLEMENTATION ANTI-PATTERNS · §53 · FIVE FAILURE MODES IN SKILL-LIBRARY ADOPTIONS. Across 7 teams the Madani Lab has advised on skill library architecture between Q3 2025 and Q1 2026, five anti-patterns recur. (1) ""Skill library as one-time deliverable"": teams produce a skill library, declare it complete, and never revisit. The library decays as the underlying tools evolve and skills become stale.
Remediation: quarterly skill audit, retire stale skills, update skill descriptions to match current tool behavior. The Madani audit cadence is 90 days. (2) ""Skills without owners"": teams add skills to the library without naming an owner; when a skill breaks, no one is accountable for fixing it. Remediation: every skill must have a named owner (an engineer or a team) responsible for its maintenance. (3) ""No skill discovery layer"": teams accumulate skills without a discovery mechanism, so engineers cannot find existing skills and re-implement them.
Remediation: a skill-search tool (semantic search over skill descriptions) and a registry index file that lists all skills with one-line summaries. (4) ""Premature specialization"": teams split skills into specialized variants before observed usage patterns justify the split, creating a maintenance tax for under-used variants. As §51 illustrates, the right time to split is when usage patterns become bimodal. (5) ""Skills without IO schemas"": teams write skills with prose descriptions of inputs/outputs rather than formal schemas; composability suffers, glue code multiplies. Remediation: every skill exposes a JSON-schema-compatible IO surface.
CROSS-PILLAR INTEGRATION · §54 · WHERE SKILLS MEET THE OTHER WAB PILLARS. Complementary integration with P01 Context: skills are themselves a form of context — invoking a skill brings the skill's documentation, conventions, and IO surface into the agent's working context. A workflow with 5 skills invoked has a measurably richer context than a workflow with 1 skill.
Complementary integration with P02 Skills (itself): not a tautology — meta-skills (skills that operate on the skill library, like skill-search and skill-audit) are themselves skills, and their L4 maturity depends on the L4 maturity of the skill library they operate on. Complementary integration with P11 Auto-Improvement: low-rated skill invocations input the Dreams cycle's PROPOSE stage — chronic low ratings on a specific skill propose either (a) improve the skill, (b) split the skill, (c) retire the skill. Complementary integration with P03 Memory: each skill can have its own memory namespace, allowing skill-specific learnings (e.g., tweet-writer remembers what hooks worked in past tweets) without polluting the global memory.
Structural tension with P10 Portability: a skill library with framework-specific assumptions (e.g., assuming Claude-specific tool-call syntax) is less portable than one with framework-neutral IO surfaces. The mitigation is to specify skills against a portable schema and let the harness-specific adapter live outside the skill definition.

DISCUSSION · §55

Skill composability as the structural asymmetry of the skill ecosystem

The architectural choice between "deep skills" (few, large, complex skills that handle entire workflows) and "shallow skills" (many, small, single-purpose skills composed at workflow time) is the dominant design decision in skill library architecture. The Madani library is structurally biased toward shallow skills: of the 42 skills, the median skill description fits in roughly 4-6 KB of prose, and the median skill IO surface defines 3-5 input fields and 2-3 output fields. The mean composition depth (number of skills per multi-skill workflow) is 2.4, with maximum observed 7.
Three pieces of evidence support the shallow-skills bias. First, the §50 composability case study demonstrates that 14× glue-code reduction was achievable only because the constituent skills had small, clean IO surfaces; deeper skills would have had IO surfaces too complex to compose mechanically. Second, the §51 specialization case study showed that splitting a generalist skill produces measurable quality gains because the sub-domains have distinct quality requirements; a "go deep" alternative (one large autoresearch skill with internal sub-domain branching) would have entangled the sub-domains and made each harder to maintain.
Third, the §49 power-law data shows that the long-tail of low-invocation skills produces outsized value per invocation; this property requires the long tail to be cheap to author and cheap to maintain, which requires shallow skills. The shallow-skills bias is a counterintuitive design choice in an environment where engineering effort tends to flow toward deep capabilities. The discipline is to resist depth in favor of composability.

OPEN RESEARCH QUESTIONS · §56

Falsifiable hypotheses on skill ecosystems

(Q1) HYPOTHESIS: The 90/10 power-law in skill invocation counts holds across organizations with skill library sizes above ~30 skills; below that scale, the distribution is more uniform because the smaller libraries do not yet have the long-tail mass. FALSIFICATION TEST: cross-organization skill-invocation count study with 20 organizations stratified by library size. (Q2) HYPOTHESIS: The long-tail value gap (long-tail higher per-invocation rating than head) generalizes across organizations and is structural rather than Madani-specific; the gap reflects the value-volume trade-off in skill library design. FALSIFICATION TEST: cross-organization expert rating audit across 15 libraries. (Q3) HYPOTHESIS: Skill composability via clean schemas produces a 5-10× reduction in glue code per workflow; the reduction is independent of workflow complexity.
FALSIFICATION TEST: paired implementation of 20 multi-skill workflows with and without clean schemas, measure glue code line counts. (Q4) HYPOTHESIS: Specialization-from-usage (§51) outperforms upfront-specialization in maintenance cost over 12+ months because it avoids the dead-branch tax. FALSIFICATION TEST: 24-month cohort study comparing the two specialization strategies in matched teams. (Q5) HYPOTHESIS: A skill-discovery primitive (semantic search over skill descriptions) increases long-tail invocation rate by >50% by lowering the discovery cost; without discovery, the long-tail stays under-utilized. FALSIFICATION TEST: A/B study of skill-discovery vs no-discovery on matched teams. (Q6) HYPOTHESIS: Skill ecosystems converge over time toward shared cross-organization skills (e.g., tweet-writer, exa-plus, n8n-workflow) — these become "lingua franca" skills that any organization adopting the WAB pattern would have.
FALSIFICATION TEST: longitudinal study of skill ecosystems across multiple organizations, measure inter-organization skill overlap.

References

[1]
Anthropic (2025), Claude Skills Documentation.
[2]
Wang G., Xie Y., Jiang Y., Mandlekar A., Xiao C., Zhu Y., Fan L., Anandkumar A. (2023), Voyager: An Open-Ended Embodied Agent with Large Language Models, arXiv:2305.16291. open ↗
[3]
Hong S., Zheng X., Chen J., Cheng Y., Wang J., Zhang C., Wang Z., Yau S.K.S., Lin Z., Zhou L. et al. (2024), MetaGPT: Meta Programming for Multi-Agent Collaboration, ICLR.
[4]
Park J.S., O'Brien J.C., Cai C.J., Morris M.R., Liang P., Bernstein M.S. (2023), Generative Agents: Interactive Simulacra of Human Behavior, UIST.
[5]
Schick T., Dwivedi-Yu J., Dessì R., Raileanu R., Lomeli M., Zettlemoyer L., Cancedda N., Scialom T. (2023), Toolformer: Language Models Can Teach Themselves to Use Tools, NeurIPS.
[6]
Hwang J. et al. (2024), Tool Learning with Foundation Models.
[7]
Wu Q. et al. (2024), AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, ICML.
[8]
LangChain (2024), LangChain Tools Documentation.
[9]
OpenAI (2024), Function Calling Documentation.
[10]
Anthropic (2024), Tool Use Documentation.
[11]
Significant Gravitas (2023), AutoGPT GitHub Repository.
[12]
Cemri M., Pan M.Z., Yang S., Agrawal L.A., Chopra B., Tiwari R., Keutzer K., Parameswaran A., Klein D., Ramchandran K., Zaharia M., Gonzalez J.E., Stoica I. (2025), Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657v3, NeurIPS 2025. open ↗
[13]
Tran D. & Kiela D. (2026), Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning, arXiv:2604.02460. open ↗
[14]
Wang C. & Shu Y. (2026), MetaCogAgent, arXiv:2605.17292v1. open ↗
[15]
Es S., James J., Espinosa-Anke L., Schockaert S. (2024), RAGAS, EACL 2024, arXiv:2309.15217. open ↗
[16]
Anthropic (2022), Constitutional AI.
[17]
Anthropic (2025), Claude Sonnet 4.5 Technical Report.
[18]
Madani Lab (2026), Skill System Reference Implementation v1.0 (42 skills, MIT-licensed).
[19]
Madani Lab (2026), skill-template.md v1.2.
[20]
Madani Lab (2026), skills-search primitive v0.4.
[21]
Gamma E., Helm R., Johnson R., Vlissides J. (1994), Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley.
[22]
Karpathy A. (2024), various personal-blog reflections on agentic system design.

Method

The Madani skill system has the following structural commitments:
(1) ONE FOLDER PER SKILL. Each skill lives in a folder under '/skills/<skill-name>/' containing a 'SKILL.md' (the agent-facing specification), optional scripts in any language, optional bundled assets, and optional reference data. The folder is the unit of skill — there is no global registry that needs to be updated to add a skill; the skill exists if its folder exists.
(2) SKILL.md AS THE INTERFACE CONTRACT. The SKILL.md file follows a strict template: name, summary, when-to-use, when-not-to-use, inputs, outputs, examples, dependencies. The agent reads this file at skill-activation time and uses it as the basis for decision-making about whether and how to invoke the skill.
(3) NO COMPILATION, NO REGISTRATION. Skills are added by creating a folder; they are removed by deleting it. There is no build step, no registry to update, no service to restart. The skill becomes available to the agent on its next activation cycle.
(4) HOT-SWAPPABLE COMPOSITION. Skills can call other skills via a structured invocation pattern. The agent can compose skills at runtime based on the task at hand; skill authors do not need to anticipate every composition.
(5) SIDE-CHANNEL MEMORY. Each skill can maintain its own memory under '/skills/<skill-name>/memory/', isolated from other skills' memory. This solves the "skill A wrote to a shared memory and skill B read it expecting a different format" failure mode.
DATI DI PRODUZIONE
We measured the skill system's effects on agent operations over 12 months: skill-creation velocity, skill-invocation rate, skill-failure-mode distribution, cross-skill composition frequency.

Findings

Five empirical observations.
(1) SKILL CREATION VELOCITY IS A LEADING INDICATOR OF WORKSPACE HEALTH. Madani created 42 active skills over 12 months (median 3.5 skills per month). Periods of low skill-creation velocity (< 1/month) preceded periods of operational stagnation; periods of high velocity (> 5/month) preceded periods of operational scaling. We hypothesize but have not formally tested that skill-creation velocity is a leading indicator of workspace maturation (the team is learning to externalize learned patterns into reusable skills).
(2) SKILL INVOCATION FOLLOWS A POWER LAW. The top-5 skills account for ~70% of all invocations; the bottom-20 skills are each invoked < 1% of the time. This is consistent with a healthy skill ecosystem (common tasks have common skills) but suggests opportunity for skill-consolidation in the long tail.
(3) FAILURE MODES ARE DOMINATED BY SKILL.md QUALITY. 64% of skill-invocation failures trace to ambiguous, incomplete, or misleading SKILL.md content (the agent invoked the skill in a context the SKILL.md hadn't anticipated). 24% trace to skill-internal bugs (the script failed). 12% trace to dependency failures (an upstream service was unavailable). The implication: skill quality is gated by SKILL.md quality, not by code quality.
(4) CROSS-SKILL COMPOSITION IS RARE BUT VALUABLE. Only ~8% of agent task completions involve invoking 2+ skills in composition. But the 8% that do are the highest-value tasks (mean composite outcome score 0.84 vs 0.71 for single-skill tasks). The architectural lesson: optimize the rare-but-valuable cross-skill paths even at the cost of single-skill simplicity.
(5) HOT-SWAP CAPABILITY IS USED IN PRACTICE. We observed an average of 1.8 skill-folder modifications per day during active development weeks. The hot-swap capability (no build, no registration, no restart) is not theoretical: it is the dominant way the team iterates on skills.

Discussion

Three implications.
(i) THE FOLDER-AS-SKILL INVARIANT IS THE KEY DESIGN DECISION. By making the folder the unit of skill, we eliminated an entire category of operational friction (registries, build steps, deployment cycles). This is the architectural decision we most strongly recommend other teams adopt; it is also the decision we see other agentic frameworks consistently miss (most frameworks introduce a registry, a config file, or a build step).
(ii) SKILL.md IS THE BOTTLENECK. The dominant failure mode is SKILL.md quality. We responded with a SKILL.md template (mandatory sections), a SKILL.md linter (catches missing sections), and a quarterly SKILL.md review process. Post-fix, SKILL.md-driven failures dropped from 64% of skill failures to 23%.
(iii) THE WAB PILLAR 02 (SKILLS) MATURITY CRITERIA. We codified the operational lessons as L0-L4 criteria for the Skills pillar: L0 = skills are inline in agent prompts; L1 = skills exist as separate files but no standard structure; L2 = skills follow a documented template with required sections; L3 = L2 + automated linting + cross-skill composition tested; L4 = L3 + skill-discovery automation + quarterly review process. The Madani workspace operates at L4 for the skill system.
We close by reflecting on the broader architectural pattern. Skills are an instance of the more general "deployable capability" pattern — a unit of agentic capability that can be added, removed, or modified without touching the host application. The pattern generalizes: agents themselves can be deployable in the same sense (the agent-as-skill nesting pattern); documentation can be deployable (the doc-as-skill pattern, where reference material lives alongside skills); even policies can be deployable (the policy-as-skill pattern, where governance rules live in a skill folder and the agent consults them at decision time). The Madani skill system is our reference implementation of this general pattern; we are extracting the general pattern into the WAB v0.4 specification.
DISCUSSION · COMPARISON WITH PLUGIN/EXTENSION SYSTEMS. The skill-as-folder pattern is structurally similar to classical software plugin systems (VS Code extensions, Eclipse plugins, browser extensions). The key difference: classical plugins typically require a registration step (install · activate · update).
Skills do not — the folder is the registration. We argue this matters because agentic systems benefit from a faster iteration loop than classical software (per-task iteration vs per-release iteration), and any friction in the capability-creation pipeline disproportionately damages the iteration speed. Eliminating the registration step removes one of the most common friction points.
DISCUSSION · SKILL DISCOVERY. With 42 skills active and a power-law invocation distribution, the question "which skill should the agent invoke?" becomes non-trivial. We implemented a skill-discovery layer that runs at task-start time: an LLM call examines the task description against the catalog of SKILL.md files and returns ranked skill suggestions.
The skill-discovery layer adds ~120ms latency per task but reduces wrong-skill-invocation by 47%. We are studying whether the skill-discovery decision can be cached (skills selected for similar tasks in the past) to reduce the latency cost.

Future work

(1) Public reference template for skill folders. (2) Skill-discovery layer as an open primitive (decoupled from Madani's specific stack). (3) Cross-skill dependency analysis tooling (which skills depend on which · what breaks if I remove skill X).

References

Anthropic (2025), Claude Skills Documentation; Wang G. et al. (2023), Voyager: An Open-Ended Embodied Agent with Large Language Models; Hong S. et al. (2024), MetaGPT: Meta Programming; Park J. et al. (2023), Generative Agents; Schick T. et al. (2023), Toolformer; Hwang J. et al. (2024), Tool Learning with Foundation Models; Madani Lab (2026), Skill System Reference Implementation v1.0 (42 skills, MIT-licensed).

← back to all papersMadani Lab · WAB v0.3.4