Skills: Progressive Context Disclosure


Your system prompt is 5,000 tokens and growing. Every new feature makes your agent slower, more expensive, and dumber. There’s a better way.


The Problem

You start with a simple agent. A few rules. A persona. It works.

Then requirements grow:

  • “Add database query syntax.”
  • “Add our coding standards.”
  • “Add the API documentation.”
  • “Add error handling patterns.”

Before you know it, you’ve created The Prompt Blob Monster—a 10,000-token system prompt that tries to do everything.

Imagine a football manager who insists on playing all 20 squad members simultaneously. Ronaldo, Rooney, Scholes, Ferdinand, Vidić, Giggs, Tevez—everyone on the pitch at once. It sounds powerful. In reality? Chaos. Players trip over each other. No one knows their position. The team is slower, not faster.

That’s what a bloated system prompt does to your agent.

| The Enterprise Risk | What Happens |
| --- | --- |
| 💸 Cost Explosion | Every request pays for tokens the agent doesn’t need right now. |
| 🐢 Latency Creep | More tokens = slower first-token-time, especially at scale. |
| 🧠 Context Rot | Research shows LLMs lose reasoning quality in the “middle” of long contexts. |
| 🎯 Instruction Fog | Too many rules = the model forgets which ones matter now. |

The villain isn’t the LLM. It’s the architecture.

📊 The Reality Check:

| What the Industry Shows | Why It Matters | Source |
| --- | --- | --- |
| Simpler chunking (512 tokens) outperformed complex AI-driven methods in accuracy — FloTorch Study, Feb 2026 | Counterintuitively, loading less context often produces better results — supporting the case for strategic, selective disclosure. | FloTorch RAG Benchmark |
| Average enterprise LLM spend hit $7M/year in 2025, projected $11.6M in 2026 — a16z, 2025 | As token costs scale into millions per year, progressive disclosure becomes one of the most direct levers for cost control. | a16z Enterprise AI Report |
| 70% of AI engineers have RAG in production or plan to deploy within 12 months — AI Engineering Survey, 2025 | With RAG becoming standard infrastructure, how you manage context loading — not whether you do it — increasingly determines quality. | AI Engineering Survey (community survey) |

The Concept

Skills are procedural memory—loaded on demand.

Instead of stuffing everything into one system prompt, you organize knowledge into discrete Skill files. The agent loads only what it needs, when it needs it.

💡 The Key Insight: Google’s Context Engineering guide defines this as Procedural Memory—“How-to” knowledge that’s retrieved just-in-time, not pre-loaded.

Think of it like a football manager’s bench.

Sir Alex Ferguson didn’t start every match with Ronaldo, Rooney, Tevez, Berbatov, and Scholes all on the pitch simultaneously. That’s chaos—too many creative players, no defensive structure. Instead, he read the game. Needed more pace against a tired defense? Bring on a fresh winger. Needed to hold a lead? Sub in a defensive midfielder.

The bench is your skill library. You don’t load every skill into context—just the one that matches the current situation. The 20-player squad exists, but only 11 play at a time.

This is Progressive Context Disclosure.


How It Works

The Context Window Is a Public Good

This is the key insight from Anthropic’s official Skills architecture: the context window is a public good. Your skill shares the context window with everything else — the system prompt, conversation history, other skills’ metadata, and the user’s actual request.

Not every token in your skill has an immediate cost. At startup, only the metadata (name and description) from all skills is pre-loaded. Claude reads SKILL.md only when the skill becomes relevant, and reads additional files only as needed. But being concise in SKILL.md still matters — once Claude loads it, every token competes with conversation history and other context.

💡 The Default Assumption: Claude is already very smart. Only add context Claude doesn’t already have. Challenge each piece of information: “Does Claude really need this explanation?” “Can I assume Claude knows this?” “Does this paragraph justify its token cost?”

The Skills Architecture

flowchart TD
    subgraph System["🎯 Always Active"]
        P["🎭 Persona"]
    end
    subgraph OnDemand["📚 Loaded When Needed"]
        S1["Skill: Git"]
        S2["Skill: Database"]
        S3["Skill: Code Review"]
    end
    U["👤 User: 'Review this PR'"] --> A["🤖 Agent"]
    A --> P
    A -.->|loads| S3
    S3 --> R["✅ Review Complete"]

The SKILL.md Pattern

Each skill lives in its own folder with a SKILL.md file:

.agent/skills/
├── code-review/
│   ├── SKILL.md          # Main instructions (loaded when triggered)
│   ├── examples.md       # Usage examples (loaded as needed)
│   └── scripts/
│       └── lint.py       # Utility script (executed, not loaded)
├── database/
│   └── SKILL.md
└── git-operations/
    └── SKILL.md

Anatomy of a SKILL.md

---
name: code-review
description: Guidelines for reviewing pull requests
---

# Code Review Skill

## When to Apply
- User asks to review code, a PR, or a diff.

## Key Guidelines
1. Check for security vulnerabilities first.
2. Verify error handling is present.
3. Look for performance anti-patterns.

## Output Format
- Use inline comments for specific issues.
- Summarize overall assessment at the end.

The frontmatter (name, description) helps the agent decide when to load the skill.
The body contains the actual procedural knowledge.
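
To make the metadata preload concrete, here is a minimal sketch of the startup scan, assuming the .agent/skills layout above. The parsing is deliberately naive (a real loader would use a YAML parser), and this is not Claude’s internal implementation:

```python
# Minimal sketch: pre-load only frontmatter metadata at startup; skill bodies stay on disk.
from pathlib import Path

def load_skill_metadata(skills_dir: str = ".agent/skills") -> list[dict]:
    """Return the name/description of every skill without loading its body."""
    catalog = []
    for skill_file in Path(skills_dir).glob("*/SKILL.md"):
        text = skill_file.read_text(encoding="utf-8")
        meta = {"path": str(skill_file)}
        if text.startswith("---"):
            frontmatter = text.split("---", 2)[1]          # text between the two --- fences
            for line in frontmatter.strip().splitlines():
                if ":" in line:
                    key, value = line.split(":", 1)
                    meta[key.strip()] = value.strip()
        catalog.append(meta)                               # body is never read here
    return catalog
```

Only this catalog needs to sit in context at startup; the full instructions join it later, one skill at a time.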

Anthropic’s Progressive Disclosure Patterns

Anthropic defines three patterns for organizing skill content. Each pattern controls what gets loaded into context and when:

Pattern 1: High-Level Guide with References

Keep SKILL.md focused. Point to detailed files only when needed:

---
name: pdf-processing
description: Extracts text and tables from PDF files, fills forms,
  and merges documents. Use when working with PDFs.
---

# PDF Processing

## Quick start
Extract text with pdfplumber:
...

## Advanced features
**Form filling**: See [FORMS.md](FORMS.md)
**API reference**: See [REFERENCE.md](REFERENCE.md)
**Examples**: See [EXAMPLES.md](EXAMPLES.md)

Claude loads FORMS.md, REFERENCE.md, or EXAMPLES.md only when needed — zero context penalty until accessed.

Pattern 2: Domain-Specific Organization

For skills with multiple domains, organize content by domain to avoid loading irrelevant context:

bigquery-skill/
├── SKILL.md              (overview and navigation)
└── reference/
    ├── finance.md        (revenue, billing metrics)
    ├── sales.md          (opportunities, pipeline)
    ├── product.md        (API usage, features)
    └── marketing.md      (campaigns, attribution)

When the user asks about revenue, Claude reads SKILL.md, sees the reference to reference/finance.md, and reads just that file. The sales.md and product.md files remain on the filesystem, consuming zero context tokens.

Pattern 3: Conditional Details

Show basic content; link to advanced content only when the user’s task requires it:

# DOCX Processing

## Creating documents
Use docx-js for new documents. See [DOCX-JS.md](DOCX-JS.md).

## Editing documents
For simple edits, modify the XML directly.
**For tracked changes**: See [REDLINING.md](REDLINING.md)
**For OOXML details**: See [OOXML.md](OOXML.md)

Claude reads REDLINING.md or OOXML.md only when the user needs those features.

The Rules That Make It Work

Anthropic codifies two critical rules:

| Rule | Why It Matters |
| --- | --- |
| 500-Line Limit | Keep SKILL.md body under 500 lines for optimal performance. Split into separate files when approaching this limit. |
| One-Level Deep | All reference files should link directly from SKILL.md. Avoid deeply nested references — Claude may only partially read files that are referenced from other referenced files. |
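
Both rules lend themselves to an automated pre-merge check. A minimal sketch, assuming the folder layout shown earlier; the threshold and the Markdown-link regex are illustrative, not an official Anthropic tool:

```python
# Hypothetical pre-merge check for the 500-line and one-level-deep rules.
import re
from pathlib import Path

MAX_BODY_LINES = 500
LINK_PATTERN = re.compile(r"\[[^\]]+\]\(([^)]+\.md)\)")   # Markdown links to .md files

def check_skill(skill_dir: Path) -> list[str]:
    warnings = []
    skill_md = skill_dir / "SKILL.md"
    body = skill_md.read_text(encoding="utf-8")

    if len(body.splitlines()) > MAX_BODY_LINES:
        warnings.append(f"{skill_md}: body exceeds {MAX_BODY_LINES} lines, split it")

    # One level deep: files referenced from SKILL.md should not reference further files.
    for ref in LINK_PATTERN.findall(body):
        ref_path = skill_dir / ref
        if ref_path.exists() and LINK_PATTERN.findall(ref_path.read_text(encoding="utf-8")):
            warnings.append(f"{ref}: contains nested references that may be only partially read")
    return warnings
```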

The Loading Pattern

  1. User makes a request → “Review this pull request.”
  2. Agent checks available skills → Sees code-review matches.
  3. Agent loads the skill → SKILL.md content joins the context.
  4. Agent executes with specialized knowledge → Review follows guidelines.

The key: Most skills stay unloaded most of the time.
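
Wired together, the loop looks roughly like the sketch below. It reuses the hypothetical metadata catalog from the preload sketch and a naive keyword matcher; in the real system Claude itself judges relevance from the pre-loaded descriptions, so treat this purely as an illustration of what enters the context, and when:

```python
# Illustrative only: keyword matching stands in for the model's own relevance judgment.
from pathlib import Path

def match_skill(request: str, catalog: list[dict]) -> dict | None:
    """Pick the skill whose name/description overlaps most with the request."""
    request_words = set(request.lower().split())
    best, best_overlap = None, 0
    for meta in catalog:
        skill_words = set((meta.get("name", "") + " " + meta.get("description", "")).lower().split())
        overlap = len(request_words & skill_words)
        if overlap > best_overlap:
            best, best_overlap = meta, overlap
    return best

def build_context(system_prompt: str, request: str, catalog: list[dict]) -> str:
    skill = match_skill(request, catalog)                              # 2. check available skills
    parts = [system_prompt]
    if skill:
        parts.append(Path(skill["path"]).read_text(encoding="utf-8"))  # 3. load SKILL.md body
    parts.append(request)
    return "\n\n".join(parts)                                          # 4. execute with the skill in context
```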


When to Use It

The Skill Threshold

Litmus Test: If knowledge is needed sometimes but not always, it’s a Skill.

| Context Type | When Needed | Where It Goes |
| --- | --- | --- |
| Core identity, values | Always | 🎭 Persona (System Prompt) |
| Procedures, workflows | Sometimes | 📚 Skills (On-demand) |
| Facts, documents | Per-query | 📖 RAG (Retrieved) |
| Live system state | Real-time | 🔌 MCP (Connected) |

Real-World Examples

| Scenario | ❌ Blob Approach | ✅ Skills Approach |
| --- | --- | --- |
| Multi-language support | 5,000 tokens of Python + TypeScript + Go syntax in every request | Load only the language skill matching the current file |
| Database operations | All SQL dialects in the prompt | Load postgres.md or mysql.md based on detected connection |
| Code review | Review guidelines always present | Load code-review skill only when reviewing |

The Token Math

Consider an agent with 10 specialized capabilities:

  • Blob approach: 10 × 500 tokens = 5,000 tokens every request
  • Skills approach: 500 tokens base + 500 tokens loaded = 1,000 tokens average

Result: 80% token reduction. Faster. Cheaper. Sharper focus.
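
A quick back-of-the-envelope calculation shows how this compounds with volume. The request count and the $3-per-million-input-tokens price below are illustrative assumptions, not figures from the studies cited earlier:

```python
# Illustrative cost math: 10 skills x 500 tokens each, assumed $3 per 1M input tokens.
PRICE_PER_TOKEN = 3 / 1_000_000
REQUESTS_PER_MONTH = 1_000_000

blob_tokens = 10 * 500          # every skill rides along on every request
skills_tokens = 500 + 500       # persona base + the one skill actually loaded

blob_cost = blob_tokens * PRICE_PER_TOKEN * REQUESTS_PER_MONTH      # $15,000 / month
skills_cost = skills_tokens * PRICE_PER_TOKEN * REQUESTS_PER_MONTH  # $3,000 / month
print(f"Monthly skill overhead: blob ${blob_cost:,.0f} vs skills ${skills_cost:,.0f}")
```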


Workflows & Feedback Loops

Skills aren’t just static instructions — they can encode multi-step workflows with built-in validation.

Checklist-Driven Workflows

For complex operations, Anthropic recommends providing a checklist the agent can copy and track progress against:

## Deployment Workflow
Copy this checklist and track your progress:
- [ ] Step 1: Run pre-flight checks
- [ ] Step 2: Build production bundle
- [ ] Step 3: Run integration tests
- [ ] Step 4: Deploy to staging
- [ ] Step 5: Verify health checks
- [ ] Step 6: Promote to production

Clear steps prevent the agent from skipping critical validation.

Feedback Loops

The most powerful skill pattern: Run → Validate → Fix → Repeat.

flowchart LR
    E["✍️ Execute"] --> V{"🧪 Validate"}
    V -->|Pass| D["✅ Done"]
    V -->|Fail| F["🔧 Fix"]
    F --> E

This pattern greatly improves output quality. The “validator” can be a script (python validate.py), a reference document (check against STYLE_GUIDE.md), or a built-in evaluation step.
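
A minimal harness for that loop might look like the sketch below, assuming a hypothetical validate.py that exits non-zero on failure (the skill itself would simply instruct the agent to run the validator and fix what it reports):

```python
# Hypothetical Run -> Validate -> Fix -> Repeat harness around an assumed validate.py script.
import subprocess

def run_with_feedback(execute_step, max_attempts: int = 3) -> bool:
    feedback = None
    for _ in range(max_attempts):
        execute_step(feedback)                                    # Execute (or Fix, using the last errors)
        result = subprocess.run(["python", "validate.py"],
                                capture_output=True, text=True)   # Validate
        if result.returncode == 0:
            return True                                           # Done
        feedback = result.stdout + result.stderr                  # feed the failures into the next attempt
    return False
```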

Conditional Workflows

Guide the agent through decision points within a single skill:

## Document Modification
1. Determine the modification type:
   **Creating new content?** → Follow "Creation workflow" below
   **Editing existing content?** → Follow "Editing workflow" below

2. Creation workflow: Build document from scratch → Export
3. Editing workflow: Unpack → Modify XML → Validate → Repack

Evaluation-Driven Development

Anthropic recommends building evaluations before writing extensive documentation:

The 5-Step Process

flowchart TD
    G["1. 🔍 Identify Gaps"] --> E["2. 📝 Create 3 Scenarios"]
    E --> B["3. 📊 Establish Baseline"]
    B --> W["4. ✍️ Write Minimal Instructions"]
    W --> I["5. 🔄 Iterate Until Pass"]
    I -.->|"Refine"| W

  1. Identify gaps: Run Claude on representative tasks without a skill. Document specific failures.
  2. Create evaluations: Build three scenarios that test these gaps.
  3. Establish baseline: Measure Claude’s performance without the skill.
  4. Write minimal instructions: Create just enough content to address the gaps.
  5. Iterate: Execute evaluations, compare against baseline, and refine.

💡 The Key: This ensures you’re solving actual problems rather than documenting imagined ones. Evaluations are your source of truth for measuring skill effectiveness.
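
Pinning the three scenarios in code keeps the baseline honest. A minimal sketch, assuming a run_agent(prompt, skill=...) wrapper around your model call and simple substring checks as pass criteria (both are placeholders for whatever evaluation you actually use):

```python
# Hypothetical evaluation harness for steps 2-5; run_agent() is an assumed wrapper, checks are illustrative.
SCENARIOS = [
    {"prompt": "Extract all tables from quarterly_report.pdf", "expect": "pdfplumber"},
    {"prompt": "Fill the customer name field in intake_form.pdf", "expect": "form"},
    {"prompt": "Merge invoice_1.pdf and invoice_2.pdf into one file", "expect": "merge"},
]

def evaluate(run_agent, skill_path=None) -> float:
    passed = sum(
        case["expect"].lower() in run_agent(case["prompt"], skill=skill_path).lower()
        for case in SCENARIOS
    )
    return passed / len(SCENARIOS)

# baseline = evaluate(run_agent)                                # step 3: no skill loaded
# with_skill = evaluate(run_agent, "pdf-processing/SKILL.md")   # step 5: compare, then refine
```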

Enterprise Testing Tiers

For production deployments, Anthropic’s enterprise guidance defines three testing tiers:

| Tier | What It Tests | When |
| --- | --- | --- |
| Triggering | Does the skill load only when appropriate? | Before merge |
| Execution | Does Claude follow the defined logic correctly? | Before deploy |
| Output Quality | Is the final artifact consistent across runs? | Ongoing |

Plus coexistence testing: Verify the new skill doesn’t cause regressions across your existing active skill set.
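
Triggering is the easiest tier to automate: pair prompts that should load a skill with prompts that should not, and check what actually got loaded. A minimal sketch, assuming a loaded_skills(prompt) helper that reports which skills the agent pulled in for a given prompt:

```python
# Hypothetical triggering test: the code-review skill should load only when appropriate.
TRIGGER_CASES = [
    ("Review this pull request for security issues", True),    # should load code-review
    ("Summarize yesterday's standup notes", False),             # should NOT load code-review
]

def test_triggering(loaded_skills) -> bool:
    ok = True
    for prompt, should_load in TRIGGER_CASES:
        loaded = "code-review" in loaded_skills(prompt)
        if loaded != should_load:
            print(f"FAIL: {prompt!r} -> loaded={loaded}, expected={should_load}")
            ok = False
    return ok
```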


Advanced: Skills with Executable Code

Some skills go beyond instructions — they bundle utility scripts that Claude can execute directly.

Why Scripts Beat Generated Code

| Benefit | Explanation |
| --- | --- |
| More reliable | Pre-tested vs. generated on-the-fly |
| Save tokens | No need to include code in context |
| Save time | No code generation step required |
| Consistent | Same execution across every use |

The Execute vs. Read Distinction

Make clear in your instructions whether Claude should:

  • Execute the script (most common): “Run analyze_form.py to extract fields”
  • Read it as reference (for complex logic): “See analyze_form.py for the extraction algorithm”

For most utility scripts, execution is preferred — it’s more reliable and efficient. The script’s output consumes tokens, but the script itself does not.
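
The context accounting is the point. In the sketch below (the script path builds on the hypothetical analyze_form.py mentioned above), executing the script puts only its output into context, while reading it would put the entire source there:

```python
# Illustrative context accounting for a bundled script (path is hypothetical).
import subprocess
from pathlib import Path

script = Path(".agent/skills/pdf-processing/scripts/analyze_form.py")

# Execute (preferred): only the command and its stdout reach the context window.
result = subprocess.run(["python", str(script), "intake_form.pdf"],
                        capture_output=True, text=True)
tokens_added = len(result.stdout.split())        # rough proxy for tokens the output costs

# Read as reference (rare): the whole source would reach the context window instead.
# tokens_added = len(script.read_text(encoding="utf-8").split())
```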

MCP Tool References

If your skill uses MCP tools, always use fully qualified tool names to avoid “tool not found” errors:

Use the BigQuery:bigquery_schema tool to retrieve table schemas.
Use the GitHub:create_issue tool to create issues.

Format: ServerName:tool_name — Without the server prefix, Claude may fail to locate the tool.


📦 Case Study: The 92-Skill Library

In Vibe Product Design, we moved 90% of our prompt logic into a library of 92 specialized skills. This allows the same agent to design a PostgreSQL database, a Next.js frontend, or an AWS Lambda architecture without changing its system prompt.

Here is an actual skill file from the production system: skills/databases/SKILL.md.

---
name: database-design-postgresql
description: Principles for designing scalable PostgreSQL schemas
triggers:
  - "design a database"
  - "create schema"
  - "postgres"
---

# PostgreSQL Design Skill

## When to Apply
- The user requires a relational database design.
- The constraint is "scalability" or "data integrity".

## <instructions>
1.  **Use UUIDs for Primary Keys**:
    - Always use `uuid_generate_v4()` or equivalent UUID-generation logic for IDs.
    - Avoid sequential integers for public-facing resources.

2.  **Indexing Strategy**:
    - Create B-tree indexes on foreign keys.
    - Use GIN indexes for JSONB columns.

3.  **JSONB Usage**:
    - Use JSONB for "schemaless" data (e.g., config, metadata).
    - KEEP core relationships relational (Users, Orders).
</instructions>

## <template>
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    metadata JSONB DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
</template>

Why This Works:

  1. Frontmatter Triggers: The orchestrator only loads this when the user mentions “database” or “schema”.
  2. Structured Instructions: XML tags like <instructions> help the LLM parse rules distinct from user data.
  3. Concrete Templates: Providing a valid SQL block prevents syntax hallucinations.

🔗 From 92 to 889+: The community has scaled the skill pattern far beyond individual projects. Antigravity Awesome Skills is a curated collection of 889+ battle-tested agentic skills for Claude Code, Antigravity IDE, Cursor, and more. It includes role-based bundles (Web Wizard, Security Engineer, OSS Maintainer) that demonstrate skills at ecosystem scale — and you can install them with a single npx command.


Industry Applications

Skills work the same way across domains—load procedural knowledge only when needed:

| Industry | Skills Examples | When Loaded |
| --- | --- | --- |
| 🏦 Banking | fraud-investigation.md, loan-underwriting.md, kyc-verification.md | When task type detected |
| 🛒 Retail | return-processing.md, price-matching.md, shipping-policy.md | When customer request matches |
| 🎓 Education | grading-rubric.md, lesson-planning.md, accessibility-guidelines.md | When teaching scenario identified |

The Pattern Repeats

🏦 Banking Agent: Customer asks about wire transfer → loads wire-transfer-procedure.md with compliance steps and limits. Doesn’t load mortgage skills.

🛒 Retail Agent: Customer wants to return an item → loads return-processing.md with policy rules. Doesn’t load inventory or shipping skills.

🎓 Education Agent: Student struggles with fractions → loads remedial-math.md skill with step-by-step teaching approach. Doesn’t load advanced calculus.

🔍 The Skills Trap

You extracted 50 skills from your monolith prompt. Your agent is now faster, cheaper… and calls the wrong skill 20% of the time.

Three ways skill architecture creates the problems it was meant to solve:

| The Trap | What You Built | What Goes Wrong | The Fix |
| --- | --- | --- | --- |
| 🧩 Over-granular skills | 92 tiny skills, each covering one narrow task | The skill-matching step becomes a classification problem harder than the original task. With 92 options, the agent picks the wrong skill 15-20% of the time. Wrong skill = wrong procedure = wrong output. The overhead of choosing outweighs the benefit of specializing. | Group related skills into skill bundles (5-10 per bundle). Match at the bundle level first, then the skill level. “Database skills” → then “PostgreSQL” vs “MongoDB.” Two-stage matching is more accurate than 92-way single-stage matching. |
| 🎯 Trigger misfires | Keyword-based triggers: “database” → loads database-skill.md | User says “database” in the context of “our database is slow” (an operations question), not “design a database” (a design question). The trigger loads the wrong skill. Keyword matching doesn’t understand intent, it matches words. | Use semantic triggers with intent classification, not keyword matching. Pre-classify queries into intent categories, then map intents to skills. “Our database is slow” → intent: troubleshooting → skill: database-troubleshooting.md, not database-design.md. |
| 📅 Skill staleness | Skills written once, never updated | Your deploy-to-aws.md skill still references the old CDK v1 syntax. The skill is technically loaded correctly, but the instructions are outdated. The agent follows stale procedures confidently. No one remembers to update skills when the underlying tooling changes. | Assign ownership and review dates to each skill. Add a last_reviewed field in frontmatter. If a skill hasn’t been reviewed in 90 days, flag it. Treat skills like code — they need maintenance, versioning, and testing. |

💡 The Meta-Principle: Skills trade one complexity (prompt bloat) for another (skill management). The overhead is worth it at scale, but only if you invest in matching accuracy, intent understanding, and maintenance. A stale skill library is worse than a fat prompt — at least the fat prompt is current.
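
The staleness trap in particular is cheap to automate. A minimal sketch, assuming skills adopt the last_reviewed frontmatter field suggested above; the date format and the 90-day window are illustrative choices:

```python
# Hypothetical staleness check for a last_reviewed frontmatter field.
import re
from datetime import date, timedelta
from pathlib import Path

REVIEW_WINDOW = timedelta(days=90)

def stale_skills(skills_dir: str = ".agent/skills") -> list[str]:
    flagged = []
    for skill_md in Path(skills_dir).glob("*/SKILL.md"):
        text = skill_md.read_text(encoding="utf-8")
        match = re.search(r"^last_reviewed:\s*(\d{4}-\d{2}-\d{2})", text, re.MULTILINE)
        if not match or date.fromisoformat(match.group(1)) < date.today() - REVIEW_WINDOW:
            flagged.append(str(skill_md))        # never reviewed, or review is overdue
    return flagged
```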


Key Takeaways

  • Skills = Procedural Memory: “How-to” knowledge loaded on demand, not pre-stuffed.
  • Context window is a public good: Every token in your skill competes with conversation history — be ruthlessly concise.
  • Folder structure > prompt engineering: Organize knowledge into files, not longer prompts.
  • Follow the 3 patterns: High-level guide with references, domain-specific organization, or conditional details.
  • Progressive disclosure reduces cost: Only pay for context you’re actually using.
  • Evaluate before you document: Build 3 test scenarios, establish a baseline, then write minimal skill instructions.
  • Feedback loops improve quality: Run → Validate → Fix → Repeat is the most powerful skill pattern.
  • Scalability by design: Add 100 skills without bloating every request.

What’s Next


References

  1. Google Cloud Research, Context Engineering: Sessions & Memory (2025). Defines Procedural Memory as “How-to” knowledge distinct from Semantic Memory (facts).

  2. Anthropic, CLAUDE.md Pattern. The inspiration for skill-based context organization.

  3. Galileo, The “Lost in the Middle” Phenomenon. Research on context degradation in long prompts.

  4. Anthropic, Skill Authoring Best Practices (2025). Official guidelines for SKILL.md structure, progressive disclosure patterns, workflows, and degrees of freedom.

  5. Anthropic, Skills for Enterprise (2025). Governance, security review, 6-stage lifecycle management, recall limits, and organizing skills at scale.

  6. Anthropic, The Complete Guide to Building Skills for Claude (2025). Comprehensive PDF guide covering skill structure, testing tiers, and production deployment patterns.

  7. Antigravity Awesome Skills, 889+ Agentic Skills for AI Coding Assistants (2025). Community-curated, battle-tested skill library demonstrating progressive disclosure at ecosystem scale.

❓ Frequently Asked Questions

What is progressive context disclosure for AI agents?

Loading only the knowledge an agent needs for the current task, rather than stuffing everything into the system prompt. This reduces costs, improves focus, and scales efficiently.

What is the SKILL.md pattern?

A structured markdown file that defines procedural knowledge with YAML frontmatter (name, description, triggers) and step-by-step instructions. Skills are loaded on-demand based on task context.

When should I use Skills vs RAG?

Use Skills for stable, procedural HOW-TO knowledge (e.g., deployment steps). Use RAG for dynamic, factual WHAT knowledge that changes frequently (e.g., product catalogs).

How do I evaluate the quality of my agent skills?

Build evaluations BEFORE writing extensive documentation. Identify gaps by running the agent without a skill, create 3 test scenarios, establish a baseline, then write minimal instructions and iterate. Anthropic recommends three testing tiers - triggering accuracy, execution correctness, and output quality.

💬 Join the Discussion

Got questions, feedback, or want to share your experience building AI agents? Join our community of architects and engineers.