Skills: Progressive Context Disclosure
Your system prompt is 5,000 tokens and growing. Every new feature makes your agent slower, more expensive, and dumber. There’s a better way.
📑 In This Article:
- The Problem
- The Concept
- How It Works
- When to Use It
- Workflows & Feedback Loops
- Evaluation-Driven Development
- Advanced: Skills with Executable Code
- 📦 Case Study: The 92-Skill Library
- Industry Applications
- Key Takeaways
- References
The Problem
You start with a simple agent. A few rules. A persona. It works.
Then requirements grow:
- “Add database query syntax.”
- “Add our coding standards.”
- “Add the API documentation.”
- “Add error handling patterns.”
Before you know it, you’ve created The Prompt Blob Monster—a 10,000-token system prompt that tries to do everything.
Imagine a football manager who insists on playing all 20 squad members simultaneously. Ronaldo, Rooney, Scholes, Ferdinand, Vidić, Giggs, Tevez—everyone on the pitch at once. It sounds powerful. In reality? Chaos. Players trip over each other. No one knows their position. The team is slower, not faster.
That’s what a bloated system prompt does to your agent.
| The Enterprise Risk | What Happens |
|---|---|
| 💸 Cost Explosion | Every request pays for tokens the agent doesn’t need right now. |
| 🐢 Latency Creep | More tokens = slower time to first token, especially at scale. |
| 🧠 Context Rot | Research shows LLMs lose reasoning quality in the “middle” of long contexts. |
| 🎯 Instruction Fog | Too many rules = the model forgets which ones matter now. |
The villain isn’t the LLM. It’s the architecture.
📊 The Reality Check:
| What the Industry Shows | Why It Matters | Source |
|---|---|---|
| Simpler chunking (512 tokens) outperformed complex AI-driven methods in accuracy. | Counterintuitively, loading less context often produces better results — supporting the case for strategic, selective disclosure. | FloTorch RAG Benchmark, Feb 2026 |
| Average enterprise LLM spend hit $7M/year in 2025, projected to reach $11.6M in 2026. | As token costs scale into millions per year, progressive disclosure becomes one of the most direct levers for cost control. | a16z Enterprise AI Report, 2025 |
| 70% of AI engineers have RAG in production or plan to deploy within 12 months. | With RAG becoming standard infrastructure, how you manage context loading — not whether you do it — increasingly determines quality. | 🔒 AI Engineering Survey, 2025 · community survey |
The Concept
Skills are procedural memory—loaded on demand.
Instead of stuffing everything into one system prompt, you organize knowledge into discrete Skill files. The agent loads only what it needs, when it needs it.
💡 The Key Insight: Google’s Context Engineering guide defines this as Procedural Memory—“How-to” knowledge that’s retrieved just-in-time, not pre-loaded.
Think of it like a football manager’s bench.
Sir Alex Ferguson didn’t start every match with Ronaldo, Rooney, Tevez, Berbatov, and Scholes all on the pitch simultaneously. That’s chaos—too many creative players, no defensive structure. Instead, he read the game. Needed more pace against a tired defense? Bring on a fresh winger. Needed to hold a lead? Sub in a defensive midfielder.
The bench is your skill library. You don’t load every skill into context—just the one that matches the current situation. The 20-player squad exists, but only 11 play at a time.
This is Progressive Context Disclosure.
How It Works
The Context Window Is a Public Good
This is the key insight from Anthropic’s official Skills architecture: the context window is a public good. Your skill shares the context window with everything else — the system prompt, conversation history, other skills’ metadata, and the user’s actual request.
Not every token in your skill has an immediate cost. At startup, only the metadata (name and description) from all skills is pre-loaded. Claude reads SKILL.md only when the skill becomes relevant, and reads additional files only as needed. But being concise in SKILL.md still matters — once Claude loads it, every token competes with conversation history and other context.
💡 The Default Assumption: Claude is already very smart. Only add context Claude doesn’t already have. Challenge each piece of information: “Does Claude really need this explanation?” — “Can I assume Claude knows this?” — “Does this paragraph justify its token cost?”
The Skills Architecture
The SKILL.md Pattern
Each skill lives in its own folder with a SKILL.md file:
.agent/skills/
├── code-review/
│ ├── SKILL.md # Main instructions (loaded when triggered)
│ ├── examples.md # Usage examples (loaded as needed)
│ └── scripts/
│ └── lint.py # Utility script (executed, not loaded)
├── database/
│ └── SKILL.md
└── git-operations/
└── SKILL.md
Anatomy of a SKILL.md
---
name: code-review
description: Guidelines for reviewing pull requests
---
# Code Review Skill
## When to Apply
- User asks to review code, a PR, or a diff.
## Key Guidelines
1. Check for security vulnerabilities first.
2. Verify error handling is present.
3. Look for performance anti-patterns.
## Output Format
- Use inline comments for specific issues.
- Summarize overall assessment at the end.
The frontmatter (name, description) helps the agent decide when to load the skill.
The body contains the actual procedural knowledge.
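As a rough illustration of how an orchestrator might split frontmatter from the body, here is a minimal sketch using only the standard library. It assumes simple `key: value` frontmatter (a real implementation would use a YAML parser, which this naive version does not handle for multi-line values):

```python
import re

def parse_skill(text: str) -> tuple[dict, str]:
    """Split a SKILL.md file into (frontmatter dict, body)."""
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    if not match:
        return {}, text  # no frontmatter: treat the whole file as body
    meta = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, match.group(2)

skill = """---
name: code-review
description: Guidelines for reviewing pull requests
---
# Code Review Skill
"""
meta, body = parse_skill(skill)
# meta["name"] and meta["description"] are what gets pre-loaded at startup;
# body is loaded only when the skill triggers.
```

Only `meta` needs to live in context at startup; `body` stays on disk until the skill is relevant.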
Anthropic’s Progressive Disclosure Patterns
Anthropic defines three patterns for organizing skill content. Each pattern controls what gets loaded into context and when:
Pattern 1: High-Level Guide with References
Keep SKILL.md focused. Point to detailed files only when needed:
---
name: pdf-processing
description: Extracts text and tables from PDF files, fills forms,
and merges documents. Use when working with PDFs.
---
# PDF Processing
## Quick start
Extract text with pdfplumber:
...
## Advanced features
**Form filling**: See [FORMS.md](FORMS.md)
**API reference**: See [REFERENCE.md](REFERENCE.md)
**Examples**: See [EXAMPLES.md](EXAMPLES.md)
Claude loads FORMS.md, REFERENCE.md, or EXAMPLES.md only when needed — zero context penalty until accessed.
Pattern 2: Domain-Specific Organization
For skills with multiple domains, organize content by domain to avoid loading irrelevant context:
bigquery-skill/
├── SKILL.md (overview and navigation)
└── reference/
├── finance.md (revenue, billing metrics)
├── sales.md (opportunities, pipeline)
├── product.md (API usage, features)
└── marketing.md (campaigns, attribution)
When the user asks about revenue, Claude reads SKILL.md, sees the reference to reference/finance.md, and reads just that file. The sales.md and product.md files remain on the filesystem, consuming zero context tokens.
Pattern 3: Conditional Details
Show basic content; link to advanced content only when the user’s task requires it:
# DOCX Processing
## Creating documents
Use docx-js for new documents. See [DOCX-JS.md](DOCX-JS.md).
## Editing documents
For simple edits, modify the XML directly.
**For tracked changes**: See [REDLINING.md](REDLINING.md)
**For OOXML details**: See [OOXML.md](OOXML.md)
Claude reads REDLINING.md or OOXML.md only when the user needs those features.
The Rules That Make It Work
Anthropic codifies two critical rules:
| Rule | Why It Matters |
|---|---|
| 500-Line Limit | Keep SKILL.md body under 500 lines for optimal performance. Split into separate files when approaching this limit. |
| One-Level Deep | All reference files should link directly from SKILL.md. Avoid deeply nested references — Claude may only partially read files that are referenced from other referenced files. |
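Both rules are mechanically checkable. Here is a hypothetical lint sketch (the function name and link regex are illustrative, not from Anthropic's tooling) that flags a SKILL.md over 500 lines and any referenced file that itself links onward to another markdown file:

```python
import re
from pathlib import Path

MAX_LINES = 500
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#]+\.md)\)")

def lint_skill(skill_dir: Path) -> list[str]:
    """Flag violations of the 500-line and one-level-deep rules."""
    problems = []
    lines = (skill_dir / "SKILL.md").read_text().splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"SKILL.md is {len(lines)} lines (limit {MAX_LINES})")
    # Files referenced from SKILL.md must not reference further .md files.
    for line in lines:
        for ref in LINK_RE.findall(line):
            ref_path = skill_dir / ref
            if ref_path.exists() and LINK_RE.search(ref_path.read_text()):
                problems.append(f"{ref} links to another file (nested reference)")
    return problems
```

Running a check like this in CI keeps a growing skill library within the limits as it evolves.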
The Loading Pattern
1. User makes a request → “Review this pull request.”
2. Agent checks available skills → sees `code-review` matches.
3. Agent loads the skill → SKILL.md content joins the context.
4. Agent executes with specialized knowledge → the review follows the guidelines.
The key: Most skills stay unloaded most of the time.
When to Use It
The Skill Threshold
Litmus Test: If knowledge is needed sometimes but not always, it’s a Skill.
| Context Type | When Needed | Where It Goes |
|---|---|---|
| Core identity, values | Always | 🎭 Persona (System Prompt) |
| Procedures, workflows | Sometimes | 📚 Skills (On-demand) |
| Facts, documents | Per-query | 📖 RAG (Retrieved) |
| Live system state | Real-time | 🔌 MCP (Connected) |
Real-World Examples
| Scenario | ❌ Blob Approach | ✅ Skills Approach |
|---|---|---|
| Multi-language support | 5,000 tokens of Python + TypeScript + Go syntax in every request | Load only the language skill matching the current file |
| Database operations | All SQL dialects in the prompt | Load postgres.md or mysql.md based on detected connection |
| Code review | Review guidelines always present | Load code-review skill only when reviewing |
The Token Math
Consider an agent with 10 specialized capabilities:
- Blob approach: 10 × 500 tokens = 5,000 tokens every request
- Skills approach: 500 tokens base + 500 tokens loaded = 1,000 tokens average
Result: 80% token reduction. Faster. Cheaper. Sharper focus.
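The arithmetic above generalizes to any skill count and load rate. A quick sketch (parameter names are illustrative):

```python
def per_request_tokens(n_skills: int, tokens_per_skill: int,
                       base: int, avg_loaded: float) -> tuple[int, float]:
    """Compare per-request context cost: blob prompt vs on-demand skills."""
    blob = n_skills * tokens_per_skill          # everything, every request
    skills = base + avg_loaded * tokens_per_skill  # metadata + loaded skills
    return blob, skills

blob, skills = per_request_tokens(10, 500, base=500, avg_loaded=1.0)
saving = 1 - skills / blob  # fraction of tokens saved per request
```

With 10 skills of 500 tokens each and one skill loaded on average, `saving` works out to 0.8 — the 80% reduction cited above, and it grows as the library does.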
Workflows & Feedback Loops
Skills aren’t just static instructions — they can encode multi-step workflows with built-in validation.
Checklist-Driven Workflows
For complex operations, Anthropic recommends providing a checklist that the agent can copy and work through:
## Deployment Workflow
Copy this checklist and track your progress:
- [ ] Step 1: Run pre-flight checks
- [ ] Step 2: Build production bundle
- [ ] Step 3: Run integration tests
- [ ] Step 4: Deploy to staging
- [ ] Step 5: Verify health checks
- [ ] Step 6: Promote to production
Clear steps prevent the agent from skipping critical validation.
Feedback Loops
The most powerful skill pattern: Run → Validate → Fix → Repeat.
This pattern greatly improves output quality. The “validator” can be a script (python validate.py), a reference document (check against STYLE_GUIDE.md), or a built-in evaluation step.
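The loop's control flow is simple enough to sketch. This is a hypothetical harness, not an Anthropic API — `generate` stands in for an LLM call and `validate` for whatever validator your skill names (a script, a style-guide check, an eval step):

```python
from typing import Callable

def run_with_feedback(generate: Callable[[str], str],
                      validate: Callable[[str], list[str]],
                      task: str, max_rounds: int = 3) -> str:
    """Run -> Validate -> Fix -> Repeat until the validator passes."""
    prompt = task
    output = generate(prompt)
    for _ in range(max_rounds):
        errors = validate(output)
        if not errors:
            return output
        # Feed the validator's findings back so the next attempt can fix them.
        prompt = f"{task}\n\nFix these issues:\n" + "\n".join(errors)
        output = generate(prompt)
    return output  # best effort after max_rounds
```

The key design choice is that the validator's output becomes part of the next prompt, so each round is grounded in concrete, checkable errors rather than a vague "try again."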
Conditional Workflows
Guide the agent through decision points within a single skill:
## Document Modification
1. Determine the modification type:
**Creating new content?** → Follow "Creation workflow" below
**Editing existing content?** → Follow "Editing workflow" below
2. Creation workflow: Build document from scratch → Export
3. Editing workflow: Unpack → Modify XML → Validate → Repack
Evaluation-Driven Development
Anthropic recommends building evaluations before writing extensive documentation:
The 5-Step Process
- Identify gaps: Run Claude on representative tasks without a skill. Document specific failures.
- Create evaluations: Build three scenarios that test these gaps.
- Establish baseline: Measure Claude’s performance without the skill.
- Write minimal instructions: Create just enough content to address the gaps.
- Iterate: Execute evaluations, compare against baseline, and refine.
💡 The Key: This ensures you’re solving actual problems rather than documenting imagined ones. Evaluations are your source of truth for measuring skill effectiveness.
Enterprise Testing Tiers
For production deployments, Anthropic’s enterprise guidance defines three testing tiers:
| Tier | What It Tests | When |
|---|---|---|
| Triggering | Does the skill load only when appropriate? | Before merge |
| Execution | Does Claude follow the defined logic correctly? | Before deploy |
| Output Quality | Is the final artifact consistent across runs? | Ongoing |
Plus coexistence testing: Verify the new skill doesn’t cause regressions across your existing active skill set.
Advanced: Skills with Executable Code
Some skills go beyond instructions — they bundle utility scripts that Claude can execute directly.
Why Scripts Beat Generated Code
| Benefit | Explanation |
|---|---|
| More reliable | Pre-tested vs. generated on-the-fly |
| Save tokens | No need to include code in context |
| Save time | No code generation step required |
| Consistent | Same execution across every use |
The Execute vs. Read Distinction
Make clear in your instructions whether Claude should:
- Execute the script (most common): “Run `analyze_form.py` to extract fields.”
- Read it as reference (for complex logic): “See `analyze_form.py` for the extraction algorithm.”
For most utility scripts, execution is preferred — it’s more reliable and efficient. The script’s output consumes tokens, but the script itself does not.
MCP Tool References
If your skill uses MCP tools, always use fully qualified tool names to avoid “tool not found” errors:
Use the BigQuery:bigquery_schema tool to retrieve table schemas.
Use the GitHub:create_issue tool to create issues.
Format: ServerName:tool_name — Without the server prefix, Claude may fail to locate the tool.
📦 Case Study: The 92-Skill Library
In Vibe Product Design, we moved 90% of our prompt logic into a library of 92 specialized skills. This allows the same agent to design a PostgreSQL database, a Next.js frontend, or an AWS Lambda architecture without changing its system prompt.
Here is an actual skill file from the production system: skills/databases/SKILL.md.
---
name: database-design-postgresql
description: Principles for designing scalable PostgreSQL schemas
triggers:
- "design a database"
- "create schema"
- "postgres"
---
# PostgreSQL Design Skill
## When to Apply
- The user requires a relational database design.
- The constraint is "scalability" or "data integrity".
## <instructions>
1. **Use UUIDs for Primary Keys**:
- Always use `uuid_generate_v4()` or logic for IDs.
- Avoid sequential integers for public-facing resources.
2. **Indexing Strategy**:
- Create B-tree indexes on foreign keys.
- Use GIN indexes for JSONB columns.
3. **JSONB Usage**:
- Use JSONB for "schemaless" data (e.g., config, metadata).
- KEEP core relationships relational (Users, Orders).
</instructions>
## <template>
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email VARCHAR(255) UNIQUE NOT NULL,
metadata JSONB DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ DEFAULT NOW()
);
</template>
Why This Works:
- Frontmatter Triggers: The orchestrator only loads this when the user mentions “database” or “schema”.
- Structured Instructions: XML tags like `<instructions>` help the LLM parse rules distinct from user data.
- Concrete Templates: Providing a valid SQL block prevents syntax hallucinations.
🔗 From 92 to 889+: The community has scaled the skill pattern far beyond individual projects. Antigravity Awesome Skills is a curated collection of 889+ battle-tested agentic skills for Claude Code, Antigravity IDE, Cursor, and more. It includes role-based bundles (Web Wizard, Security Engineer, OSS Maintainer) that demonstrate skills at ecosystem scale — and you can install them with a single `npx` command.
Industry Applications
Skills work the same way across domains—load procedural knowledge only when needed:
| Industry | Skills Examples | When Loaded |
|---|---|---|
| 🏦 Banking | fraud-investigation.md, loan-underwriting.md, kyc-verification.md | When task type detected |
| 🛒 Retail | return-processing.md, price-matching.md, shipping-policy.md | When customer request matches |
| 🎓 Education | grading-rubric.md, lesson-planning.md, accessibility-guidelines.md | When teaching scenario identified |
The Pattern Repeats
🏦 Banking Agent: Customer asks about wire transfer → loads wire-transfer-procedure.md with compliance steps and limits. Doesn’t load mortgage skills.
🛒 Retail Agent: Customer wants to return an item → loads return-processing.md with policy rules. Doesn’t load inventory or shipping skills.
🎓 Education Agent: Student struggles with fractions → loads remedial-math.md skill with step-by-step teaching approach. Doesn’t load advanced calculus.
🔍 The Skills Trap
You extracted 50 skills from your monolith prompt. Your agent is now faster, cheaper… and calls the wrong skill 20% of the time.
Three ways skill architecture creates the problems it was meant to solve:
| The Trap | What You Built | What Goes Wrong | The Fix |
|---|---|---|---|
| 🧩 Over-granular skills | 92 tiny skills, each covering one narrow task | The skill-matching step becomes a classification problem harder than the original task. With 92 options, the agent picks the wrong skill 15-20% of the time. Wrong skill = wrong procedure = wrong output. The overhead of choosing outweighs the benefit of specializing. | Group related skills into skill bundles (5-10 per bundle). Match at the bundle level first, then the skill level. “Database skills” → then “PostgreSQL” vs “MongoDB.” Two-stage matching is more accurate than 92-way single-stage matching. |
| 🎯 Trigger misfires | Keyword-based triggers: “database” → loads database-skill.md | User says “database” in the context of “our database is slow” (an operations question), not “design a database” (a design question). The trigger loads the wrong skill. Keyword matching doesn’t understand intent, it matches words. | Use semantic triggers with intent classification, not keyword matching. Pre-classify queries into intent categories, then map intents to skills. “Our database is slow” → intent: troubleshooting → skill: database-troubleshooting.md, not database-design.md. |
| 📅 Skill staleness | Skills written once, never updated | Your deploy-to-aws.md skill still references the old CDK v1 syntax. The skill is technically loaded correctly, but the instructions are outdated. The agent follows stale procedures confidently. No one remembers to update skills when the underlying tooling changes. | Assign ownership and review dates to each skill. Add a last_reviewed field in frontmatter. If a skill hasn’t been reviewed in 90 days, flag it. Treat skills like code — they need maintenance, versioning, and testing. |
💡 The Meta-Principle: Skills trade one complexity (prompt bloat) for another (skill management). The overhead is worth it at scale, but only if you invest in matching accuracy, intent understanding, and maintenance. A stale skill library is worse than a fat prompt — at least the fat prompt is current.
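The staleness fix above is easy to automate. Here is a sketch, assuming a `last_reviewed: YYYY-MM-DD` field in each skill's frontmatter as the table suggests (field name and 90-day window are the article's convention, not a standard):

```python
import datetime
import re
from pathlib import Path

STALE_AFTER = datetime.timedelta(days=90)

def stale_skills(root: Path, today: datetime.date) -> list[str]:
    """Flag skills whose last_reviewed date is missing or older than 90 days."""
    flagged = []
    for skill_md in root.glob("**/SKILL.md"):
        m = re.search(r"^last_reviewed:\s*(\d{4}-\d{2}-\d{2})",
                      skill_md.read_text(), re.MULTILINE)
        if not m or today - datetime.date.fromisoformat(m.group(1)) > STALE_AFTER:
            flagged.append(skill_md.parent.name)
    return flagged
```

Run it on a schedule and route flagged skills to their owners, the same way you would triage failing tests.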
Key Takeaways
- ✅ Skills = Procedural Memory: “How-to” knowledge loaded on demand, not pre-stuffed.
- ✅ Context window is a public good: Every token in your skill competes with conversation history — be ruthlessly concise.
- ✅ Folder structure > prompt engineering: Organize knowledge into files, not longer prompts.
- ✅ Follow the 3 patterns: High-level guide with references, domain-specific organization, or conditional details.
- ✅ Progressive disclosure reduces cost: Only pay for context you’re actually using.
- ✅ Evaluate before you document: Build 3 test scenarios, establish a baseline, then write minimal skill instructions.
- ✅ Feedback loops improve quality: Run → Validate → Fix → Repeat is the most powerful skill pattern.
- ✅ Scalability by design: Add 100 skills without bloating every request.
What’s Next
- 📖 Previous article: The 4 Pillars: Persona, Skills, RAG, MCP — The decision framework for agent context.
- 📖 Next article: Context Engineering: Sessions & Memory — Managing short-term sessions and long-term memory.
- 💬 Discuss: How are you organizing procedural knowledge in your agents?
References
- Google Cloud Research — Context Engineering: Sessions & Memory (2025). Defines Procedural Memory as “How-to” knowledge distinct from Semantic Memory (facts).
- Anthropic — CLAUDE.md Pattern. The inspiration for skill-based context organization.
- Galileo — The “Lost in the Middle” Phenomenon. Research on context degradation in long prompts.
- Anthropic — Skill Authoring Best Practices (2025). Official guidelines for SKILL.md structure, progressive disclosure patterns, workflows, and degrees of freedom.
- Anthropic — Skills for Enterprise (2025). Governance, security review, 6-stage lifecycle management, recall limits, and organizing skills at scale.
- Anthropic — The Complete Guide to Building Skills for Claude (2025). Comprehensive PDF guide covering skill structure, testing tiers, and production deployment patterns.
- Antigravity Awesome Skills — 889+ Agentic Skills for AI Coding Assistants (2025). Community-curated, battle-tested skill library demonstrating progressive disclosure at ecosystem scale.
❓ Frequently Asked Questions
What is progressive context disclosure for AI agents?
Loading only the knowledge an agent needs for the current task, rather than stuffing everything into the system prompt. This reduces costs, improves focus, and scales efficiently.
What is the SKILL.md pattern?
A structured markdown file that defines procedural knowledge with YAML frontmatter (name, description, triggers) and step-by-step instructions. Skills are loaded on-demand based on task context.
When should I use Skills vs RAG?
Use Skills for stable, procedural HOW-TO knowledge (e.g., deployment steps). Use RAG for dynamic, factual WHAT knowledge that changes frequently (e.g., product catalogs).
How do I evaluate the quality of my agent skills?
Build evaluations BEFORE writing extensive documentation. Identify gaps by running the agent without a skill, create 3 test scenarios, establish a baseline, then write minimal instructions and iterate. Anthropic recommends three testing tiers: triggering accuracy, execution correctness, and output quality.
💬 Join the Discussion
Got questions, feedback, or want to share your experience building AI agents? Join our community of architects and engineers.