# 28 - Skills
Skills are instruction files that teach agents how to perform specific tasks. They don't call APIs, don't maintain state, don't run as processes. They're folders of markdown, scripts, and reference material that the agent loads on-demand when relevant to the current task.
If MCP is a USB port (connecting to external tools), skills are the employee handbook (teaching the agent your specific procedures).
This is the open specification: https://agentskills.io/specification
As of early 2026, the Agent Skills format is adopted by Claude Code, OpenAI Codex, GitHub Copilot, VS Code, Cursor, Gemini CLI, Amp, Goose, and 20+ other platforms. Write once, use everywhere.
## Why Skills Exist
Without skills, you repeat yourself. Every conversation, you re-explain:
- "Use pdfplumber, not PyPDF2"
- "Our API uses snake_case, not camelCase"
- "Always run the linter after editing"
- "The test database is on port 5433, not 5432"
Skills package these instructions into a reusable, version-controlled file that the agent loads automatically when the task matches. You write the instruction once. Every conversation benefits.
## Skills vs Prompts vs MCP

| | System Prompt | Skill | MCP |
|---|---|---|---|
| Scope | This conversation | Any matching conversation | Any matching conversation |
| What it provides | Context, personality, constraints | Procedures, workflows, domain knowledge | Tool execution, data access |
| Token cost | Loaded every turn | ~100 tokens discovery, ~5K on activation | 550-1,400 tokens per tool, every turn |
| Persistence | Per-conversation | Permanent, version-controlled | Permanent (server process) |
| Statefulness | No | No | Yes (can maintain connections, sessions) |
Use system prompts for conversation-level instructions. Use skills for reusable procedures. Use MCP for external capabilities the agent can't achieve with instructions alone.
## Skills vs Project Instructions (CLAUDE.md, .cursorrules, etc.)
Most coding agents support project-level instruction files: CLAUDE.md for Claude Code, .cursorrules for Cursor, .github/copilot-instructions.md for Copilot. These are loaded every conversation in that project.
Skills are different: they load on-demand when the task matches. The distinction:
- Project instructions → "always true" facts about this project (coding conventions, test commands, architecture decisions)
- Skills → reusable procedures for task types that work across projects (PDF processing, data pipeline, thesis grading)
If you find yourself putting step-by-step workflows into CLAUDE.md, extract them into skills instead. Keep CLAUDE.md for project context; use skills for procedures.
## Level 1: The Simplest Possible Skill

A skill is a directory containing one file: SKILL.md.

```
my-first-skill/
└── SKILL.md
```
SKILL.md has two parts: YAML frontmatter (metadata) and markdown body (instructions).
````markdown
---
name: my-first-skill
description: Formats code output with line numbers and file paths. Use when generating code files or showing code to the user.
---

When showing code to the user, always include:

1. The full file path as a comment at the top
2. Line numbers for any code block longer than 10 lines

Example:

```python
# src/utils/helpers.py
def calculate_total(items):
    return sum(item.price for item in items)
```
````
That's it. The agent reads the frontmatter at startup (~100 tokens), decides whether this skill is relevant to the current task, and if so, loads the full body.
### Frontmatter Fields

| Field | Required | What it does |
|---|---|---|
| `name` | Yes | Identifier. Max 64 chars. Lowercase, hyphens only. Must match directory name. |
| `description` | Yes | Tells the agent when to use this skill. Max 1024 chars. This is the trigger. |
| `license` | No | License name or file reference. |
| `compatibility` | No | Environment requirements. Max 500 chars. |
| `metadata` | No | Arbitrary key-value pairs (author, version, tags). |
| `allowed-tools` | No | Pre-approved tools the skill may use. Experimental. |
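To make the two-part layout concrete, here is a minimal sketch of reading these frontmatter fields from a SKILL.md string. It handles only flat `key: value` pairs — folded scalars like `description: >-` need a real YAML parser — and the `read_frontmatter` helper is illustrative, not part of the spec.

```python
import re

def read_frontmatter(skill_md: str) -> dict:
    """Extract flat key: value fields from SKILL.md YAML frontmatter.

    Sketch only: does not handle folded (>-) blocks or nested mappings;
    production tooling should use a real YAML parser.
    """
    match = re.match(r"^---\n(.*?)\n---\n", skill_md, re.DOTALL)
    if not match:
        raise ValueError("SKILL.md must start with a YAML frontmatter block")
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

doc = """---
name: my-first-skill
description: Formats code output with line numbers.
---
Body instructions here.
"""
print(read_frontmatter(doc)["name"])  # my-first-skill
```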
### The `name` Field Rules

```yaml
name: pdf-processing      # Valid
name: code-review         # Valid
name: data-analysis       # Valid
name: PDF-Processing      # INVALID: uppercase
name: -pdf                # INVALID: starts with hyphen
name: pdf--processing     # INVALID: consecutive hyphens
name: my cool skill       # INVALID: spaces
```

Must match the parent directory name exactly.
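These rules compress into one regular expression. A sketch — it assumes digits are also allowed inside names, which is worth verifying against the spec:

```python
import re

# Lowercase runs separated by single hyphens: no leading/trailing
# or consecutive hyphens, no uppercase, no spaces.
NAME_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_valid_skill_name(name: str) -> bool:
    return len(name) <= 64 and bool(NAME_RE.fullmatch(name))

for candidate in ["pdf-processing", "PDF-Processing", "-pdf",
                  "pdf--processing", "my cool skill"]:
    print(f"{candidate!r}: {is_valid_skill_name(candidate)}")
```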
### The `description` Field Is Everything
The description determines whether the skill activates. The agent scans all available skill descriptions at startup (~100 tokens each) and loads the full instructions only for relevant ones.
Bad — will never trigger:
```yaml
description: Helps with PDFs.
```
Good — clear trigger conditions:
```yaml
description: Extract text and tables from PDF files, fill PDF forms, and merge multiple PDFs. Use when working with PDF documents or when the user mentions PDFs, forms, or document extraction.
```
Aggressive — recommended by Anthropic because agents undertrigger:
```yaml
description: >-
  Apply TalTech thesis grading criteria to academic documents.
  Use this skill whenever the user mentions thesis, grading,
  academic evaluation, defense, rubric, or assessment criteria,
  even if they don't explicitly ask for grading.
```
Include:
- What the skill does
- When to use it (positive triggers)
- Keywords the user might say
- Optionally: when NOT to use it (if similar skills exist)
## Level 2: Adding Reference Material

When SKILL.md needs to stay under 500 lines but the domain knowledge is larger, split into referenced files.

```
thesis-grader/
├── SKILL.md
└── references/
    ├── grading-matrix.md
    └── bloom-taxonomy.md
```
SKILL.md:
```markdown
---
name: thesis-grader
description: >-
  Evaluate master's thesis documents against TalTech grading criteria.
  Use when reviewing thesis structure, methodology, or academic quality.
---

## Workflow

1. Read the thesis document
2. Load the grading matrix: see [grading-matrix.md](references/grading-matrix.md)
3. Evaluate each criterion on a 0-5 scale
4. If assessing learning outcomes, consult [bloom-taxonomy.md](references/bloom-taxonomy.md)
5. Produce a structured evaluation report

## Output Format

# Thesis Evaluation: [Title]

| Criterion | Score (0-5) | Justification |
|---|---|---|
| Research question clarity | X | ... |
| Methodology | X | ... |
| Literature review | X | ... |
| Results and analysis | X | ... |
| Writing quality | X | ... |

**Overall recommendation:** [Pass / Revise / Fail]

## Gotchas

- Level 0 (fail) means the criterion was not addressed at all
- Level 5+ (publishable quality) is a separate category, not just "really good level 5"
- Check citation format consistency — mixed APA/IEEE is an automatic deduction
```
The key: tell the agent when to load each reference file. "Load grading-matrix.md" is better than "see references/ for details." The agent may not recognize when it needs a file if you don't specify the trigger.
### Progressive Disclosure in Action
This is why skills are cheap: a workspace with 20 installed skills costs only ~2,000 tokens at startup. Only the relevant skill loads its full instructions. Only specific references load when needed. Compare to MCP where 20 tools cost 11,000-28,000 tokens every single turn.
## Level 3: Adding Scripts

Scripts make skills deterministic where it matters. Instead of hoping the agent writes correct parsing code, you provide a tested script.

```
pdf-processor/
├── SKILL.md
├── scripts/
│   ├── analyze_form.py
│   ├── validate_fields.py
│   └── fill_form.py
└── references/
    └── REFERENCE.md
```
SKILL.md:
````markdown
---
name: pdf-processor
description: >-
  Extract text and tables from PDF files, fill PDF forms, merge documents.
  Use when working with PDF files or when the user mentions PDFs, forms,
  or document extraction.
---

## Text Extraction

Use pdfplumber for text extraction. For scanned documents, fall back to
pdf2image with pytesseract.

## Form Filling Workflow

1. Analyze the form:

   ```bash
   uv run scripts/analyze_form.py input.pdf
   ```

   This produces `form_fields.json` listing every field name, type, and whether it's required.

2. Create `field_values.json` mapping each field name to its intended value.

3. Validate the mapping:

   ```bash
   uv run scripts/validate_fields.py form_fields.json field_values.json
   ```

   Fix any errors before proceeding.

4. Fill the form:

   ```bash
   uv run scripts/fill_form.py input.pdf field_values.json output.pdf
   ```

5. Verify the output visually or with:

   ```bash
   uv run scripts/analyze_form.py output.pdf
   ```
````
### Script Design Principles
Scripts run in a non-interactive shell. The agent reads stdout/stderr to decide what to do next. Design accordingly:
**1. No interactive prompts. Ever.**

```python
# BAD: hangs forever
target = input("Target environment: ")

# GOOD: use flags
parser.add_argument("--env", required=True, choices=["dev", "staging", "prod"])
```
**2. Implement `--help`**

This is how the agent discovers the interface. Keep it concise — the output enters the context window.

```
Usage: scripts/process.py [OPTIONS] INPUT_FILE

Options:
  --format FORMAT   Output format: json, csv, table (default: json)
  --output FILE     Write to FILE instead of stdout
  --verbose         Print progress to stderr

Examples:
  scripts/process.py data.csv
  scripts/process.py --format csv --output report.csv data.csv
```
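A sketch of the `argparse` setup behind a help screen like that one. The script name and options mirror the illustrative `process.py` above, not a real tool:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the --help contract shown above; all names are illustrative.
    parser = argparse.ArgumentParser(
        prog="scripts/process.py",
        description="Process INPUT_FILE and emit structured output.",
    )
    parser.add_argument("input_file", metavar="INPUT_FILE")
    parser.add_argument("--format", default="json",
                        choices=["json", "csv", "table"],
                        help="Output format (default: json)")
    parser.add_argument("--output", metavar="FILE",
                        help="Write to FILE instead of stdout")
    parser.add_argument("--verbose", action="store_true",
                        help="Print progress to stderr")
    return parser

args = build_parser().parse_args(["data.csv", "--format", "csv"])
print(args.format)  # csv
```

A nice side effect: `argparse` generates the `--help` text from these declarations, so the interface the agent discovers always matches the flags the script actually accepts.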
**3. Informative error messages**

The error message shapes the agent's next attempt. Vague errors waste a tool-use turn.

```python
# BAD
print("Error: invalid input")
sys.exit(1)

# GOOD
print(f"Error: --format must be one of: json, csv, table. Received: '{args.format}'", file=sys.stderr)
sys.exit(1)
```
**4. Structured output (JSON to stdout, diagnostics to stderr)**

```python
import json, sys

# Data goes to stdout (agent parses it)
json.dump({"status": "ok", "fields": field_list}, sys.stdout)

# Progress/warnings go to stderr (agent reads but doesn't parse)
print("Processing page 3/10...", file=sys.stderr)
```
**5. Idempotent by default**

Agents retry. "Create if not exists" is safer than "create and fail on duplicate."
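A sketch of the principle using SQLite, where idempotency comes almost for free from `IF NOT EXISTS` and `INSERT OR IGNORE`; the table and run names are invented for the example:

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "pipeline.db")

def ensure_schema(path: str) -> None:
    # Every statement tolerates already-existing state, so retries are no-ops.
    with sqlite3.connect(path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS runs (id TEXT PRIMARY KEY, status TEXT)"
        )
        conn.execute("INSERT OR IGNORE INTO runs VALUES ('run-001', 'pending')")

ensure_schema(db_path)
ensure_schema(db_path)  # a retry: no error, no duplicate row
with sqlite3.connect(db_path) as conn:
    print(conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0])  # 1
```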
**6. Predictable output size**

Many agent harnesses truncate tool output at 10-30K characters. If your script might produce large output, default to a summary or support `--limit` / `--offset` pagination.
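A sketch of that pagination contract. The field names (`next_offset`, etc.) are invented, but the idea is to cap the window and tell the agent explicitly how to fetch the next page:

```python
import json

def paginate(rows: list, limit: int, offset: int) -> dict:
    window = rows[offset:offset + limit]
    return {
        "total": len(rows),
        "offset": offset,
        "returned": len(window),
        "rows": window,
        # None signals "you have everything"; otherwise pass this back as --offset
        "next_offset": offset + limit if offset + limit < len(rows) else None,
    }

rows = [{"id": i} for i in range(250)]
page = paginate(rows, limit=100, offset=200)
print(json.dumps({"returned": page["returned"], "next_offset": page["next_offset"]}))
# {"returned": 50, "next_offset": null}
```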
### Self-Contained Scripts (Inline Dependencies)
The agent shouldn't need to run pip install first. Use inline dependency declarations:
**Python (PEP 723 + uv):**

```python
# /// script
# dependencies = [
#     "pdfplumber>=0.10",
#     "beautifulsoup4>=4.12,<5",
# ]
# requires-python = ">=3.10"
# ///
import pdfplumber
# ... rest of script
```
Run with: `uv run scripts/extract.py`
uv creates an isolated environment, installs dependencies, runs the script. First run downloads; subsequent runs use cache.
**Node.js (npx):**

```bash
npx -y eslint@9 --fix .
```

**Deno (npm: imports):**

```typescript
import * as cheerio from "npm:cheerio@1.0.0";
```

**Go:**

```bash
go run golang.org/x/tools/cmd/goimports@v0.28.0 .
```
Always pin versions: unpinned dependencies make a skill non-reproducible.
## Level 4: The Skill as a CLI Toolkit

Complex skills can bundle a full CLI tool that the agent calls for various sub-commands. This is the pattern used by Anthropic's own production skills (docx, xlsx, pptx, pdf).

```
data-pipeline/
├── SKILL.md
├── scripts/
│   ├── pipeline.py            # Main CLI entry point
│   ├── validators/
│   │   ├── schema.py
│   │   └── quality.py
│   └── transforms/
│       ├── normalize.py
│       └── aggregate.py
├── references/
│   ├── schema-format.md
│   └── error-codes.md
└── assets/
    └── default-config.yaml
```
SKILL.md:
````markdown
---
name: data-pipeline
description: >-
  Build, validate, and run ETL data pipelines with quality checks.
  Use when the user wants to process, transform, validate, or load data,
  or mentions ETL, data quality, schema validation, or data ingestion.
compatibility: Requires Python 3.10+ and uv
allowed-tools: Bash(uv:*) Read Write
---

## Quick Reference

All operations go through the pipeline CLI:

```bash
# Validate a schema
uv run scripts/pipeline.py validate-schema data/input.csv

# Run quality checks
uv run scripts/pipeline.py check-quality data/input.csv --rules references/quality-rules.yaml

# Transform data
uv run scripts/pipeline.py transform data/input.csv --config assets/default-config.yaml --output data/output.parquet

# Full pipeline (validate → check → transform → load)
uv run scripts/pipeline.py run --config pipeline.yaml
```

Run `uv run scripts/pipeline.py --help` for full documentation.
Run `uv run scripts/pipeline.py <command> --help` for command-specific help.

## Workflow

1. **Start with schema validation.** Always validate input schema before processing.
2. **Run quality checks.** Review the quality report before transforming.
3. **Transform with the default config** unless the user specifies otherwise.
4. If quality checks fail, consult [error-codes.md](references/error-codes.md) for resolution steps.
5. If the schema format is unfamiliar, consult [schema-format.md](references/schema-format.md).

## Gotchas

- The `transform` command writes Parquet by default. Use `--format csv` for CSV output.
- Column names are normalized to snake_case automatically. To preserve original names: `--preserve-names`
- The `--dry-run` flag on any command shows what would happen without executing.
- Large files (>100MB): use `--streaming` mode to avoid memory issues.
````
### The Pattern: Thin SKILL.md, Fat CLI
The SKILL.md is a routing document: it tells the agent which sub-command to use for each task type. The real logic lives in the Python scripts, where you get:
- Proper argument parsing with `argparse` or `click`
- Unit tests for the pipeline logic
- Type safety, error handling, retry logic
- Dependency management via PEP 723
The agent calls the CLI. The CLI handles execution deterministically. The agent interprets the output and decides what to do next. This is the sweet spot: LLM reasoning for high-level decisions, deterministic code for execution.
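The routing half of that pattern can be sketched with `argparse` subparsers. The command names echo the example above, but the handlers here are stubs, not the real pipeline logic:

```python
import argparse
import json

def cmd_validate_schema(args):
    # Real validation logic would go here; stubbed for the sketch.
    return {"command": "validate-schema", "file": args.input, "ok": True}

def cmd_check_quality(args):
    return {"command": "check-quality", "file": args.input, "issues": []}

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pipeline.py")
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("validate-schema", help="Validate input schema")
    p.add_argument("input")
    p.set_defaults(func=cmd_validate_schema)

    p = sub.add_parser("check-quality", help="Run quality checks")
    p.add_argument("input")
    p.set_defaults(func=cmd_check_quality)
    return parser

args = build_parser().parse_args(["validate-schema", "data.csv"])
print(json.dumps(args.func(args)))  # machine-readable result on stdout
```

Each sub-command gets `--help` for free, and every handler returns JSON on stdout — the contract the agent needs to interpret results.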
## Level 5: Skills with Evaluation and Iteration
Anthropic's skill-creator skill is a meta-skill that builds other skills. It includes a structured evaluation loop:
- Write the skill (SKILL.md + scripts + references)
- Define test cases (input prompts + expected outputs)
- Run the skill against test cases
- Grade results (automated assertions + human review)
- Iterate (fix instructions, re-test)
The evaluation framework uses:
- `evals.json` — test case definitions
- `grading.json` — assertion criteria
- Subagents for grading (comparator, analyzer)
- An eval viewer HTML report for human review
A minimal evals.json entry looks like this:
```json
[
  {
    "name": "basic_pdf_extraction",
    "prompt": "Extract all text from tests/fixtures/sample.pdf and return it as markdown",
    "expected": {
      "contains": ["Chapter 1", "Introduction"],
      "format": "markdown",
      "script_exits_zero": "scripts/validate_output.py"
    }
  },
  {
    "name": "handles_scanned_pdf",
    "prompt": "Extract text from tests/fixtures/scanned.pdf",
    "expected": {
      "contains": ["OCR"],
      "uses_tool": "scripts/analyze_form.py"
    }
  }
]
```
The pattern: define input prompts, specify what the output must contain or which scripts must succeed, then run the skill against each case and compare. This is test-driven development applied to agent instructions.
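A minimal sketch of running the `contains` assertion from such a file. The agent run itself is stubbed with a fixed output string, since the harness that actually invokes the agent is out of scope here:

```python
import json

def grade(case: dict, output: str) -> dict:
    # Check only the "contains" assertions; real graders add format checks,
    # script exit codes, and subagent review on top.
    missing = [s for s in case["expected"].get("contains", []) if s not in output]
    return {"name": case["name"], "passed": not missing, "missing": missing}

cases = json.loads("""
[{"name": "basic_pdf_extraction",
  "prompt": "Extract all text from sample.pdf as markdown",
  "expected": {"contains": ["Chapter 1", "Introduction"]}}]
""")

agent_output = "# Chapter 1\n\nIntroduction\n..."  # stand-in for a real agent run
for case in cases:
    print(json.dumps(grade(case, agent_output)))
```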
You don't need this complexity for every skill. But for production skills deployed across an organization, structured evaluation is the difference between "usually works" and "reliably works."
Source: https://github.com/anthropics/skills/blob/main/skills/skill-creator/SKILL.md
## Best Practices

### Write What the Agent Doesn't Know
Don't explain what a PDF is. Don't explain how HTTP works. Focus on what's specific to your project, team, or domain that the agent would get wrong without instructions.
```markdown
<!-- Wasted tokens — the agent knows this -->
PDF (Portable Document Format) is a common file format that contains
text, images, and other content.

<!-- High value — the agent doesn't know this -->
Use pdfplumber for text extraction. For scanned documents, fall back to
pdf2image with pytesseract.
```
Test: "Would the agent get this wrong without this instruction?" If no, cut it.
### Gotchas Are the Highest-Value Content
Every time an agent makes a mistake you have to correct, add that correction to the gotchas section. This is the fastest path to improving a skill.
```markdown
## Gotchas

- The `users` table uses soft deletes. Always include `WHERE deleted_at IS NULL`.
- User ID is `user_id` in the DB, `uid` in auth, `accountId` in billing.
  All three are the same value.
- The `/health` endpoint returns 200 even if the database is down. Use `/ready`.
- When using Estonian locale, date format is DD.MM.YYYY, not MM/DD/YYYY.
```
### Provide Defaults, Not Menus

```markdown
<!-- BAD: forces the agent to choose -->
You can use pypdf, pdfplumber, PyMuPDF, or pdf2image for extraction...

<!-- GOOD: clear default, escape hatch for edge case -->
Use pdfplumber for text extraction.
For scanned PDFs requiring OCR, use pdf2image with pytesseract instead.
```
### Procedures Over Declarations

```markdown
<!-- BAD: specific answer, only works for this exact task -->
Join `orders` to `customers` on `customer_id`, filter `region = 'EMEA'`.

<!-- GOOD: reusable method -->
1. Read schema from `references/schema.yaml` to find relevant tables
2. Join using the `_id` foreign key convention
3. Apply user's filters as WHERE clauses
4. Aggregate numeric columns as needed
```
### Validate Before Proceeding
The plan-validate-execute pattern prevents cascading errors:
1. Generate the migration script → save to `migration.sql`
2. Run `scripts/validate_migration.py migration.sql` to check for:
   - Missing rollback statements
   - References to non-existent tables
   - Data loss risks
3. If validation fails, fix and re-validate
4. Only after validation passes: execute the migration
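A hypothetical sketch of what such a `validate_migration.py` might check. It is regex-based for brevity; a production validator should actually parse the SQL:

```python
import re

def validate_migration(sql: str) -> list[str]:
    errors = []
    # Check 1: every migration must carry its rollback
    # (the "-- rollback:" convention is invented for this sketch)
    if not re.search(r"--\s*rollback:", sql, re.IGNORECASE):
        errors.append("Missing rollback statements")
    # Check 2: scan only executable lines, not the rollback comments
    live = "\n".join(
        line for line in sql.splitlines() if not line.lstrip().startswith("--")
    )
    for risky in ("DROP TABLE", "DROP COLUMN", "TRUNCATE"):
        if risky in live.upper():
            errors.append(f"Data loss risk: {risky}")
    return errors

migration = """ALTER TABLE users ADD COLUMN locale TEXT;
-- rollback: ALTER TABLE users DROP COLUMN locale;
"""
print(validate_migration(migration))  # []
print(validate_migration("DROP TABLE users;"))
```

Returning a list of errors (empty means "safe to proceed") gives the agent a clear, parseable signal for step 3.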
## Installing Skills

### Claude Code

```bash
# From a local directory:
# skills in .claude/skills/ are auto-discovered

# Upload as zip in claude.ai:
# Settings > Capabilities > Skills > Upload

# From the skill directory (used as slash command):
# place SKILL.md in .claude/skills/my-skill/SKILL.md
```
### VS Code / GitHub Copilot

Place the skill at `.github/skills/my-skill/SKILL.md`, or configure locations via the `chat.skillsLocations` setting.
### Cursor

Place in `.cursor/skills/` or configure in settings.
### OpenAI Codex

```bash
codex skills add ./my-skill
```

Or place in `.codex/skills/`.
### Cross-Platform
The Agent Skills format is the same everywhere. The same SKILL.md file works across all platforms. Distribution mechanisms differ (zip upload, directory placement, marketplace), but the file format is identical.
Validate your skill against the spec:
```bash
# Using the reference validator
npx skills-ref validate ./my-skill

# Checks: frontmatter validity, name conventions, line count, token budget
```
## Skills vs MCP: Decision Framework

- **Do I need the agent to CALL something external?**
  → MCP (API, database, browser, file system)
- **Do I need the agent to KNOW something specific?**
  → Skill (procedures, conventions, domain knowledge)
- **Do I need the agent to RUN deterministic code?**
  → Skill with scripts (validation, formatting, analysis)
- **Do I need both knowledge AND external access?**
  → Skill (instructions) + MCP (tools), used together
- **Am I repeating the same prompt instructions across conversations?**
  → Extract into a skill
- **Is this a one-off task with unique context?**
  → System prompt or just tell the agent directly
Skills and MCP compose naturally. A skill can instruct the agent to use MCP tools in a specific sequence:
```markdown
## Deployment Workflow

1. Run tests: `scripts/run_tests.sh`
2. Check GitHub CI status using the GitHub MCP server
3. If CI passes, deploy using: `scripts/deploy.sh --env staging`
4. Verify deployment using the Playwright MCP to check the health endpoint
5. If health check fails, rollback: `scripts/deploy.sh --rollback`
```
The skill provides the procedure. MCP provides the capabilities. The agent orchestrates.
## Example: A Course-Relevant Skill

Here's a skill you might build for the homework:

```
api-provider-adapter/
├── SKILL.md
├── scripts/
│   ├── test_provider.py
│   └── generate_adapter.py
├── references/
│   ├── openai-format.md
│   └── anthropic-format.md
└── assets/
    └── adapter-template.ts
```
SKILL.md:
````markdown
---
name: api-provider-adapter
description: >-
  Generate dual-provider API adapters for OpenAI and Anthropic.
  Use when implementing tool calling, message formatting, or
  streaming across multiple LLM providers. Use when the user
  mentions dual provider, multi-provider, OpenAI/Anthropic
  compatibility, or adapter pattern.
compatibility: Requires Node.js 18+ or Python 3.10+
---

## Workflow

1. Identify which API features need adaptation (messages, tools, streaming)
2. For tool calling differences, consult [openai-format.md](references/openai-format.md)
   and [anthropic-format.md](references/anthropic-format.md)
3. Generate adapter code using the template:

   ```bash
   uv run scripts/generate_adapter.py --features tools,streaming --lang typescript
   ```

4. Test against both providers:

   ```bash
   uv run scripts/test_provider.py --provider openai --adapter ./adapter.ts
   uv run scripts/test_provider.py --provider anthropic --adapter ./adapter.ts
   ```

## Key Differences to Handle

- OpenAI: arguments as JSON string (must parse), role "tool" for results
- Anthropic: arguments as parsed object, tool_result inside "user" message
- OpenAI: `finish_reason: "tool_calls"`, Anthropic: `stop_reason: "tool_use"`
- See lecture 26 for the complete comparison table

## Gotchas

- OpenAI's Responses API uses different item types than Chat Completions.
  The adapter must handle both if supporting legacy code.
- Anthropic returns `input` (parsed JSON), OpenAI returns `arguments` (JSON string).
  Forgetting to JSON.parse() the OpenAI side is the #1 student bug.
- Streaming format differs significantly. Don't try to unify streaming
  into a common format on the first pass — get non-streaming working first.
````
## References

### Specification
- Agent Skills Specification: https://agentskills.io/specification
- Best Practices: https://agentskills.io/skill-creation/best-practices
- Optimizing Descriptions: https://agentskills.io/skill-creation/optimizing-descriptions
- Evaluating Skills: https://agentskills.io/skill-creation/evaluating-skills
- Using Scripts: https://agentskills.io/skill-creation/using-scripts
- GitHub spec repo: https://github.com/agentskills/agentskills
### Examples and Registries
- Anthropic's official skills repo: https://github.com/anthropics/skills
- Skill-creator skill (meta): https://github.com/anthropics/skills/blob/main/skills/skill-creator/SKILL.md
- Awesome Claude Skills: https://github.com/travisvn/awesome-claude-skills
- Awesome Claude Code (skills + hooks + subagents): https://github.com/hesreallyhim/awesome-claude-code
### Platform-Specific Documentation
- Claude Code skills: https://code.claude.com/docs/en/skills
- VS Code / Copilot skills: https://code.visualstudio.com/docs/copilot/customization/agent-skills
- OpenAI Codex skills: https://developers.openai.com/codex/skills
- Claude.ai custom skills: https://support.claude.com/en/articles/12512198-how-to-create-custom-skills
### Deep Dives
- Architecture deep dive: https://leehanchung.github.io/blogs/2025/10/26/claude-skills-deep-dive/
- Anthropic's production skills (docx/pdf/pptx/xlsx): see `skills/` in https://github.com/anthropics/skills
- Codecademy tutorial: https://www.codecademy.com/article/how-to-build-claude-skills