This page introduces the skill-creator meta-skill: what it is, why it exists, and how its internal subsystems fit together. It is the entry point for the section; detailed documentation for each subsystem is in the child pages (Skill Creator Workflow, Evaluation Pipeline, Eval Viewer and Feedback Collection, Benchmarking and Reporting, and Description Optimization). For background on what a skill is and how SKILL.md works, see SKILL.md Format Specification.
skill-creator is a skill that creates and iteratively improves other skills. It lives at skills/skill-creator/SKILL.md and, like every other skill in the repository, is activated by a natural language prompt — in this case, any prompt expressing a desire to build, test, benchmark, or improve a skill.
Its defining characteristic is that it orchestrates a full software-development-style feedback loop, but driven entirely through Claude: drafting a SKILL.md, spawning subagents to execute test cases, grading their outputs against assertions, aggregating statistics, presenting results in a browser-based viewer, and repeating until quality is satisfactory.
Because of this, skill-creator is the most operationally complex skill in the repository. It coordinates several distinct subsystems — each with its own agents, scripts, and data formats — that together implement a mini CI/CD pipeline for skill development.
Sources: skills/skill-creator/SKILL.md:1-31
The skill-creator description field (which drives Claude's trigger logic) covers five scenarios:
| Scenario | Typical prompt |
|---|---|
| Build a skill from scratch | "I want to make a skill for X" |
| Improve an existing skill | "Can you improve this skill I wrote?" |
| Run evaluations on a skill | "Run evals to test my skill" |
| Benchmark with variance analysis | "Benchmark this skill across N runs" |
| Optimize description accuracy | "Improve the triggering accuracy of this skill" |
The skill is intentionally flexible about ordering. A user may arrive with just an idea, or with a half-finished SKILL.md and a set of test cases already written. skill-creator is designed to detect where the user is in the process and continue from there.
Sources: skills/skill-creator/SKILL.md:1-26
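Concretely, the description lives in the SKILL.md YAML frontmatter. The following is a hypothetical sketch of what a five-scenario description might look like — illustrative wording only, not the actual skill-creator description (see SKILL.md Format Specification for the exact frontmatter fields):

```yaml
---
name: skill-creator
description: >
  Use when the user wants to build a new skill, improve an existing
  skill, run evals against a skill, benchmark a skill across multiple
  runs, or optimize a skill's triggering description.
---
```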
The diagram below maps the lifecycle phases to the concrete code entities that implement each one.
Diagram: Lifecycle Phases Mapped to Code Entities
Sources: skills/skill-creator/SKILL.md:163-416
Diagram: skill-creator Directory Structure Mapped to Subsystems
Sources: skills/skill-creator/SKILL.md:453-464, skills/skill-creator/references/schemas.md:1-10
When skill-creator runs evaluations, it creates a sibling workspace directory next to the skill being developed. This keeps all evaluation artifacts separate from the skill source.
<skill-name>-workspace/
├── skill-snapshot/        # copy of the "old" skill for baseline (when improving)
├── evals/
│   └── evals.json         # test prompts and assertions
├── iteration-1/
│   ├── benchmark.json     # aggregated stats (produced by aggregate_benchmark.py)
│   ├── benchmark.md       # human-readable summary
│   ├── eval-0/            # named after what the eval tests
│   │   ├── eval_metadata.json
│   │   ├── with_skill/
│   │   │   ├── outputs/   # files produced by the skill
│   │   │   ├── grading.json
│   │   │   ├── metrics.json
│   │   │   └── timing.json
│   │   └── without_skill/
│   │       ├── outputs/
│   │       ├── grading.json
│   │       ├── metrics.json
│   │       └── timing.json
│   └── eval-1/
│       └── ...
└── iteration-2/
    └── ...
Sources: skills/skill-creator/SKILL.md:163-198, skills/skill-creator/references/schemas.md:85-160
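The layout above can be scaffolded in a few lines of Python. This sketch is illustrative only — skill-creator's own setup steps live in the SKILL.md instructions, and the skill name here is a placeholder; the per-eval JSON files are written later by the executor, grader, and aggregator:

```python
from pathlib import Path

def scaffold_workspace(skill_name: str, iteration: int, num_evals: int) -> Path:
    """Create the sibling workspace layout for one eval iteration."""
    ws = Path(f"{skill_name}-workspace")
    (ws / "evals").mkdir(parents=True, exist_ok=True)
    it_dir = ws / f"iteration-{iteration}"
    for i in range(num_evals):
        # Each eval gets a with_skill and a without_skill run directory.
        for condition in ("with_skill", "without_skill"):
            (it_dir / f"eval-{i}" / condition / "outputs").mkdir(
                parents=True, exist_ok=True
            )
    return ws

ws = scaffold_workspace("my-skill", iteration=1, num_evals=2)
```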
The subsystems communicate through a set of JSON files. Each file has a defined schema documented in references/schemas.md.
| File | Produced by | Consumed by |
|---|---|---|
| evals/evals.json | skill-creator (Claude drafts it) | Executor subagents, grader, aggregate script |
| eval_metadata.json | skill-creator | Grader, viewer |
| timing.json | Captured from subagent task notification | aggregate_benchmark.py |
| metrics.json | Executor subagent | Grader, aggregate_benchmark.py |
| grading.json | agents/grader.md | aggregate_benchmark.py, viewer |
| benchmark.json | scripts/aggregate_benchmark.py | Viewer (Benchmark tab) |
| feedback.json | eval-viewer/viewer.html (user action) | skill-creator (next iteration) |
| comparison.json | agents/comparator.md | agents/analyzer.md |
| analysis.json | agents/analyzer.md | skill-creator (improvement suggestions) |
Sources: skills/skill-creator/references/schemas.md:1-430, skills/skill-creator/SKILL.md:268-282
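As a concrete illustration of this data flow, here is a minimal sketch of the aggregation step: walk an iteration directory, fold each condition's grading.json into pass counts, and emit the kind of summary that benchmark.json holds. The passed field and the summary shape are assumptions for illustration, not the documented schema (see references/schemas.md for the real one):

```python
import json
import tempfile
from pathlib import Path

def aggregate(iteration_dir: Path) -> dict:
    """Fold per-eval grading.json files into per-condition pass rates."""
    summary: dict = {}
    for grading_path in sorted(iteration_dir.glob("eval-*/*/grading.json")):
        condition = grading_path.parent.name      # with_skill / without_skill
        grading = json.loads(grading_path.read_text())
        bucket = summary.setdefault(condition, {"passed": 0, "total": 0})
        bucket["total"] += 1
        bucket["passed"] += bool(grading.get("passed"))
    for bucket in summary.values():
        bucket["pass_rate"] = bucket["passed"] / bucket["total"]
    return summary

# Demo on synthetic data (hypothetical results, not real eval output).
demo = Path(tempfile.mkdtemp())
for cond, passed in (("with_skill", True), ("without_skill", False)):
    d = demo / "eval-0" / cond
    d.mkdir(parents=True)
    (d / "grading.json").write_text(json.dumps({"passed": passed}))
result = aggregate(demo)
```

The real scripts/aggregate_benchmark.py also consumes metrics.json and timing.json; this sketch covers only the grading fold.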
In addition to the standard grading pass, skill-creator supports an optional blind comparison mode. Two outputs (e.g., with_skill and without_skill) are scored independently by agents/comparator.md, which is not told which output came from which condition. The agents/analyzer.md agent then explains why the winner won and produces concrete improvement_suggestions.
This mode is optional and most useful when aggregate pass rates are close and the user needs a more nuanced quality signal. It requires subagents and is not available on Claude.ai.
Sources: skills/skill-creator/SKILL.md:325-329
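The core of blindness is label shuffling. The following sketch shows the bookkeeping with the comparator stubbed as a plain function — nothing here reflects the actual agents/comparator.md prompt or protocol:

```python
import random

def blind_compare(outputs: dict, judge) -> str:
    """Present outputs under shuffled A/B labels, then map the winner back."""
    conditions = list(outputs)            # e.g. ["with_skill", "without_skill"]
    random.shuffle(conditions)
    labels = ["A", "B"]
    blinded = {label: outputs[c] for label, c in zip(labels, conditions)}
    winner_label = judge(blinded)         # judge sees only "A" and "B"
    return dict(zip(labels, conditions))[winner_label]

# Stub judge that prefers the longer output, for demonstration only.
winner = blind_compare(
    {"with_skill": "a detailed, well-structured answer", "without_skill": "short"},
    judge=lambda blinded: max(blinded, key=lambda k: len(blinded[k])),
)
```

Because the judge only ever sees A and B, its preference cannot be biased by knowing which run used the skill; the analyzer then works from the de-anonymized winner.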
skill-creator behavior varies by deployment context:
| Feature | Claude Code | Claude.ai | Cowork |
|---|---|---|---|
| Parallel subagent execution | ✓ | ✗ (runs sequentially, no baseline) | ✓ |
| Browser-based eval viewer | ✓ (HTTP server) | ✗ (inline in conversation) | ✓ (--static flag) |
| Quantitative benchmarking | ✓ | ✗ | ✓ |
| Description optimization (run_loop.py) | ✓ | ✗ | ✓ |
| Blind comparison | ✓ | ✗ | ✓ |
| Packaging (package_skill.py) | ✓ | ✓ | ✓ |
The --static <output_path> flag on eval-viewer/generate_review.py produces a standalone HTML file instead of starting an HTTP server, which is required in environments without a display.
Sources: skills/skill-creator/SKILL.md:420-450