This page introduces the skill-creator meta-skill: what it is, why it exists, and how its internal subsystems fit together. It is the entry point for the section; detailed documentation for each subsystem is in the child pages (Skill Creator Workflow, Evaluation Pipeline, Eval Viewer and Feedback Collection, Benchmarking and Reporting, and Description Optimization). For background on what a skill is and how SKILL.md works, see SKILL.md Format Specification.
skill-creator is a skill that creates and iteratively improves other skills. It lives at skills/skill-creator/SKILL.md and, like every other skill in the repository, is activated by a natural language prompt — in this case, any prompt expressing a desire to build, test, benchmark, or improve a skill.
Its defining characteristic is that it orchestrates a full software-development-style feedback loop, but driven entirely through Claude: drafting a SKILL.md, spawning subagents to execute test cases, grading their outputs against assertions, aggregating statistics, presenting results in a browser-based viewer, and repeating until quality is satisfactory.
Because of this, skill-creator is the most operationally complex skill in the repository. It coordinates several distinct subsystems — each with its own agents, scripts, and data formats — that together implement a mini CI/CD pipeline for skill development.
Sources: skills/skill-creator/SKILL.md:1-31
The skill-creator description field (which drives Claude's trigger logic) covers five scenarios:
| Scenario | Typical prompt |
|---|---|
| Build a skill from scratch | "I want to make a skill for X" |
| Improve an existing skill | "Can you improve this skill I wrote?" |
| Run evaluations on a skill | "Run evals to test my skill" |
| Benchmark with variance analysis | "Benchmark this skill across N runs" |
| Optimize description accuracy | "Improve the triggering accuracy of this skill" |
The skill is intentionally flexible about ordering. A user may arrive with just an idea, or with a half-finished SKILL.md and a set of test cases already written. skill-creator is designed to detect where the user is in the process and continue from there.
Sources: skills/skill-creator/SKILL.md:1-26
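Concretely, the description lives in the SKILL.md YAML frontmatter. The following is a hypothetical sketch of what a five-scenario description might look like — illustrative wording only, not the actual skill-creator description (see SKILL.md Format Specification for the exact frontmatter fields):

```yaml
---
name: skill-creator
description: >
  Use when the user wants to build a new skill, improve an existing
  skill, run evals against a skill, benchmark a skill across multiple
  runs, or optimize a skill's triggering description.
---
```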
The diagram below maps the lifecycle phases to the concrete code entities that implement each one.
Diagram: Lifecycle Phases Mapped to Code Entities
Sources: skills/skill-creator/SKILL.md:163-416
Diagram: skill-creator Directory Structure Mapped to Subsystems
Sources: skills/skill-creator/SKILL.md:453-464, skills/skill-creator/references/schemas.md:1-10
When skill-creator runs evaluations, it creates a sibling workspace directory next to the skill being developed. This keeps all evaluation artifacts separate from the skill source.
<skill-name>-workspace/
├── skill-snapshot/        # copy of the "old" skill for baseline (when improving)
├── evals/
│   └── evals.json         # test prompts and assertions
├── iteration-1/
│   ├── benchmark.json     # aggregated stats (produced by aggregate_benchmark.py)
│   ├── benchmark.md       # human-readable summary
│   ├── eval-0/            # named after what the eval tests
│   │   ├── eval_metadata.json
│   │   ├── with_skill/
│   │   │   ├── outputs/   # files produced by the skill
│   │   │   ├── grading.json
│   │   │   ├── metrics.json
│   │   │   └── timing.json
│   │   └── without_skill/
│   │       ├── outputs/
│   │       ├── grading.json
│   │       ├── metrics.json
│   │       └── timing.json
│   └── eval-1/
│       └── ...
└── iteration-2/
    └── ...
Sources: skills/skill-creator/SKILL.md:163-198, skills/skill-creator/references/schemas.md:85-160
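The layout above can be scaffolded in a few lines of Python. This sketch is illustrative only — skill-creator's own setup steps live in the SKILL.md instructions, and the skill name here is a placeholder; the per-eval JSON files are written later by the executor, grader, and aggregator:

```python
from pathlib import Path

def scaffold_workspace(skill_name: str, iteration: int, num_evals: int) -> Path:
    """Create the sibling workspace layout for one eval iteration."""
    ws = Path(f"{skill_name}-workspace")
    (ws / "evals").mkdir(parents=True, exist_ok=True)
    it_dir = ws / f"iteration-{iteration}"
    for i in range(num_evals):
        # Each eval gets a with_skill and a without_skill run directory.
        for condition in ("with_skill", "without_skill"):
            (it_dir / f"eval-{i}" / condition / "outputs").mkdir(
                parents=True, exist_ok=True
            )
    return ws

ws = scaffold_workspace("my-skill", iteration=1, num_evals=2)
```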
The subsystems communicate through a set of JSON files. Each file has a defined schema documented in references/schemas.md.
| File | Produced by | Consumed by |
|---|---|---|
| evals/evals.json | skill-creator (Claude drafts it) | Executor subagents, grader, aggregate script |
| eval_metadata.json | skill-creator | Grader, viewer |
| timing.json | Captured from subagent task notification | aggregate_benchmark.py |
| metrics.json | Executor subagent | Grader, aggregate_benchmark.py |
| grading.json | agents/grader.md | aggregate_benchmark.py, viewer |
| benchmark.json | scripts/aggregate_benchmark.py | Viewer (Benchmark tab) |
| feedback.json | eval-viewer/viewer.html (user action) | skill-creator (next iteration) |
| comparison.json | agents/comparator.md | agents/analyzer.md |
| analysis.json | agents/analyzer.md | skill-creator (improvement suggestions) |
Sources: skills/skill-creator/references/schemas.md:1-430, skills/skill-creator/SKILL.md:268-282
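As a concrete illustration of this data flow, here is a minimal sketch of the aggregation step: walk an iteration directory, fold each condition's grading.json into pass counts, and emit the kind of summary that benchmark.json holds. The passed field and the summary shape are assumptions for illustration, not the documented schema (see references/schemas.md for the real one):

```python
import json
import tempfile
from pathlib import Path

def aggregate(iteration_dir: Path) -> dict:
    """Fold per-eval grading.json files into per-condition pass rates."""
    summary: dict = {}
    for grading_path in sorted(iteration_dir.glob("eval-*/*/grading.json")):
        condition = grading_path.parent.name      # with_skill / without_skill
        grading = json.loads(grading_path.read_text())
        bucket = summary.setdefault(condition, {"passed": 0, "total": 0})
        bucket["total"] += 1
        bucket["passed"] += bool(grading.get("passed"))
    for bucket in summary.values():
        bucket["pass_rate"] = bucket["passed"] / bucket["total"]
    return summary

# Demo on synthetic data (hypothetical results, not real eval output).
demo = Path(tempfile.mkdtemp())
for cond, passed in (("with_skill", True), ("without_skill", False)):
    d = demo / "eval-0" / cond
    d.mkdir(parents=True)
    (d / "grading.json").write_text(json.dumps({"passed": passed}))
result = aggregate(demo)
```

The real scripts/aggregate_benchmark.py also consumes metrics.json and timing.json; this sketch covers only the grading fold.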
In addition to the standard grading pass, skill-creator supports an optional blind comparison mode. Two outputs (e.g., with_skill and without_skill) are scored independently by agents/comparator.md, which is not told which output came from which condition. The agents/analyzer.md agent then explains why the winner won and produces concrete improvement_suggestions.
This mode is optional and most useful when aggregate pass rates are close and the user needs a more nuanced quality signal. It requires subagents and is not available on Claude.ai.
Sources: skills/skill-creator/SKILL.md:325-329
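The core of blindness is label shuffling. The following sketch shows the bookkeeping with the comparator stubbed as a plain function — nothing here reflects the actual agents/comparator.md prompt or protocol:

```python
import random

def blind_compare(outputs: dict, judge) -> str:
    """Present outputs under shuffled A/B labels, then map the winner back."""
    conditions = list(outputs)            # e.g. ["with_skill", "without_skill"]
    random.shuffle(conditions)
    labels = ["A", "B"]
    blinded = {label: outputs[c] for label, c in zip(labels, conditions)}
    winner_label = judge(blinded)         # judge sees only "A" and "B"
    return dict(zip(labels, conditions))[winner_label]

# Stub judge that prefers the longer output, for demonstration only.
winner = blind_compare(
    {"with_skill": "a detailed, well-structured answer", "without_skill": "short"},
    judge=lambda blinded: max(blinded, key=lambda k: len(blinded[k])),
)
```

Because the judge only ever sees A and B, its preference cannot be biased by knowing which run used the skill; the analyzer then works from the de-anonymized winner.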
skill-creator behavior varies by deployment context:
| Feature | Claude Code | Claude.ai | Cowork |
|---|---|---|---|
| Parallel subagent execution | ✓ | ✗ (runs sequentially, no baseline) | ✓ |
| Browser-based eval viewer | ✓ (HTTP server) | ✗ (inline in conversation) | ✓ (--static flag) |
| Quantitative benchmarking | ✓ | ✗ | ✓ |
| Description optimization (run_loop.py) | ✓ | ✗ | ✓ |
| Blind comparison | ✓ | ✗ | ✓ |
| Packaging (package_skill.py) | ✓ | ✓ | ✓ |
The --static <output_path> flag on eval-viewer/generate_review.py produces a standalone HTML file instead of starting an HTTP server, which is required in environments without a display.
Sources: skills/skill-creator/SKILL.md:420-450