This page documents the description optimization subsystem inside skill-creator. Its purpose is to iteratively refine a SKILL.md description field so that Claude triggers the skill on the right queries and ignores it on irrelevant ones.
For background on how descriptions are used at runtime to match skills to user intent, see SKILL.md Format Specification. For the broader skill creation lifecycle that invokes description optimization as a final step, see Skill Creator Workflow. For reporting on prior evaluation runs (a related but distinct subsystem), see Benchmarking and Reporting.
A skill's description field (max 1024 characters) is the sole signal Claude uses to decide whether to invoke a skill for a given user query. A poorly worded description causes two failure modes:
| Failure Mode | Meaning |
|---|---|
| Failed trigger | Skill should have activated but the description didn't match the query |
| False trigger | Skill activated when it should have stayed silent |
The optimization loop treats this as a search problem: given a set of labeled test queries (should-trigger and should-not-trigger), find the description text that maximizes pass rate across both categories.
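The objective can be stated compactly. A minimal sketch, assuming each result entry carries the `should_trigger` and `pass` fields described below (the function name `pass_rate` is hypothetical, not from the scripts):

```python
def pass_rate(results: list[dict]) -> float:
    """Fraction of labeled queries whose observed behavior matched the label.

    A query "passes" when the skill triggered if and only if it was supposed to,
    so a single score covers both failure modes at once.
    """
    if not results:
        return 0.0
    return sum(r["pass"] for r in results) / len(results)

# The loop searches for the description text maximizing this score.
results = [
    {"query": "make a PDF report", "should_trigger": True, "pass": True},
    {"query": "what's the weather", "should_trigger": False, "pass": True},
    {"query": "export my slides", "should_trigger": True, "pass": False},
]
score = pass_rate(results)  # 2 of 3 queries behaved as labeled
```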
| File | Role |
|---|---|
| skills/skill-creator/scripts/improve_description.py | Core optimization logic; calls the Claude API with extended thinking |
| skills/skill-creator/scripts/generate_report.py | Renders the iteration history as an HTML report |
| skills/skill-creator/assets/eval_review.html | Browser UI for building and exporting eval query sets |
Diagram: Description Optimization Data Flow
Sources: skills/skill-creator/scripts/improve_description.py1-248
improve_description.py — Core Function
improve_description() at skills/skill-creator/scripts/improve_description.py19-190 is the central entry point. Its parameters:
| Parameter | Type | Description |
|---|---|---|
| client | anthropic.Anthropic | Initialized API client |
| skill_name | str | Skill identifier (for prompt context) |
| skill_content | str | Full SKILL.md body (for semantic context) |
| current_description | str | The description being replaced |
| eval_results | dict | Output from run_eval.py |
| history | list[dict] | All previous description attempts |
| model | str | Claude model identifier |
| test_results | dict \| None | Results on held-out test queries |
| log_dir | Path \| None | Directory to write transcript logs |
| iteration | int \| None | Current iteration number (for log naming) |
Before constructing the prompt, the function splits evaluation results into two categories:
failed_triggers = results where should_trigger=True and pass=False
false_triggers = results where should_trigger=False and pass=False
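The categorization above amounts to two list comprehensions over the results. A sketch of that partitioning step, assuming the result shape documented on this page (the helper name `split_failures` is illustrative):

```python
def split_failures(eval_results: dict) -> tuple[list[dict], list[dict]]:
    """Partition failing queries into the two failure modes.

    failed_triggers: should have activated but didn't match the query.
    false_triggers:  activated when it should have stayed silent.
    """
    results = eval_results["results"]
    failed_triggers = [r for r in results if r["should_trigger"] and not r["pass"]]
    false_triggers = [r for r in results if not r["should_trigger"] and not r["pass"]]
    return failed_triggers, false_triggers

eval_results = {"results": [
    {"query": "fill this PDF form", "should_trigger": True, "pass": False},
    {"query": "summarize this email", "should_trigger": False, "pass": False},
    {"query": "merge these PDFs", "should_trigger": True, "pass": True},
]}
failed, false_ = split_failures(eval_results)
```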
skills/skill-creator/scripts/improve_description.py32-39
Each entry in eval_results["results"] carries:
| Field | Type | Meaning |
|---|---|---|
| query | str | The test query string |
| should_trigger | bool | Expected behavior |
| pass | bool | Whether the description passed this query |
| triggers | int | How many runs triggered (out of runs) |
| runs | int | Total runs executed for this query |
The triggers/runs ratio appears in the prompt so Claude understands probabilistic failure (e.g., triggered 1/3 times) versus complete failure (0/3).
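Concretely, one result entry might look like the following (field names from the table above; the query text and counts are invented for illustration):

```python
# Illustrative eval result entry: a flaky match, not a total miss.
result = {
    "query": "extract tables from this PDF",
    "should_trigger": True,
    "pass": False,
    "triggers": 1,  # triggered on 1 of 3 runs
    "runs": 3,
}
# The ratio shown in the prompt distinguishes probabilistic failure (1/3)
# from complete failure (0/3).
ratio = f"{result['triggers']}/{result['runs']}"
```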
The prompt sent to Claude at skills/skill-creator/scripts/improve_description.py49-112 is assembled in sections:
Diagram: Prompt Assembly Sections
Sources: skills/skill-creator/scripts/improve_description.py49-112
Key constraints communicated to Claude in the prompt:
skills/skill-creator/scripts/improve_description.py114-121
client.messages.create(
model=model,
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 10000},
messages=[{"role": "user", "content": prompt}],
)
Extended thinking is deliberately enabled: the model reasons about why past attempts succeeded or failed before generating a new candidate. The thinking block is captured and written to the transcript log but is not parsed for the description itself.
The response is parsed for content between <new_description> and </new_description> tags:
skills/skill-creator/scripts/improve_description.py133-135
If the extracted description exceeds 1024 characters (the hard limit imposed by the SKILL.md format), a follow-up API call is made in the same conversation thread:
skills/skill-creator/scripts/improve_description.py149-181
The follow-up prompt instructs the model to shorten the text while preserving the most important trigger words and intent coverage. This is a three-turn conversation: the original prompt, the model's long response, then the shortening request. Extended thinking is also enabled on the rewrite call.
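The three-turn structure can be sketched as a messages list; the wording of the shortening request here is a paraphrase, not the script's exact prompt:

```python
def build_rewrite_messages(original_prompt: str, long_response: str,
                           limit: int = 1024) -> list[dict]:
    """Assemble the three-turn rewrite conversation described above."""
    return [
        {"role": "user", "content": original_prompt},
        {"role": "assistant", "content": long_response},  # the over-length attempt
        {"role": "user", "content": (
            f"Your description exceeds {limit} characters. Shorten it while "
            "preserving the most important trigger words and intent coverage. "
            "Reply in <new_description> tags."
        )},
    ]

messages = build_rewrite_messages("<original prompt>", "<over-length response>")
```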
Diagram: Length Constraint Handling
Sources: skills/skill-creator/scripts/improve_description.py149-181
Each call to improve_description() writes a JSON transcript to log_dir/improve_iter_N.json. The transcript contains:
| Key | Contents |
|---|---|
| iteration | Iteration number |
| prompt | Full prompt sent to Claude |
| thinking | Extended thinking block text |
| response | Raw model text response |
| parsed_description | Description extracted from tags |
| char_count | Length of parsed description |
| over_limit | Boolean flag |
| rewrite_prompt | (if over limit) Second prompt |
| rewrite_thinking | (if over limit) Thinking from rewrite |
| rewrite_response | (if over limit) Raw rewrite response |
| rewrite_description | (if over limit) Shortened text |
| rewrite_char_count | (if over limit) Final length |
| final_description | The description actually returned |
skills/skill-creator/scripts/improve_description.py138-188
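The logging mechanism is a straightforward JSON dump keyed by iteration. A minimal sketch, assuming the transcript keys from the table above (the helper name `write_transcript` is hypothetical):

```python
import json
from pathlib import Path

def write_transcript(log_dir: Path, iteration: int, record: dict) -> Path:
    """Persist one iteration's transcript as improve_iter_N.json."""
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"improve_iter_{iteration}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

record = {
    "iteration": 2,
    "prompt": "...",
    "parsed_description": "Use when working with PDFs.",
    "char_count": 27,
    "over_limit": False,
    "final_description": "Use when working with PDFs.",
}
```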
At the end of each iteration, the caller appends the current attempt to the history list. The structure stored per history entry is:
skills/skill-creator/scripts/improve_description.py232-243
When the history is passed back into the next iteration's prompt, each past attempt is rendered as an <attempt> XML block showing the description, train/test scores, and per-query pass/fail status. This prevents the model from cycling back to previously tried phrasings.
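The exact XML shape of those blocks is not shown on this page; the following is only a guess at how one past attempt might be rendered, to illustrate the idea of surfacing prior scores to the model:

```python
def render_attempt(entry: dict) -> str:
    """Render one past attempt as an <attempt> block (format is a guess)."""
    lines = [f"  <description>{entry['description']}</description>",
             f"  <train_score>{entry['train_score']}</train_score>"]
    if entry.get("test_score") is not None:
        lines.append(f"  <test_score>{entry['test_score']}</test_score>")
    for q in entry.get("queries", []):
        status = "pass" if q["pass"] else "fail"
        lines.append(f'  <query result="{status}">{q["query"]}</query>')
    return "<attempt>\n" + "\n".join(lines) + "\n</attempt>"

block = render_attempt({
    "description": "Use when creating PDFs.",
    "train_score": "3/4",
    "test_score": None,
    "queries": [{"query": "export my slides", "pass": False}],
})
```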
The optimization loop supports a holdout set to detect overfitting. Queries in eval_results are used for training (the model sees these failures when generating improvements). Queries in test_results are a held-out set the optimizer never trains on directly — they are reported as a separate score.
generate_report.py names this the holdout count and tracks it in data["holdout"].
Diagram: Train/Test Query Flow
Sources: skills/skill-creator/scripts/generate_report.py19-31 skills/skill-creator/scripts/improve_description.py42-45
generate_report.py — Iteration History Visualization
generate_report.py renders the full run loop history as a static HTML page. The generate_html() function at skills/skill-creator/scripts/generate_report.py16-301 takes:
| Parameter | Type | Description |
|---|---|---|
| data | dict | Full run loop output JSON |
| auto_refresh | bool | Adds <meta http-equiv="refresh" content="5"> for live updates |
| skill_name | str | Used in page title |
The HTML table has one row per iteration. Columns are:
| Column | Contents |
|---|---|
| Iter | Iteration number |
| Train | correct/total_runs across all train queries |
| Test | correct/total_runs across all test (holdout) queries |
| Description | The candidate description text |
| (per train query) | ✓ or ✗ with triggers/runs sub-label |
| (per test query) | Same, visually distinguished with background color |
Score cell CSS classes (score-good, score-ok, score-bad) are assigned by thresholds: ≥ 80% correct → good, ≥ 50% → ok, otherwise bad.
skills/skill-creator/scripts/generate_report.py244-256
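The threshold logic reduces to a small classifier. A sketch using the cutoffs stated above (the function name `score_class` is illustrative):

```python
def score_class(correct: int, total: int) -> str:
    """Map a correct/total ratio to the report's CSS class."""
    if total == 0:
        return "score-bad"
    ratio = correct / total
    if ratio >= 0.8:
        return "score-good"
    if ratio >= 0.5:
        return "score-ok"
    return "score-bad"
```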
The best-performing iteration (by test score if available, otherwise train score) receives the best-row CSS class for visual highlighting. skills/skill-creator/scripts/generate_report.py206-210
generate_report.py <input.json> [-o output.html] [--skill-name NAME] [--auto-refresh]
Pass - as input to read from stdin. When --auto-refresh is active, the page reloads every 5 seconds, making it useful while a run loop is still executing.
Sources: skills/skill-creator/scripts/generate_report.py304-326
eval_review.html at skills/skill-creator/assets/eval_review.html1-147 is a browser-based tool for building and editing the labeled query set that feeds into the optimization loop.
Features:
| Feature | Implementation |
|---|---|
| Add query | addRow() — appends a blank entry, focuses the new textarea |
| Toggle should-trigger | updateTrigger() — toggles should_trigger boolean, re-renders |
| Edit query text | updateQuery() — updates in-memory array on change |
| Delete query | deleteRow() — splices by original index |
| Export | exportEvalSet() — downloads eval_set.json as a Blob |
The export format is a JSON array of {query, should_trigger} objects, which is the format consumed by run_eval.py.
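A sketch of that export format and a validation check for it (the validator is illustrative, not part of the scripts):

```python
import json

# Illustrative eval_set.json content as consumed by run_eval.py.
eval_set = [
    {"query": "fill out this PDF form", "should_trigger": True},
    {"query": "what's the capital of France?", "should_trigger": False},
]

def validate_eval_set(entries: list) -> bool:
    """Check each entry has exactly the two expected fields with the right types."""
    return all(
        set(e) == {"query", "should_trigger"}
        and isinstance(e["query"], str)
        and isinstance(e["should_trigger"], bool)
        for e in entries
    )

# Round-trip through JSON as the browser export would.
loaded = json.loads(json.dumps(eval_set))
```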
The page is rendered server-side with placeholders (__SKILL_NAME_PLACEHOLDER__, __SKILL_DESCRIPTION_PLACEHOLDER__, __EVAL_DATA_PLACEHOLDER__) replaced before serving.
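The substitution mechanism amounts to string replacement on the template before serving. A minimal sketch using the placeholder names above (the function name `render_page` is hypothetical):

```python
def render_page(template: str, skill_name: str,
                description: str, eval_json: str) -> str:
    """Fill the three placeholders in eval_review.html before serving."""
    return (template
            .replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
            .replace("__SKILL_DESCRIPTION_PLACEHOLDER__", description)
            .replace("__EVAL_DATA_PLACEHOLDER__", eval_json))

html = render_page("<h1>__SKILL_NAME_PLACEHOLDER__</h1>", "pdf-builder", "d", "[]")
```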
Sources: skills/skill-creator/assets/eval_review.html62-144
Diagram: Core Data Structures
Sources: skills/skill-creator/scripts/improve_description.py32-45 skills/skill-creator/scripts/generate_report.py19-31 skills/skill-creator/scripts/improve_description.py138-188