This page documents the description optimization subsystem inside skill-creator. Its purpose is to iteratively refine a SKILL.md description field so that Claude triggers the skill on the right queries and ignores it on irrelevant ones.
For background on how descriptions are used at runtime to match skills to user intent, see SKILL.md Format Specification. For the broader skill creation lifecycle that invokes description optimization as a final step, see Skill Creator Workflow. For reporting on prior evaluation runs (a related but distinct subsystem), see Benchmarking and Reporting.
A skill's description field (max 1024 characters) is the sole signal Claude uses to decide whether to invoke a skill for a given user query. A poorly worded description causes two failure modes:
| Failure Mode | Meaning |
|---|---|
| Failed trigger | Skill should have activated but the description didn't match the query |
| False trigger | Skill activated when it should have stayed silent |
The optimization loop treats this as a search problem: given a set of labeled test queries (should-trigger and should-not-trigger), find the description text that maximizes pass rate across both categories.
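The objective can be stated compactly. A minimal sketch, assuming each result entry carries the `should_trigger` and `pass` fields described below (the function name `pass_rate` is hypothetical, not from the scripts):

```python
def pass_rate(results: list[dict]) -> float:
    """Fraction of labeled queries whose observed behavior matched the label.

    A query "passes" when the skill triggered if and only if it was supposed to,
    so a single score covers both failure modes at once.
    """
    if not results:
        return 0.0
    return sum(r["pass"] for r in results) / len(results)

# The loop searches for the description text maximizing this score.
results = [
    {"query": "make a PDF report", "should_trigger": True, "pass": True},
    {"query": "what's the weather", "should_trigger": False, "pass": True},
    {"query": "export my slides", "should_trigger": True, "pass": False},
]
score = pass_rate(results)  # 2 of 3 queries behaved as labeled
```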
| File | Role |
|---|---|
| skills/skill-creator/scripts/improve_description.py | Core optimization logic; calls the Claude API with extended thinking |
| skills/skill-creator/scripts/generate_report.py | Renders the iteration history as an HTML report |
| skills/skill-creator/assets/eval_review.html | Browser UI for building and exporting eval query sets |
Diagram: Description Optimization Data Flow
Sources: skills/skill-creator/scripts/improve_description.py1-248
improve_description.py — Core Function
improve_description() at skills/skill-creator/scripts/improve_description.py19-190 is the central entry point. Its parameters:
| Parameter | Type | Description |
|---|---|---|
| client | anthropic.Anthropic | Initialized API client |
| skill_name | str | Skill identifier (for prompt context) |
| skill_content | str | Full SKILL.md body (for semantic context) |
| current_description | str | The description being replaced |
| eval_results | dict | Output from run_eval.py |
| history | list[dict] | All previous description attempts |
| model | str | Claude model identifier |
| test_results | dict \| None | Results on held-out test queries |
| log_dir | Path \| None | Directory to write transcript logs |
| iteration | int \| None | Current iteration number (for log naming) |
Before constructing the prompt, the function splits evaluation results into two categories:
failed_triggers = results where should_trigger=True and pass=False
false_triggers = results where should_trigger=False and pass=False
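The categorization above amounts to two list comprehensions over the results. A sketch of that partitioning step, assuming the result shape documented on this page (the helper name `split_failures` is illustrative):

```python
def split_failures(eval_results: dict) -> tuple[list[dict], list[dict]]:
    """Partition failing queries into the two failure modes.

    failed_triggers: should have activated but didn't match the query.
    false_triggers:  activated when it should have stayed silent.
    """
    results = eval_results["results"]
    failed_triggers = [r for r in results if r["should_trigger"] and not r["pass"]]
    false_triggers = [r for r in results if not r["should_trigger"] and not r["pass"]]
    return failed_triggers, false_triggers

eval_results = {"results": [
    {"query": "fill this PDF form", "should_trigger": True, "pass": False},
    {"query": "summarize this email", "should_trigger": False, "pass": False},
    {"query": "merge these PDFs", "should_trigger": True, "pass": True},
]}
failed, false_ = split_failures(eval_results)
```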
skills/skill-creator/scripts/improve_description.py32-39
Each entry in eval_results["results"] carries:
| Field | Type | Meaning |
|---|---|---|
| query | str | The test query string |
| should_trigger | bool | Expected behavior |
| pass | bool | Whether the description passed this query |
| triggers | int | How many runs triggered (out of runs) |
| runs | int | Total runs executed for this query |
The triggers/runs ratio appears in the prompt so Claude understands probabilistic failure (e.g., triggered 1/3 times) versus complete failure (0/3).
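Concretely, one result entry might look like the following (field names from the table above; the query text and counts are invented for illustration):

```python
# Illustrative eval result entry: a flaky match, not a total miss.
result = {
    "query": "extract tables from this PDF",
    "should_trigger": True,
    "pass": False,
    "triggers": 1,  # triggered on 1 of 3 runs
    "runs": 3,
}
# The ratio shown in the prompt distinguishes probabilistic failure (1/3)
# from complete failure (0/3).
ratio = f"{result['triggers']}/{result['runs']}"
```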
The prompt sent to Claude at skills/skill-creator/scripts/improve_description.py49-112 is assembled in sections:
Diagram: Prompt Assembly Sections
Sources: skills/skill-creator/scripts/improve_description.py49-112
Key constraints communicated to Claude in the prompt:
skills/skill-creator/scripts/improve_description.py114-121
client.messages.create(
model=model,
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 10000},
messages=[{"role": "user", "content": prompt}],
)
Extended thinking is deliberately enabled: the model reasons about why past attempts succeeded or failed before generating a new candidate. The thinking block is captured and written to the transcript log but is not parsed for the description itself.
The response is parsed for content between <new_description> and </new_description> tags:
skills/skill-creator/scripts/improve_description.py133-135
If the extracted description exceeds 1024 characters (the hard limit imposed by the SKILL.md format), a follow-up API call is made in the same conversation thread:
skills/skill-creator/scripts/improve_description.py149-181
The follow-up prompt instructs the model to shorten the text while preserving the most important trigger words and intent coverage. This is a three-turn conversation: the original prompt, the model's long response, then the shortening request. Extended thinking is also enabled on the rewrite call.
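The three-turn structure can be sketched as a messages list; the wording of the shortening request here is a paraphrase, not the script's exact prompt:

```python
def build_rewrite_messages(original_prompt: str, long_response: str,
                           limit: int = 1024) -> list[dict]:
    """Assemble the three-turn rewrite conversation described above."""
    return [
        {"role": "user", "content": original_prompt},
        {"role": "assistant", "content": long_response},  # the over-length attempt
        {"role": "user", "content": (
            f"Your description exceeds {limit} characters. Shorten it while "
            "preserving the most important trigger words and intent coverage. "
            "Reply in <new_description> tags."
        )},
    ]

messages = build_rewrite_messages("<original prompt>", "<over-length response>")
```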
Diagram: Length Constraint Handling
Sources: skills/skill-creator/scripts/improve_description.py149-181
Each call to improve_description() writes a JSON transcript to log_dir/improve_iter_N.json. The transcript contains:
| Key | Contents |
|---|---|
| iteration | Iteration number |
| prompt | Full prompt sent to Claude |
| thinking | Extended thinking block text |
| response | Raw model text response |
| parsed_description | Description extracted from tags |
| char_count | Length of parsed description |
| over_limit | Boolean flag |
| rewrite_prompt | (if over limit) Second prompt |
| rewrite_thinking | (if over limit) Thinking from rewrite |
| rewrite_response | (if over limit) Raw rewrite response |
| rewrite_description | (if over limit) Shortened text |
| rewrite_char_count | (if over limit) Final length |
| final_description | The description actually returned |
skills/skill-creator/scripts/improve_description.py138-188
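The logging mechanism is a straightforward JSON dump keyed by iteration. A minimal sketch, assuming the transcript keys from the table above (the helper name `write_transcript` is hypothetical):

```python
import json
from pathlib import Path

def write_transcript(log_dir: Path, iteration: int, record: dict) -> Path:
    """Persist one iteration's transcript as improve_iter_N.json."""
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"improve_iter_{iteration}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

record = {
    "iteration": 2,
    "prompt": "...",
    "parsed_description": "Use when working with PDFs.",
    "char_count": 27,
    "over_limit": False,
    "final_description": "Use when working with PDFs.",
}
```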
At the end of each iteration, the caller appends the current attempt to the history list. The structure stored per history entry is:
skills/skill-creator/scripts/improve_description.py232-243
When the history is passed back into the next iteration's prompt, each past attempt is rendered as an <attempt> XML block showing the description, train/test scores, and per-query pass/fail status. This prevents the model from cycling back to previously tried phrasings.
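The exact XML shape of those blocks is not shown on this page; the following is only a guess at how one past attempt might be rendered, to illustrate the idea of surfacing prior scores to the model:

```python
def render_attempt(entry: dict) -> str:
    """Render one past attempt as an <attempt> block (format is a guess)."""
    lines = [f"  <description>{entry['description']}</description>",
             f"  <train_score>{entry['train_score']}</train_score>"]
    if entry.get("test_score") is not None:
        lines.append(f"  <test_score>{entry['test_score']}</test_score>")
    for q in entry.get("queries", []):
        status = "pass" if q["pass"] else "fail"
        lines.append(f'  <query result="{status}">{q["query"]}</query>')
    return "<attempt>\n" + "\n".join(lines) + "\n</attempt>"

block = render_attempt({
    "description": "Use when creating PDFs.",
    "train_score": "3/4",
    "test_score": None,
    "queries": [{"query": "export my slides", "pass": False}],
})
```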
The optimization loop supports a holdout set to detect overfitting. Queries in eval_results are used for training (the model sees these failures when generating improvements). Queries in test_results are a held-out set the optimizer never trains on directly — they are reported as a separate score.
generate_report.py names this the holdout count and tracks it in data["holdout"].
Diagram: Train/Test Query Flow
Sources: skills/skill-creator/scripts/generate_report.py19-31 skills/skill-creator/scripts/improve_description.py42-45
generate_report.py — Iteration History Visualization
generate_report.py renders the full run loop history as a static HTML page. The generate_html() function at skills/skill-creator/scripts/generate_report.py16-301 takes:
| Parameter | Type | Description |
|---|---|---|
| data | dict | Full run loop output JSON |
| auto_refresh | bool | Adds <meta http-equiv="refresh" content="5"> for live updates |
| skill_name | str | Used in page title |
The HTML table has one row per iteration. Columns are:
| Column | Contents |
|---|---|
| Iter | Iteration number |
| Train | correct/total_runs across all train queries |
| Test | correct/total_runs across all test (holdout) queries |
| Description | The candidate description text |
| (per train query) | ✓ or ✗ with triggers/runs sub-label |
| (per test query) | Same, visually distinguished with background color |
Score cell CSS classes (score-good, score-ok, score-bad) are assigned by thresholds: ≥ 80% correct → good, ≥ 50% → ok, otherwise bad.
skills/skill-creator/scripts/generate_report.py244-256
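The threshold logic reduces to a small classifier. A sketch using the cutoffs stated above (the function name `score_class` is illustrative):

```python
def score_class(correct: int, total: int) -> str:
    """Map a correct/total ratio to the report's CSS class."""
    if total == 0:
        return "score-bad"
    ratio = correct / total
    if ratio >= 0.8:
        return "score-good"
    if ratio >= 0.5:
        return "score-ok"
    return "score-bad"
```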
The best-performing iteration (by test score if available, otherwise train score) receives the best-row CSS class for visual highlighting. skills/skill-creator/scripts/generate_report.py206-210
generate_report.py <input.json> [-o output.html] [--skill-name NAME] [--auto-refresh]
Pass - as input to read from stdin. When --auto-refresh is active, the page reloads every 5 seconds, making it useful while a run loop is still executing.
Sources: skills/skill-creator/scripts/generate_report.py304-326
eval_review.html at skills/skill-creator/assets/eval_review.html1-147 is a browser-based tool for building and editing the labeled query set that feeds into the optimization loop.
Features:
| Feature | Implementation |
|---|---|
| Add query | addRow() — appends a blank entry, focuses the new textarea |
| Toggle should-trigger | updateTrigger() — toggles should_trigger boolean, re-renders |
| Edit query text | updateQuery() — updates in-memory array on change |
| Delete query | deleteRow() — splices by original index |
| Export | exportEvalSet() — downloads eval_set.json as a Blob |
The export format is a JSON array of {query, should_trigger} objects, which is the format consumed by run_eval.py.
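A sketch of that export format and a validation check for it (the validator is illustrative, not part of the scripts):

```python
import json

# Illustrative eval_set.json content as consumed by run_eval.py.
eval_set = [
    {"query": "fill out this PDF form", "should_trigger": True},
    {"query": "what's the capital of France?", "should_trigger": False},
]

def validate_eval_set(entries: list) -> bool:
    """Check each entry has exactly the two expected fields with the right types."""
    return all(
        set(e) == {"query", "should_trigger"}
        and isinstance(e["query"], str)
        and isinstance(e["should_trigger"], bool)
        for e in entries
    )

# Round-trip through JSON as the browser export would.
loaded = json.loads(json.dumps(eval_set))
```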
The page is rendered server-side with placeholders (__SKILL_NAME_PLACEHOLDER__, __SKILL_DESCRIPTION_PLACEHOLDER__, __EVAL_DATA_PLACEHOLDER__) replaced before serving.
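The substitution mechanism amounts to string replacement on the template before serving. A minimal sketch using the placeholder names above (the function name `render_page` is hypothetical):

```python
def render_page(template: str, skill_name: str,
                description: str, eval_json: str) -> str:
    """Fill the three placeholders in eval_review.html before serving."""
    return (template
            .replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
            .replace("__SKILL_DESCRIPTION_PLACEHOLDER__", description)
            .replace("__EVAL_DATA_PLACEHOLDER__", eval_json))

html = render_page("<h1>__SKILL_NAME_PLACEHOLDER__</h1>", "pdf-builder", "d", "[]")
```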
Sources: skills/skill-creator/assets/eval_review.html62-144
Diagram: Core Data Structures
Sources: skills/skill-creator/scripts/improve_description.py32-45 skills/skill-creator/scripts/generate_report.py19-31 skills/skill-creator/scripts/improve_description.py138-188