This page documents the three evaluation agents inside skill-creator and how they chain together into a single pipeline: the Grader, the Blind Comparator, and the Post-hoc Analyzer. Each agent is defined by a prompt file in skills/skill-creator/agents/ and produces a structured JSON artifact consumed by the next stage.
For the broader workflow that orchestrates this pipeline (parallel executor runs, iteration, packaging), see Skill Creator Workflow. For how the pipeline's outputs are aggregated into statistics and visualized, see Benchmarking and Reporting and Eval Viewer and Feedback Collection.
Each evaluation run produces a set of artifacts in a directory hierarchy. The three agents operate sequentially on those artifacts:
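The chain can be sketched as three stages, each writing a JSON artifact that the next stage reads. This is a minimal sketch under stated assumptions: the function body and artifact contents are illustrative stand-ins, not the actual agent prompts or schemas.

```python
import json
from pathlib import Path

def run_eval_pipeline(run_dir: Path) -> dict:
    """Sketch of the per-eval chain: Grader -> Blind Comparator -> Analyzer.

    Each stage writes a JSON artifact consumed by the next; the stage
    internals here are placeholders for the real agent invocations.
    """
    # Stage 1: the Grader inspects executor outputs and writes grading.json.
    grading = {"summary": {"passed": 3, "failed": 1, "total": 4, "pass_rate": 0.75}}
    (run_dir / "grading.json").write_text(json.dumps(grading))

    # Stage 2: the Blind Comparator judges outputs A vs B and writes comparison.json.
    comparison = {"winner": "A", "reasoning": "A is more complete."}
    (run_dir / "comparison.json").write_text(json.dumps(comparison))

    # Stage 3: the Analyzer unblinds the comparison and writes analysis.json.
    analysis = {"comparison_summary": {"winner": comparison["winner"]},
                "improvement_suggestions": []}
    (run_dir / "analysis.json").write_text(json.dumps(analysis))
    return analysis
```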
Pipeline: Agent Chain and Artifact Flow
Sources: skills/skill-creator/agents/grader.md:1-50, skills/skill-creator/agents/comparator.md:1-30, skills/skill-creator/agents/analyzer.md:1-30
File: skills/skill-creator/agents/grader.md
The Grader is the first agent to run after each executor subagent completes. It has two responsibilities:
| Parameter | Description |
|---|---|
| expectations | List of assertion strings from evals.json |
| transcript_path | Path to outputs/transcript.md |
| outputs_dir | Directory containing output files |
The Grader follows an eight-step process defined in skills/skill-creator/agents/grader.md:19-107. Key behaviors:

- Verifies actual files in outputs_dir; does not rely solely on what the transcript claims was produced.
- Grades each expectation PASS or FAIL with a mandatory evidence citation.
- Reads user_notes.md — an optional file the executor may write to flag uncertainties or workarounds.
- Writes grading.json — output saved as a sibling of outputs_dir.
- Incorporates metrics.json and timing.json — execution statistics are folded into the grading output.

Each expectation receives one of two verdicts:

| Verdict | Condition |
|---|---|
| PASS | Clear evidence in transcript or outputs; evidence reflects genuine substance, not surface compliance |
| FAIL | No evidence, contradictory evidence, unverifiable assertion, or superficially satisfied assertion |
The burden of proof is on the assertion: when uncertain, the Grader fails the expectation.
grading.json
├── expectations[] # Graded assertions (text, passed, evidence)
├── summary # passed / failed / total / pass_rate
├── execution_metrics # tool_calls, total_steps, output_chars (from metrics.json)
├── timing # executor_duration_seconds, total_duration_seconds (from timing.json)
├── claims[] # Extracted claims (claim, type, verified, evidence)
├── user_notes_summary # uncertainties / needs_review / workarounds
└── eval_feedback # (optional) suggestions[] + overall assessment
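The summary block is derivable from the graded expectations. A minimal sketch of that bookkeeping, with field names following the tree above (the example assertions are illustrative):

```python
# Graded assertions as they would appear in grading.json's expectations[].
expectations = [
    {"text": "Report contains a revenue table", "passed": True,
     "evidence": "outputs/report.md lines 10-24"},
    {"text": "All figures cite a source", "passed": False,
     "evidence": "Figure 2 has no citation"},
]

# Derive the summary block: passed / failed / total / pass_rate.
passed = sum(1 for e in expectations if e["passed"])
summary = {
    "passed": passed,
    "failed": len(expectations) - passed,
    "total": len(expectations),
    "pass_rate": passed / len(expectations),
}
```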
Sources: skills/skill-creator/agents/grader.md:1-215, skills/skill-creator/references/schemas.md:88-160
File: skills/skill-creator/agents/comparator.md
The Blind Comparator receives the two execution outputs — labeled A and B — without knowing which skill produced which. This blinding prevents bias toward a known baseline or a known candidate skill. It judges purely on output quality.
| Parameter | Description |
|---|---|
| output_a_path | Path to output file or directory from run A |
| output_b_path | Path to output file or directory from run B |
| eval_prompt | The original task prompt from evals.json |
| expectations | (Optional) assertion list for secondary scoring |
The Comparator follows a seven-step process defined in skills/skill-creator/agents/comparator.md:22-90. Key behaviors:

- Judges both outputs against the original task in eval_prompt.
- Declares TIE when neither output is clearly better.
- Writes its result to comparison.json.

The rubric has two fixed dimensions with task-adapted criteria:
Content Rubric (what the output contains):
| Criterion | 1 — Poor | 3 — Acceptable | 5 — Excellent |
|---|---|---|---|
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All elements present |
| Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |
Structure Rubric (how the output is organized):
| Criterion | 1 — Poor | 3 — Acceptable | 5 — Excellent |
|---|---|---|---|
| Organization | Disorganized | Reasonably organized | Clear, logical structure |
| Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
| Usability | Difficult to use | Usable with effort | Easy to use |
For task-specific evaluations (e.g., PDF form filling), the Comparator replaces these defaults with relevant criteria such as "Field alignment" or "Schema correctness".
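The scoring arithmetic can be sketched as follows. Per the schema, content_score and structure_score are averages of their 1-5 criteria; the exact scaling of overall_score to 1-10 is an assumption here (the sum of the two averages, which lands in the 2-10 range), since the agent prompt defines the real formula.

```python
def score_side(content: dict, structure: dict) -> dict:
    """Average each rubric dimension (criteria scored 1-5), then combine.

    The combination below (sum of the two averages) is an assumed
    scaling into roughly the 1-10 range, not the prompt's exact formula.
    """
    content_score = sum(content.values()) / len(content)
    structure_score = sum(structure.values()) / len(structure)
    return {
        "content_score": content_score,
        "structure_score": structure_score,
        "overall_score": content_score + structure_score,
    }

# Rubric scores for one side (criteria names follow the default rubric).
side_a = score_side(
    {"correctness": 5, "completeness": 4, "accuracy": 4},
    {"organization": 4, "formatting": 3, "usability": 4},
)
```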
comparison.json
├── winner # "A", "B", or "TIE"
├── reasoning # Plain-text explanation of decision
├── rubric
│ ├── A
│ │ ├── content # { correctness, completeness, accuracy } (1-5 each)
│ │ ├── structure # { organization, formatting, usability } (1-5 each)
│ │ ├── content_score # average of content criteria
│ │ ├── structure_score # average of structure criteria
│ │ └── overall_score # combined score scaled to 1-10
│ └── B # same structure
├── output_quality
│ ├── A # { score (1-10), strengths[], weaknesses[] }
│ └── B # same structure
└── expectation_results # (only if expectations provided)
├── A # { passed, total, pass_rate, details[] }
└── B
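Deriving the Analyzer's unblinded inputs from comparison.json plus the A/B-to-skill mapping (which was hidden from the Comparator) can be sketched as follows; the paths here are illustrative:

```python
comparison = {"winner": "A", "reasoning": "A's form fields align correctly."}

# This mapping was withheld from the Blind Comparator; the harness knows it.
skill_for = {"A": "skills/candidate/SKILL.md", "B": "skills/baseline/SKILL.md"}

if comparison["winner"] == "TIE":
    analyzer_inputs = None  # nothing to unblind on a tie
else:
    winner = comparison["winner"]
    loser = "B" if winner == "A" else "A"
    analyzer_inputs = {
        "winner": winner,
        "winner_skill_path": skill_for[winner],
        "loser_skill_path": skill_for[loser],
    }
```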
Sources: skills/skill-creator/agents/comparator.md:1-203, skills/skill-creator/references/schemas.md:309-379
File: skills/skill-creator/agents/analyzer.md
The Post-hoc Analyzer "unblinds" the comparison: it now knows which skill produced which output, and uses that knowledge to explain why the winner won and generate actionable improvement suggestions for the loser skill. This is the final stage of the per-eval pipeline.
The agent has a secondary role during benchmarking: it reads across all run results and surfaces patterns that aggregate statistics would hide (see Benchmarking and Reporting for that mode).
| Parameter | Description |
|---|---|
| winner | "A" or "B" (from comparison.json) |
| winner_skill_path | Path to the SKILL.md that produced the winning output |
| winner_transcript_path | Transcript for the winning run |
| loser_skill_path | Path to the SKILL.md that produced the losing output |
| loser_transcript_path | Transcript for the losing run |
| comparison_result_path | Path to comparison.json |
| output_path | Where to write analysis.json |
The Analyzer follows an eight-step process defined in skills/skill-creator/agents/analyzer.md:22-88, ending with the analysis written to analysis.json. Each improvement suggestion it emits carries a category and a priority:

| Category | Targets |
|---|---|
| instructions | Prose instruction changes in SKILL.md |
| tools | Scripts, templates, or utilities to add/modify |
| examples | Example inputs/outputs to include |
| error_handling | Guidance for failure cases |
| structure | Reorganization of skill content |
| references | External documentation to link |
| Priority | Meaning |
|---|---|
| high | Would likely change the win/loss outcome |
| medium | Improves quality but may not change the verdict |
| low | Marginal improvement |
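When acting on suggestions, high-priority items (those likely to flip the verdict) naturally come first. A trivial sketch of that ordering, with illustrative suggestion text:

```python
# Rank maps follow the priority table above: high before medium before low.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}

suggestions = [
    {"priority": "low", "category": "references", "suggestion": "Link the API docs"},
    {"priority": "high", "category": "instructions", "suggestion": "Spell out the output format"},
    {"priority": "medium", "category": "examples", "suggestion": "Add a worked example"},
]
suggestions.sort(key=lambda s: PRIORITY_RANK[s["priority"]])
```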
analysis.json
├── comparison_summary
│ ├── winner # "A" or "B"
│ ├── winner_skill # path to winning SKILL.md
│ ├── loser_skill # path to losing SKILL.md
│ └── comparator_reasoning
├── winner_strengths[] # Specific qualities that led to the win
├── loser_weaknesses[] # Specific deficiencies that caused the loss
├── instruction_following
│ ├── winner # { score (1-10), issues[] }
│ └── loser # { score (1-10), issues[] }
├── improvement_suggestions[]
│ ├── priority # "high" | "medium" | "low"
│ ├── category # see category table above
│ ├── suggestion # concrete change to make
│ └── expected_impact # why this would help
└── transcript_insights
├── winner_execution_pattern
└── loser_execution_pattern
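Reading the artifact back, for example to surface only the changes worth iterating on, might look like this (a sketch; field names follow the tree above, the values are illustrative):

```python
# A pared-down analysis.json as the Analyzer might write it.
analysis = {
    "comparison_summary": {"winner": "B",
                           "loser_skill": "skills/candidate/SKILL.md"},
    "improvement_suggestions": [
        {"priority": "high", "category": "error_handling",
         "suggestion": "Describe recovery when the PDF has no form fields",
         "expected_impact": "Avoids the failure mode that lost this eval"},
        {"priority": "low", "category": "structure",
         "suggestion": "Move examples above references",
         "expected_impact": "Minor readability gain"},
    ],
}

# Only high-priority suggestions are likely to change the verdict.
actionable = [s for s in analysis["improvement_suggestions"]
              if s["priority"] == "high"]
```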
Sources: skills/skill-creator/agents/analyzer.md:1-185, skills/skill-creator/references/schemas.md:384-430
The following diagram maps each agent prompt file to the artifacts it consumes and produces, using the exact file names from the codebase.
Agent–Artifact Mapping
Sources: skills/skill-creator/agents/grader.md:12-18, skills/skill-creator/agents/comparator.md:12-18, skills/skill-creator/agents/analyzer.md:12-19
The Grader's eval_feedback field closes a loop back to the eval author. It is only written when the Grader identifies a meaningful gap — not on every run.
The bar for writing feedback is described in skills/skill-creator/agents/grader.md:69-80.
The eval_feedback.suggestions array links each suggestion to the specific assertion it concerns (via an optional assertion field), or to a missing assertion (suggestion without an assertion key). This data is surfaced in the eval viewer (see Eval Viewer and Feedback Collection).
eval_feedback
├── suggestions[]
│ ├── assertion # (optional) the existing assertion this concerns
│ └── reason # why the assertion is weak or missing
└── overall # summary assessment; "No suggestions, evals look solid" if none
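The presence or absence of the assertion key is what distinguishes "this existing assertion is weak" from "an assertion is missing". Splitting the two can be sketched as follows (the suggestion contents are illustrative):

```python
# An eval_feedback block as the Grader might emit it.
eval_feedback = {
    "suggestions": [
        {"assertion": "Output mentions the deadline",
         "reason": "Satisfiable by a single word; too weak to verify substance"},
        {"reason": "No assertion checks that the generated file is non-empty"},
    ],
    "overall": "Two gaps worth addressing.",
}

# Suggestions with an assertion key target an existing weak assertion;
# those without one propose an assertion that is missing entirely.
weak_assertions = [s for s in eval_feedback["suggestions"] if "assertion" in s]
missing_assertions = [s for s in eval_feedback["suggestions"] if "assertion" not in s]
```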
Sources: skills/skill-creator/agents/grader.md:69-80, skills/skill-creator/references/schemas.md:140-160