This page documents the three evaluation agents inside skill-creator and how they chain together into a single pipeline: the Grader, the Blind Comparator, and the Post-hoc Analyzer. Each agent is defined by a prompt file in skills/skill-creator/agents/ and produces a structured JSON artifact consumed by the next stage.
For the broader workflow that orchestrates this pipeline (parallel executor runs, iteration, packaging), see Skill Creator Workflow. For how the pipeline's outputs are aggregated into statistics and visualized, see Benchmarking and Reporting and Eval Viewer and Feedback Collection.
Each evaluation run produces a set of artifacts in a directory hierarchy. The three agents operate sequentially on those artifacts:
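The chain can be sketched as three stages, each writing a JSON artifact that the next stage reads. This is a minimal sketch under stated assumptions: the function body and artifact contents are illustrative stand-ins, not the actual agent prompts or schemas.

```python
import json
from pathlib import Path

def run_eval_pipeline(run_dir: Path) -> dict:
    """Sketch of the per-eval chain: Grader -> Blind Comparator -> Analyzer.

    Each stage writes a JSON artifact consumed by the next; the stage
    internals here are placeholders for the real agent invocations.
    """
    # Stage 1: the Grader inspects executor outputs and writes grading.json.
    grading = {"summary": {"passed": 3, "failed": 1, "total": 4, "pass_rate": 0.75}}
    (run_dir / "grading.json").write_text(json.dumps(grading))

    # Stage 2: the Blind Comparator judges outputs A vs B and writes comparison.json.
    comparison = {"winner": "A", "reasoning": "A is more complete."}
    (run_dir / "comparison.json").write_text(json.dumps(comparison))

    # Stage 3: the Analyzer unblinds the comparison and writes analysis.json.
    analysis = {"comparison_summary": {"winner": comparison["winner"]},
                "improvement_suggestions": []}
    (run_dir / "analysis.json").write_text(json.dumps(analysis))
    return analysis
```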
Pipeline: Agent Chain and Artifact Flow
Sources: skills/skill-creator/agents/grader.md:1-50, skills/skill-creator/agents/comparator.md:1-30, skills/skill-creator/agents/analyzer.md:1-30
File: skills/skill-creator/agents/grader.md
The Grader is the first agent to run after each executor subagent completes. It has two responsibilities:
| Parameter | Description |
|---|---|
| expectations | List of assertion strings from evals.json |
| transcript_path | Path to outputs/transcript.md |
| outputs_dir | Directory containing output files |
The Grader follows an eight-step process defined in skills/skill-creator/agents/grader.md:19-107. Key behaviors:

- Verifies actual files in outputs_dir; does not rely solely on what the transcript claims was produced.
- Grades each expectation PASS or FAIL with a mandatory evidence citation.
- Reads user_notes.md — an optional file the executor may write to flag uncertainties or workarounds.
- Writes grading.json — output saved as a sibling of outputs_dir.
- Incorporates metrics.json and timing.json — execution statistics are folded into the grading output.

Each expectation receives one of two verdicts:

| Verdict | Condition |
|---|---|
| PASS | Clear evidence in transcript or outputs; evidence reflects genuine substance, not surface compliance |
| FAIL | No evidence, contradictory evidence, unverifiable assertion, or superficially satisfied assertion |
The burden of proof is on the assertion: when uncertain, the Grader fails the expectation.
grading.json
├── expectations[] # Graded assertions (text, passed, evidence)
├── summary # passed / failed / total / pass_rate
├── execution_metrics # tool_calls, total_steps, output_chars (from metrics.json)
├── timing # executor_duration_seconds, total_duration_seconds (from timing.json)
├── claims[] # Extracted claims (claim, type, verified, evidence)
├── user_notes_summary # uncertainties / needs_review / workarounds
└── eval_feedback # (optional) suggestions[] + overall assessment
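The summary block is derivable from the graded expectations. A minimal sketch of that bookkeeping, with field names following the tree above (the example assertions are illustrative):

```python
# Graded assertions as they would appear in grading.json's expectations[].
expectations = [
    {"text": "Report contains a revenue table", "passed": True,
     "evidence": "outputs/report.md lines 10-24"},
    {"text": "All figures cite a source", "passed": False,
     "evidence": "Figure 2 has no citation"},
]

# Derive the summary block: passed / failed / total / pass_rate.
passed = sum(1 for e in expectations if e["passed"])
summary = {
    "passed": passed,
    "failed": len(expectations) - passed,
    "total": len(expectations),
    "pass_rate": passed / len(expectations),
}
```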
Sources: skills/skill-creator/agents/grader.md:1-215, skills/skill-creator/references/schemas.md:88-160
File: skills/skill-creator/agents/comparator.md
The Blind Comparator receives the two execution outputs — labeled A and B — without knowing which skill produced which. This blinding prevents bias toward a known baseline or a known candidate skill. It judges purely on output quality.
| Parameter | Description |
|---|---|
| output_a_path | Path to output file or directory from run A |
| output_b_path | Path to output file or directory from run B |
| eval_prompt | The original task prompt from evals.json |
| expectations | (Optional) assertion list for secondary scoring |
The Comparator follows a seven-step process defined in skills/skill-creator/agents/comparator.md:22-90. Key behaviors:

- Judges both outputs against the original task in eval_prompt.
- Declares TIE when neither output is clearly better.
- Writes its result to comparison.json.

The rubric has two fixed dimensions with task-adapted criteria:
Content Rubric (what the output contains):
| Criterion | 1 — Poor | 3 — Acceptable | 5 — Excellent |
|---|---|---|---|
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All elements present |
| Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |
Structure Rubric (how the output is organized):
| Criterion | 1 — Poor | 3 — Acceptable | 5 — Excellent |
|---|---|---|---|
| Organization | Disorganized | Reasonably organized | Clear, logical structure |
| Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
| Usability | Difficult to use | Usable with effort | Easy to use |
For task-specific evaluations (e.g., PDF form filling), the Comparator replaces these defaults with relevant criteria such as "Field alignment" or "Schema correctness".
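The scoring arithmetic can be sketched as follows. Per the schema, content_score and structure_score are averages of their 1-5 criteria; the exact scaling of overall_score to 1-10 is an assumption here (the sum of the two averages, which lands in the 2-10 range), since the agent prompt defines the real formula.

```python
def score_side(content: dict, structure: dict) -> dict:
    """Average each rubric dimension (criteria scored 1-5), then combine.

    The combination below (sum of the two averages) is an assumed
    scaling into roughly the 1-10 range, not the prompt's exact formula.
    """
    content_score = sum(content.values()) / len(content)
    structure_score = sum(structure.values()) / len(structure)
    return {
        "content_score": content_score,
        "structure_score": structure_score,
        "overall_score": content_score + structure_score,
    }

# Rubric scores for one side (criteria names follow the default rubric).
side_a = score_side(
    {"correctness": 5, "completeness": 4, "accuracy": 4},
    {"organization": 4, "formatting": 3, "usability": 4},
)
```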
comparison.json
├── winner # "A", "B", or "TIE"
├── reasoning # Plain-text explanation of decision
├── rubric
│ ├── A
│ │ ├── content # { correctness, completeness, accuracy } (1-5 each)
│ │ ├── structure # { organization, formatting, usability } (1-5 each)
│ │ ├── content_score # average of content criteria
│ │ ├── structure_score # average of structure criteria
│ │ └── overall_score # combined score scaled to 1-10
│ └── B # same structure
├── output_quality
│ ├── A # { score (1-10), strengths[], weaknesses[] }
│ └── B # same structure
└── expectation_results # (only if expectations provided)
├── A # { passed, total, pass_rate, details[] }
└── B
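Deriving the Analyzer's unblinded inputs from comparison.json plus the A/B-to-skill mapping (which was hidden from the Comparator) can be sketched as follows; the paths here are illustrative:

```python
comparison = {"winner": "A", "reasoning": "A's form fields align correctly."}

# This mapping was withheld from the Blind Comparator; the harness knows it.
skill_for = {"A": "skills/candidate/SKILL.md", "B": "skills/baseline/SKILL.md"}

if comparison["winner"] == "TIE":
    analyzer_inputs = None  # nothing to unblind on a tie
else:
    winner = comparison["winner"]
    loser = "B" if winner == "A" else "A"
    analyzer_inputs = {
        "winner": winner,
        "winner_skill_path": skill_for[winner],
        "loser_skill_path": skill_for[loser],
    }
```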
Sources: skills/skill-creator/agents/comparator.md:1-203, skills/skill-creator/references/schemas.md:309-379
File: skills/skill-creator/agents/analyzer.md
The Post-hoc Analyzer "unblinds" the comparison: it now knows which skill produced which output, and uses that knowledge to explain why the winner won and generate actionable improvement suggestions for the loser skill. This is the final stage of the per-eval pipeline.
The agent has a secondary role during benchmarking: it reads across all run results and surfaces patterns that aggregate statistics would hide (see Benchmarking and Reporting for that mode).
| Parameter | Description |
|---|---|
| winner | "A" or "B" (from comparison.json) |
| winner_skill_path | Path to the SKILL.md that produced the winning output |
| winner_transcript_path | Transcript for the winning run |
| loser_skill_path | Path to the SKILL.md that produced the losing output |
| loser_transcript_path | Transcript for the losing run |
| comparison_result_path | Path to comparison.json |
| output_path | Where to write analysis.json |
The Analyzer follows an eight-step process defined in skills/skill-creator/agents/analyzer.md:22-88, ending with the analysis written to analysis.json. Each improvement suggestion it emits carries a category and a priority:

| Category | Targets |
|---|---|
| instructions | Prose instruction changes in SKILL.md |
| tools | Scripts, templates, or utilities to add/modify |
| examples | Example inputs/outputs to include |
| error_handling | Guidance for failure cases |
| structure | Reorganization of skill content |
| references | External documentation to link |
| Priority | Meaning |
|---|---|
| high | Would likely change the win/loss outcome |
| medium | Improves quality but may not change the verdict |
| low | Marginal improvement |
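When acting on suggestions, high-priority items (those likely to flip the verdict) naturally come first. A trivial sketch of that ordering, with illustrative suggestion text:

```python
# Rank maps follow the priority table above: high before medium before low.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}

suggestions = [
    {"priority": "low", "category": "references", "suggestion": "Link the API docs"},
    {"priority": "high", "category": "instructions", "suggestion": "Spell out the output format"},
    {"priority": "medium", "category": "examples", "suggestion": "Add a worked example"},
]
suggestions.sort(key=lambda s: PRIORITY_RANK[s["priority"]])
```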
analysis.json
├── comparison_summary
│ ├── winner # "A" or "B"
│ ├── winner_skill # path to winning SKILL.md
│ ├── loser_skill # path to losing SKILL.md
│ └── comparator_reasoning
├── winner_strengths[] # Specific qualities that led to the win
├── loser_weaknesses[] # Specific deficiencies that caused the loss
├── instruction_following
│ ├── winner # { score (1-10), issues[] }
│ └── loser # { score (1-10), issues[] }
├── improvement_suggestions[]
│ ├── priority # "high" | "medium" | "low"
│ ├── category # see category table above
│ ├── suggestion # concrete change to make
│ └── expected_impact # why this would help
└── transcript_insights
├── winner_execution_pattern
└── loser_execution_pattern
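Reading the artifact back, for example to surface only the changes worth iterating on, might look like this (a sketch; field names follow the tree above, the values are illustrative):

```python
# A pared-down analysis.json as the Analyzer might write it.
analysis = {
    "comparison_summary": {"winner": "B",
                           "loser_skill": "skills/candidate/SKILL.md"},
    "improvement_suggestions": [
        {"priority": "high", "category": "error_handling",
         "suggestion": "Describe recovery when the PDF has no form fields",
         "expected_impact": "Avoids the failure mode that lost this eval"},
        {"priority": "low", "category": "structure",
         "suggestion": "Move examples above references",
         "expected_impact": "Minor readability gain"},
    ],
}

# Only high-priority suggestions are likely to change the verdict.
actionable = [s for s in analysis["improvement_suggestions"]
              if s["priority"] == "high"]
```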
Sources: skills/skill-creator/agents/analyzer.md:1-185, skills/skill-creator/references/schemas.md:384-430
The following diagram maps each agent prompt file to the artifacts it consumes and produces, using the exact file names from the codebase.
Agent–Artifact Mapping
Sources: skills/skill-creator/agents/grader.md:12-18, skills/skill-creator/agents/comparator.md:12-18, skills/skill-creator/agents/analyzer.md:12-19
The Grader's eval_feedback field closes a loop back to the eval author. It is only written when the Grader identifies a meaningful gap — not on every run.
The bar for writing feedback is described in skills/skill-creator/agents/grader.md:69-80.
The eval_feedback.suggestions array links each suggestion to the specific assertion it concerns (via an optional assertion field), or to a missing assertion (suggestion without an assertion key). This data is surfaced in the eval viewer (see Eval Viewer and Feedback Collection).
eval_feedback
├── suggestions[]
│ ├── assertion # (optional) the existing assertion this concerns
│ └── reason # why the assertion is weak or missing
└── overall # summary assessment; "No suggestions, evals look solid" if none
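The presence or absence of the assertion key is what distinguishes "this existing assertion is weak" from "an assertion is missing". Splitting the two can be sketched as follows (the suggestion contents are illustrative):

```python
# An eval_feedback block as the Grader might emit it.
eval_feedback = {
    "suggestions": [
        {"assertion": "Output mentions the deadline",
         "reason": "Satisfiable by a single word; too weak to verify substance"},
        {"reason": "No assertion checks that the generated file is non-empty"},
    ],
    "overall": "Two gaps worth addressing.",
}

# Suggestions with an assertion key target an existing weak assertion;
# those without one propose an assertion that is missing entirely.
weak_assertions = [s for s in eval_feedback["suggestions"] if "assertion" in s]
missing_assertions = [s for s in eval_feedback["suggestions"] if "assertion" not in s]
```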
Sources: skills/skill-creator/agents/grader.md:69-80, skills/skill-creator/references/schemas.md:140-160