This page documents the eval viewer subsystem inside skill-creator: the generate_review.py server, the viewer.html review interface, and the eval_review.html eval-set builder. Together these tools allow a developer to inspect evaluation run outputs, record qualitative feedback, and compare runs across skill iterations.
For the upstream pipeline that produces the run outputs and grading data consumed here, see Evaluation Pipeline. For the benchmark statistics that appear in the viewer's Benchmark tab, see Benchmarking and Reporting. For the full skill creation lifecycle that calls the viewer, see Skill Creator Workflow.
Component-to-Code Map
Sources: skills/skill-creator/eval-viewer/generate_review.py1-471 skills/skill-creator/eval-viewer/viewer.html1-1326 skills/skill-creator/assets/eval_review.html1-147
The script is invoked directly by skill-creator after evaluation runs complete:
python generate_review.py <workspace-path> [options]
| Flag | Default | Purpose |
|---|---|---|
| --port / -p | 3117 | HTTP server port |
| --skill-name / -n | derived from workspace dir name | Header label in the UI |
| --previous-workspace | none | Path to a prior iteration workspace for comparison |
| --benchmark | none | Path to benchmark.json for the Benchmark tab |
| --static / -s | none | Write standalone HTML to a file instead of serving |
If a process is already listening on the target port, _kill_port() uses lsof to terminate it before binding. If the port is still unavailable after the kill, the server falls back to an OS-assigned port.
Sources: skills/skill-creator/eval-viewer/generate_review.py387-467
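The kill-and-fallback behavior can be sketched as follows. This is a minimal illustration, not the real module's code; the helper names and the exact lsof invocation are assumptions.

```python
import socket
import subprocess

def kill_port(port: int) -> None:
    # Illustrative version of the lsof-based cleanup: list the PIDs
    # listening on the port (-t = terse PID output, -i = network files),
    # then terminate each one.
    result = subprocess.run(
        ["lsof", "-ti", f"tcp:{port}"], capture_output=True, text=True
    )
    for pid in result.stdout.split():
        subprocess.run(["kill", pid])

def bind_with_fallback(port: int) -> socket.socket:
    # If the requested port is still busy after the kill attempt,
    # bind to port 0 so the OS assigns a free port instead.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
    except OSError:
        sock.close()
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind(("127.0.0.1", 0))
    return sock
```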
find_runs() recursively walks a workspace directory looking for any subdirectory that contains an outputs/ child. It skips node_modules, .git, __pycache__, skill, and inputs. The result is sorted by (eval_id, run_id).
Workspace Directory Traversal
Sources: skills/skill-creator/eval-viewer/generate_review.py60-146
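The traversal described above can be sketched with os.walk; this is an illustrative re-creation, not the real implementation:

```python
import os

SKIP_DIRS = {"node_modules", ".git", "__pycache__", "skill", "inputs"}

def find_runs(workspace: str) -> list[str]:
    # Any directory containing an outputs/ child counts as a run.
    # Skipped directories are pruned in place so os.walk never descends
    # into them.
    runs = []
    for root, dirs, _files in os.walk(workspace):
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        if "outputs" in dirs:
            runs.append(os.path.relpath(root, workspace))
    return sorted(runs)
```

Sorting the relative paths approximates the (eval_id, run_id) ordering when runs live at workspace/&lt;eval_id&gt;/&lt;run_id&gt;/.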
embed_file() classifies each output file and produces a typed dict for the browser. The id field is the run's path relative to the workspace root, with / replaced by -.
| File type | Condition | Dict fields |
|---|---|---|
| Text | extension in TEXT_EXTENSIONS | type: "text", content (str) |
| Image | extension in IMAGE_EXTENSIONS | type: "image", mime, data_uri (base64) |
| PDF | .pdf | type: "pdf", data_uri (base64) |
| XLSX | .xlsx | type: "xlsx", data_b64 (base64) |
| Other binary | anything else | type: "binary", mime, data_uri (base64) |
TEXT_EXTENSIONS covers .txt, .md, .json, .csv, .py, .js, .ts, .tsx, .jsx, .yaml, .yml, .xml, .html, .css, .sh, .rb, .go, .rs, .java, .c, .cpp, .h, .hpp, .sql, .r, .toml.
Sources: skills/skill-creator/eval-viewer/generate_review.py149-211
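The classification can be sketched as below. This is a simplified stand-in for the real function: the TEXT_EXTENSIONS set here is a subset of the full list above, and the IMAGE_EXTENSIONS set is an assumption.

```python
import base64
import mimetypes
from pathlib import Path

TEXT_EXTENSIONS = {".txt", ".md", ".json", ".csv", ".py", ".js", ".ts",
                   ".yaml", ".yml", ".xml", ".html", ".css", ".sh"}  # subset
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg"}  # assumed

def embed_file(path: Path) -> dict:
    # Classify by extension and return the typed dict the browser expects.
    ext = path.suffix.lower()
    if ext in TEXT_EXTENSIONS:
        return {"type": "text", "content": path.read_text(errors="replace")}
    b64 = base64.b64encode(path.read_bytes()).decode()
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    if ext in IMAGE_EXTENSIONS:
        return {"type": "image", "mime": mime,
                "data_uri": f"data:{mime};base64,{b64}"}
    if ext == ".pdf":
        return {"type": "pdf", "data_uri": f"data:application/pdf;base64,{b64}"}
    if ext == ".xlsx":
        return {"type": "xlsx", "data_b64": b64}
    return {"type": "binary", "mime": mime,
            "data_uri": f"data:{mime};base64,{b64}"}
```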
generate_html() reads viewer.html and replaces the literal marker /*__EMBEDDED_DATA__*/ with a JavaScript assignment:
const EMBEDDED_DATA = <json blob>;
The JSON blob has this shape:
{
"skill_name": str,
"runs": [ run_dict, ... ],
"previous_feedback": { run_id: feedback_str },
"previous_outputs": { run_id: [file_dict, ...] },
"benchmark": { ... } // only if --benchmark provided
}
previous_feedback and previous_outputs are populated by load_previous_iteration(), which reads feedback.json from the previous workspace and re-scans that workspace's runs.
Sources: skills/skill-creator/eval-viewer/generate_review.py250-281
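The marker substitution itself is a single string replacement; a minimal sketch (the function name here is illustrative):

```python
import json

def inject_embedded_data(template: str, data: dict) -> str:
    # Replace the literal /*__EMBEDDED_DATA__*/ marker with a JS
    # assignment so viewer.html can read EMBEDDED_DATA at load time.
    return template.replace(
        "/*__EMBEDDED_DATA__*/",
        f"const EMBEDDED_DATA = {json.dumps(data)};",
    )
```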
ReviewHandler extends BaseHTTPRequestHandler and exposes two routes:
| Route | Method | Behavior |
|---|---|---|
| / or /index.html | GET | Re-scans the workspace, regenerates the HTML, and serves it, so a browser refresh picks up new eval outputs without restarting the server. |
| /api/feedback | GET | Returns the current feedback.json contents. |
| /api/feedback | POST | Validates that the JSON body has a reviews key, then writes it to feedback.json. |
Request logging is suppressed via log_message() to keep the terminal output clean.
Sources: skills/skill-creator/eval-viewer/generate_review.py308-384
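An illustrative skeleton of this routing, with an in-memory dict standing in for feedback.json on disk (the real handler also re-scans the workspace and regenerates the full HTML on every GET of /):

```python
import json
from http.server import BaseHTTPRequestHandler

FEEDBACK = {"reviews": {}}  # stands in for feedback.json on disk

class ReviewHandlerSketch(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in ("/", "/index.html"):
            # The real handler re-scans and regenerates the viewer here.
            self._send(200, b"<html>regenerated</html>", "text/html")
        elif self.path == "/api/feedback":
            self._send(200, json.dumps(FEEDBACK).encode(), "application/json")
        else:
            self._send(404, b"not found", "text/plain")

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        if self.path == "/api/feedback" and "reviews" in body:
            FEEDBACK.update(body)  # the real handler writes feedback.json
            self._send(200, b"ok", "text/plain")
        else:
            self._send(400, b"expected a 'reviews' key", "text/plain")

    def _send(self, status, body, ctype):
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging to keep terminal output clean.
        pass
```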
viewer.html is a self-contained single-page application served by ReviewHandler. It relies on EMBEDDED_DATA injected at serve time and uses the SheetJS library (loaded from CDN) for XLSX rendering.
viewer.html UI Sections
Sources: skills/skill-creator/eval-viewer/viewer.html543-631
init() loads any existing feedback.json from the server (via GET /api/feedback), unless previous_feedback or previous_outputs data is present — in that case the on-disk feedback.json is considered stale from the prior iteration and is not pre-filled.
showRun(index) is the central rendering function. It:
- Updates the run counter ("N of M").
- Parses the run's id string (regex match on with_skill|without_skill|new_skill|old_skill) and shows a color-coded .config-badge.
- Calls renderOutputs(), renderPrevOutputs(), and renderGrades().
- Restores any saved feedback for the run from feedbackMap and, if present, shows the prior iteration's feedback from EMBEDDED_DATA.previous_feedback.
- Records the run in visitedRuns; once all runs are visited, the "Submit All Reviews" button gains the .ready CSS class.

Arrow key navigation (left/right/up/down) is captured via document.addEventListener("keydown") and delegates to navigate(delta).
Sources: skills/skill-creator/eval-viewer/viewer.html656-761
renderOutputs() iterates run.outputs and renders each file according to its type field:
| type | Renderer |
|---|---|
| "text" | <pre> with textContent |
| "image" | <img src=data_uri> |
| "pdf" | <iframe src=data_uri> (600px height) |
| "xlsx" | renderXlsx() via SheetJS |
| "binary" | <a download> link |
| "error" | <pre> in red |
Every file gets a header bar with the filename and a "Download" anchor pointing to getDownloadUri(file).
renderXlsx() decodes file.data_b64 using atob, constructs a Uint8Array, calls XLSX.read(), then XLSX.utils.sheet_to_html() per sheet. Multi-sheet workbooks display a label per sheet.
Sources: skills/skill-creator/eval-viewer/viewer.html763-855
renderGrades() reads run.grading (a grading.json loaded by build_run()). The section is hidden when no grading data exists, and collapsed by default when it does. It shows:
- The overall pass rate (summary.pass_rate), colored green/neutral/red (≥80% / 50–80% / <50%).
- A "passed / failed of total" summary line.
- The expectations[] entries, each showing a pass/fail icon, the expectation text, and an evidence sub-line.

Sources: skills/skill-creator/eval-viewer/viewer.html857-912
When --previous-workspace is supplied to generate_review.py, load_previous_iteration() populates EMBEDDED_DATA.previous_outputs and EMBEDDED_DATA.previous_feedback.
renderPrevOutputs() renders the previous iteration's output files into a collapsible "Previous Output" section using the same rendering logic as renderOutputs(). This lets a reviewer compare the current run's outputs side-by-side (above/below) with those from the prior skill version.
Previous iteration feedback appears below the current feedback textarea as a read-only block labeled "Previous feedback".
Sources: skills/skill-creator/eval-viewer/viewer.html914-991 skills/skill-creator/eval-viewer/generate_review.py213-247
renderBenchmark() is called once at page load. If EMBEDDED_DATA.benchmark exists, it shows the "Benchmark" tab in .view-tabs.
It renders two layers of data:
- A summary table: mean ± stddev for each configuration (dynamically discovered from run_summary keys, excluding "delta"), plus a delta column.
- Per-eval detail: for each eval_id, a table of per-run pass rates per configuration with averages. If expectations arrays are present, a per-assertion detail table is appended showing pass/fail icons per run.

The benchmark.json structure consumed here is documented in detail in Benchmarking and Reporting.
Sources: skills/skill-creator/eval-viewer/viewer.html1113-1318
Feedback Data Flow
The in-memory feedbackMap is a plain object { run_id: feedback_text }. On final submit via showDoneDialog(), all runs are included — even those with no feedback — so that the consuming agent can distinguish "reviewed and looks fine" (empty string) from "not yet reviewed".
feedback.json schema:
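The exact on-disk schema is not reproduced here. Based on the reviews-key validation in /api/feedback and the { run_id: feedback_text } shape of feedbackMap, it is plausibly a single reviews object (run ids and feedback strings below are invented examples):

```json
{
  "reviews": {
    "eval1-run1": "Output looks correct but the table formatting is off.",
    "eval1-run2": ""
  }
}
```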
Sources: skills/skill-creator/eval-viewer/viewer.html993-1067 skills/skill-creator/eval-viewer/generate_review.py361-378
eval_review.html is a standalone HTML asset stored in skills/skill-creator/assets/. It is separate from the eval viewer; its purpose is to build and edit the evals.json trigger/no-trigger eval set for a skill, not to review evaluation run outputs.
skill-creator populates three template placeholders before serving or opening the file:
| Placeholder | Replaced with |
|---|---|
__SKILL_NAME_PLACEHOLDER__ | Skill name string |
__SKILL_DESCRIPTION_PLACEHOLDER__ | Current SKILL.md description |
__EVAL_DATA_PLACEHOLDER__ | JSON array of {query, should_trigger} objects |
Sources: skills/skill-creator/assets/eval_review.html6-63
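The substitution is plain string replacement; a hypothetical Python sketch of the step skill-creator performs (the function name is illustrative):

```python
import json

def fill_eval_review_template(template: str, skill_name: str,
                              description: str, evals: list[dict]) -> str:
    # Replace the three placeholders with the skill's name, its
    # SKILL.md description, and the eval set as a JSON array.
    return (template
            .replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
            .replace("__SKILL_DESCRIPTION_PLACEHOLDER__", description)
            .replace("__EVAL_DATA_PLACEHOLDER__", json.dumps(evals)))
```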
The page renders a two-column table sorted into two groups: "Should Trigger" rows first, "Should NOT Trigger" rows second.
| UI element | Behavior |
|---|---|
| Query textarea (.query-input) | Edits the query text; changes update evalItems[] via updateQuery() |
| Toggle switch (.toggle) | Flips the should_trigger boolean; calls updateTrigger(), which re-renders to reorder the groups |
| Delete button (.btn-delete) | Removes the row via deleteRow() |
| "+ Add Query" button | Appends {query: '', should_trigger: true} and focuses the new textarea |
| "Export Eval Set" button | Calls exportEvalSet(), filters out empty queries, and downloads eval_set.json |
exportEvalSet() serializes the current evalItems array (excluding blank queries) to a JSON file download. The exported format matches the evals.json format consumed by the evaluation pipeline.
Sources: skills/skill-creator/assets/eval_review.html62-146
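The export step can be mirrored in a few lines; this is an assumed re-creation of the filtering and serialization, not the page's actual JavaScript:

```python
import json

def export_eval_set(eval_items: list[dict]) -> str:
    # Drop rows whose query is blank, serialize the rest in the
    # {query, should_trigger} format the evaluation pipeline consumes.
    kept = [item for item in eval_items if item["query"].strip()]
    return json.dumps(kept, indent=2)
```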
When --static <path> is passed to generate_review.py, generate_html() is called once and written to the specified file. No HTTP server is started. In this mode, viewer.html's saveCurrentFeedback() handles fetch failures gracefully — it shows "Will download on submit" status text, and showDoneDialog() falls back to a Blob download of feedback.json instead of a POST request.
Sources: skills/skill-creator/eval-viewer/generate_review.py431-436 skills/skill-creator/eval-viewer/viewer.html1018-1023 skills/skill-creator/eval-viewer/viewer.html1049-1059