This page documents the eval viewer subsystem inside skill-creator: the generate_review.py server, the viewer.html review interface, and the eval_review.html eval-set builder. Together these tools allow a developer to inspect evaluation run outputs, record qualitative feedback, and compare runs across skill iterations.
For the upstream pipeline that produces the run outputs and grading data consumed here, see Evaluation Pipeline. For the benchmark statistics that appear in the viewer's Benchmark tab, see Benchmarking and Reporting. For the full skill creation lifecycle that calls the viewer, see Skill Creator Workflow.
Component-to-Code Map
Sources: skills/skill-creator/eval-viewer/generate_review.py1-471 skills/skill-creator/eval-viewer/viewer.html1-1326 skills/skill-creator/assets/eval_review.html1-147
The script is invoked directly by skill-creator after evaluation runs complete:
python generate_review.py <workspace-path> [options]
| Flag | Default | Purpose |
|---|---|---|
| --port / -p | 3117 | HTTP server port |
| --skill-name / -n | derived from workspace dir name | Header label in the UI |
| --previous-workspace | none | Path to a prior iteration workspace for comparison |
| --benchmark | none | Path to benchmark.json for the Benchmark tab |
| --static / -s | none | Write standalone HTML to a file instead of serving |
If a process is already listening on the target port, _kill_port() uses lsof to terminate it before binding. If the port is still unavailable after the kill, the server falls back to an OS-assigned port.
Sources: skills/skill-creator/eval-viewer/generate_review.py387-467
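The kill-and-fallback behavior can be sketched as follows. This is a minimal illustration, not the real module's code; the helper names and the exact lsof invocation are assumptions.

```python
import socket
import subprocess

def kill_port(port: int) -> None:
    # Illustrative version of the lsof-based cleanup: list the PIDs
    # listening on the port (-t = terse PID output, -i = network files),
    # then terminate each one.
    result = subprocess.run(
        ["lsof", "-ti", f"tcp:{port}"], capture_output=True, text=True
    )
    for pid in result.stdout.split():
        subprocess.run(["kill", pid])

def bind_with_fallback(port: int) -> socket.socket:
    # If the requested port is still busy after the kill attempt,
    # bind to port 0 so the OS assigns a free port instead.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
    except OSError:
        sock.close()
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind(("127.0.0.1", 0))
    return sock
```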
find_runs() recursively walks a workspace directory looking for any subdirectory that contains an outputs/ child. It skips node_modules, .git, __pycache__, skill, and inputs. The result is sorted by (eval_id, run_id).
Workspace Directory Traversal
Sources: skills/skill-creator/eval-viewer/generate_review.py60-146
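The traversal described above can be sketched with os.walk; this is an illustrative re-creation, not the real implementation:

```python
import os

SKIP_DIRS = {"node_modules", ".git", "__pycache__", "skill", "inputs"}

def find_runs(workspace: str) -> list[str]:
    # Any directory containing an outputs/ child counts as a run.
    # Skipped directories are pruned in place so os.walk never descends
    # into them.
    runs = []
    for root, dirs, _files in os.walk(workspace):
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        if "outputs" in dirs:
            runs.append(os.path.relpath(root, workspace))
    return sorted(runs)
```

Sorting the relative paths approximates the (eval_id, run_id) ordering when runs live at workspace/&lt;eval_id&gt;/&lt;run_id&gt;/.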
embed_file() classifies each output file and produces a typed dict for the browser. The id field is the run's path relative to the workspace root, with / replaced by -.
| File type | Condition | Dict fields |
|---|---|---|
| Text | extension in TEXT_EXTENSIONS | type: "text", content (str) |
| Image | extension in IMAGE_EXTENSIONS | type: "image", mime, data_uri (base64) |
| PDF | .pdf | type: "pdf", data_uri (base64) |
| XLSX | .xlsx | type: "xlsx", data_b64 (base64) |
| Other binary | anything else | type: "binary", mime, data_uri (base64) |
TEXT_EXTENSIONS covers .txt, .md, .json, .csv, .py, .js, .ts, .tsx, .jsx, .yaml, .yml, .xml, .html, .css, .sh, .rb, .go, .rs, .java, .c, .cpp, .h, .hpp, .sql, .r, .toml.
Sources: skills/skill-creator/eval-viewer/generate_review.py149-211
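The classification can be sketched as below. This is a simplified stand-in for the real function: the TEXT_EXTENSIONS set here is a subset of the full list above, and the IMAGE_EXTENSIONS set is an assumption.

```python
import base64
import mimetypes
from pathlib import Path

TEXT_EXTENSIONS = {".txt", ".md", ".json", ".csv", ".py", ".js", ".ts",
                   ".yaml", ".yml", ".xml", ".html", ".css", ".sh"}  # subset
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg"}  # assumed

def embed_file(path: Path) -> dict:
    # Classify by extension and return the typed dict the browser expects.
    ext = path.suffix.lower()
    if ext in TEXT_EXTENSIONS:
        return {"type": "text", "content": path.read_text(errors="replace")}
    b64 = base64.b64encode(path.read_bytes()).decode()
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    if ext in IMAGE_EXTENSIONS:
        return {"type": "image", "mime": mime,
                "data_uri": f"data:{mime};base64,{b64}"}
    if ext == ".pdf":
        return {"type": "pdf", "data_uri": f"data:application/pdf;base64,{b64}"}
    if ext == ".xlsx":
        return {"type": "xlsx", "data_b64": b64}
    return {"type": "binary", "mime": mime,
            "data_uri": f"data:{mime};base64,{b64}"}
```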
generate_html() reads viewer.html and replaces the literal marker /*__EMBEDDED_DATA__*/ with a JavaScript assignment:
const EMBEDDED_DATA = <json blob>;
The JSON blob has this shape:
{
"skill_name": str,
"runs": [ run_dict, ... ],
"previous_feedback": { run_id: feedback_str },
"previous_outputs": { run_id: [file_dict, ...] },
"benchmark": { ... } // only if --benchmark provided
}
previous_feedback and previous_outputs are populated by load_previous_iteration(), which reads feedback.json from the previous workspace and re-scans that workspace's runs.
Sources: skills/skill-creator/eval-viewer/generate_review.py250-281
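The marker substitution itself is a single string replacement; a minimal sketch (the function name here is illustrative):

```python
import json

def inject_embedded_data(template: str, data: dict) -> str:
    # Replace the literal /*__EMBEDDED_DATA__*/ marker with a JS
    # assignment so viewer.html can read EMBEDDED_DATA at load time.
    return template.replace(
        "/*__EMBEDDED_DATA__*/",
        f"const EMBEDDED_DATA = {json.dumps(data)};",
    )
```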
ReviewHandler extends BaseHTTPRequestHandler and exposes two routes:
| Route | Method | Behavior |
|---|---|---|
| / or /index.html | GET | Re-scans the workspace, regenerates the HTML, and serves it, so a browser refresh picks up new eval outputs without restarting the server. |
| /api/feedback | GET | Returns the current feedback.json contents. |
| /api/feedback | POST | Validates that the JSON body has a reviews key, then writes it to feedback.json. |
Request logging is suppressed via log_message() to keep the terminal output clean.
Sources: skills/skill-creator/eval-viewer/generate_review.py308-384
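An illustrative skeleton of this routing, with an in-memory dict standing in for feedback.json on disk (the real handler also re-scans the workspace and regenerates the full HTML on every GET of /):

```python
import json
from http.server import BaseHTTPRequestHandler

FEEDBACK = {"reviews": {}}  # stands in for feedback.json on disk

class ReviewHandlerSketch(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in ("/", "/index.html"):
            # The real handler re-scans and regenerates the viewer here.
            self._send(200, b"<html>regenerated</html>", "text/html")
        elif self.path == "/api/feedback":
            self._send(200, json.dumps(FEEDBACK).encode(), "application/json")
        else:
            self._send(404, b"not found", "text/plain")

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        if self.path == "/api/feedback" and "reviews" in body:
            FEEDBACK.update(body)  # the real handler writes feedback.json
            self._send(200, b"ok", "text/plain")
        else:
            self._send(400, b"expected a 'reviews' key", "text/plain")

    def _send(self, status, body, ctype):
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging to keep terminal output clean.
        pass
```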
viewer.html is a self-contained single-page application served by ReviewHandler. It relies on EMBEDDED_DATA injected at serve time and uses the SheetJS library (loaded from CDN) for XLSX rendering.
viewer.html UI Sections
Sources: skills/skill-creator/eval-viewer/viewer.html543-631
init() loads any existing feedback.json from the server (via GET /api/feedback), unless previous_feedback or previous_outputs data is present — in that case the on-disk feedback.json is considered stale from the prior iteration and is not pre-filled.
showRun(index) is the central rendering function. It:
- Updates the run counter ("N of M").
- Parses the run's id string (regex match on with_skill|without_skill|new_skill|old_skill) and shows a color-coded .config-badge.
- Calls renderOutputs(), renderPrevOutputs(), and renderGrades().
- Restores any saved feedback for the run from feedbackMap and, if present, shows the prior iteration's feedback from EMBEDDED_DATA.previous_feedback.
- Records the run in visitedRuns; once all runs are visited, the "Submit All Reviews" button gains the .ready CSS class.

Arrow key navigation (left/right/up/down) is captured via document.addEventListener("keydown") and delegates to navigate(delta).
Sources: skills/skill-creator/eval-viewer/viewer.html656-761
renderOutputs() iterates run.outputs and renders each file according to its type field:
| type | Renderer |
|---|---|
| "text" | <pre> with textContent |
| "image" | <img src=data_uri> |
| "pdf" | <iframe src=data_uri> (600px height) |
| "xlsx" | renderXlsx() via SheetJS |
| "binary" | <a download> link |
| "error" | <pre> in red |
Every file gets a header bar with the filename and a "Download" anchor pointing to getDownloadUri(file).
renderXlsx() decodes file.data_b64 using atob, constructs a Uint8Array, calls XLSX.read(), then XLSX.utils.sheet_to_html() per sheet. Multi-sheet workbooks display a label per sheet.
Sources: skills/skill-creator/eval-viewer/viewer.html763-855
renderGrades() reads run.grading (a grading.json loaded by build_run()). The section is hidden when no grading data exists, and collapsed by default when it does. It shows:
- The overall pass rate (summary.pass_rate), colored green/neutral/red (≥80% / 50–80% / <50%).
- A "passed / failed of total" summary line.
- The expectations[] entries, each showing a pass/fail icon, the expectation text, and an evidence sub-line.

Sources: skills/skill-creator/eval-viewer/viewer.html857-912
When --previous-workspace is supplied to generate_review.py, load_previous_iteration() populates EMBEDDED_DATA.previous_outputs and EMBEDDED_DATA.previous_feedback.
renderPrevOutputs() renders the previous iteration's output files into a collapsible "Previous Output" section using the same rendering logic as renderOutputs(). This lets a reviewer compare the current run's outputs side-by-side (above/below) with those from the prior skill version.
Previous iteration feedback appears below the current feedback textarea as a read-only block labeled "Previous feedback".
Sources: skills/skill-creator/eval-viewer/viewer.html914-991 skills/skill-creator/eval-viewer/generate_review.py213-247
renderBenchmark() is called once at page load. If EMBEDDED_DATA.benchmark exists, it shows the "Benchmark" tab in .view-tabs.
It renders two layers of data:
- A summary table: mean ± stddev for each configuration (dynamically discovered from run_summary keys, excluding "delta"), plus a delta column.
- Per-eval detail: for each eval_id, a table of per-run pass rates per configuration with averages. If expectations arrays are present, a per-assertion detail table is appended showing pass/fail icons per run.

The benchmark.json structure consumed here is documented in detail in Benchmarking and Reporting.
Sources: skills/skill-creator/eval-viewer/viewer.html1113-1318
Feedback Data Flow
The in-memory feedbackMap is a plain object { run_id: feedback_text }. On final submit via showDoneDialog(), all runs are included — even those with no feedback — so that the consuming agent can distinguish "reviewed and looks fine" (empty string) from "not yet reviewed".
feedback.json schema:
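The exact on-disk schema is not reproduced here. Based on the reviews-key validation in /api/feedback and the { run_id: feedback_text } shape of feedbackMap, it is plausibly a single reviews object (run ids and feedback strings below are invented examples):

```json
{
  "reviews": {
    "eval1-run1": "Output looks correct but the table formatting is off.",
    "eval1-run2": ""
  }
}
```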
Sources: skills/skill-creator/eval-viewer/viewer.html993-1067 skills/skill-creator/eval-viewer/generate_review.py361-378
eval_review.html is a standalone HTML asset stored in skills/skill-creator/assets/. It is separate from the eval viewer; its purpose is to build and edit the evals.json trigger/no-trigger eval set for a skill, not to review evaluation run outputs.
skill-creator populates three template placeholders before serving or opening the file:
| Placeholder | Replaced with |
|---|---|
__SKILL_NAME_PLACEHOLDER__ | Skill name string |
__SKILL_DESCRIPTION_PLACEHOLDER__ | Current SKILL.md description |
__EVAL_DATA_PLACEHOLDER__ | JSON array of {query, should_trigger} objects |
Sources: skills/skill-creator/assets/eval_review.html6-63
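The substitution is plain string replacement; a hypothetical Python sketch of the step skill-creator performs (the function name is illustrative):

```python
import json

def fill_eval_review_template(template: str, skill_name: str,
                              description: str, evals: list[dict]) -> str:
    # Replace the three placeholders with the skill's name, its
    # SKILL.md description, and the eval set as a JSON array.
    return (template
            .replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
            .replace("__SKILL_DESCRIPTION_PLACEHOLDER__", description)
            .replace("__EVAL_DATA_PLACEHOLDER__", json.dumps(evals)))
```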
The page renders a two-column table sorted into two groups: "Should Trigger" rows first, "Should NOT Trigger" rows second.
| UI element | Behavior |
|---|---|
| Query textarea (.query-input) | Edits the query text; changes update evalItems[] via updateQuery() |
| Toggle switch (.toggle) | Flips the should_trigger boolean; calls updateTrigger(), which re-renders to reorder the groups |
| Delete button (.btn-delete) | Removes the row via deleteRow() |
| "+ Add Query" button | Appends {query: '', should_trigger: true} and focuses the new textarea |
| "Export Eval Set" button | Calls exportEvalSet(), filters out empty queries, and downloads eval_set.json |
exportEvalSet() serializes the current evalItems array (excluding blank queries) to a JSON file download. The exported format matches the evals.json format consumed by the evaluation pipeline.
Sources: skills/skill-creator/assets/eval_review.html62-146
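The export step can be mirrored in a few lines; this is an assumed re-creation of the filtering and serialization, not the page's actual JavaScript:

```python
import json

def export_eval_set(eval_items: list[dict]) -> str:
    # Drop rows whose query is blank, serialize the rest in the
    # {query, should_trigger} format the evaluation pipeline consumes.
    kept = [item for item in eval_items if item["query"].strip()]
    return json.dumps(kept, indent=2)
```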
When --static <path> is passed to generate_review.py, generate_html() is called once and written to the specified file. No HTTP server is started. In this mode, viewer.html's saveCurrentFeedback() handles fetch failures gracefully — it shows "Will download on submit" status text, and showDoneDialog() falls back to a Blob download of feedback.json instead of a POST request.
Sources: skills/skill-creator/eval-viewer/generate_review.py431-436 skills/skill-creator/eval-viewer/viewer.html1018-1023 skills/skill-creator/eval-viewer/viewer.html1049-1059