The Evaluation System provides AI-powered scoring and analysis of prompts after test execution. Users can evaluate individual prompts (original or optimized) or perform comparative analysis between two versions. The system uses dedicated evaluation templates processed by the LLM to generate structured EvaluationResponse objects containing scores, strengths, weaknesses, and improvement suggestions.
This system is distinct from optimization (Prompt Optimization Service) and operates on test results within workspace testing panels (Test Panels and Multi-Column Testing).
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue366-415 packages/ui/src/components/context-mode/ContextSystemWorkspace.vue331-385
| UI Component | File | Key Methods/Props |
|---|---|---|
| EvaluationScoreBadge | packages/ui/src/components/evaluation/EvaluationScoreBadge.vue | :score, :level, :result, @show-detail, @apply-improvement |
| EvaluationPanel | packages/ui/src/components/evaluation/EvaluationPanel.vue | v-model:show, :is-evaluating, :result, @apply-improvement |
| ContextUserWorkspace | packages/ui/src/components/context-mode/ContextUserWorkspace.vue | handleEvaluate(), evaluation state refs |
| ContextSystemWorkspace | packages/ui/src/components/context-mode/ContextSystemWorkspace.vue | handleEvaluate(), evaluation state refs |
| Composable | File | Exports |
|---|---|---|
| useEvaluationHandler | packages/ui/src/composables/prompt/useEvaluationHandler.ts | handleEvaluate(), handleReEvaluate(), handleEvaluateOriginal(), handleEvaluateOptimized(), handleEvaluateCompare() |
| useEvaluation | packages/ui/src/composables/prompt/useEvaluation.ts | evaluatePrompt(), calculateScoreLevel(), state refs |
| provideEvaluation | packages/ui/src/composables/prompt/provideEvaluation.ts | Context provider for evaluation state |
| provideProContext | packages/ui/src/composables/prompt/provideProContext.ts | Context provider for Pro mode evaluation context |
| Service Method | File | Signature |
|---|---|---|
| PromptService.evaluatePrompt() | packages/core/src/services/prompt/service.ts | (request: EvaluationRequest) => Promise<EvaluationResponse> |
| LLMService.sendMessage() | packages/core/src/services/llm/service.ts | (messages, modelKey, options) => Promise<string> |
| TemplateManager.getTemplate() | packages/core/src/services/template/manager.ts | (id: string) => Template |
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue493-521 packages/ui/src/components/context-mode/ContextSystemWorkspace.vue466-496 packages/ui/src/components/evaluation/
The system supports three evaluation types defined by the EvaluationType union:
| Type | Template ID | Context Required | Output Focus | Availability |
|---|---|---|---|---|
| original | evaluation | Original prompt + test result | Absolute quality score | When variant 'a' has result |
| optimized | evaluation | Optimized prompt + test result | Absolute quality score | When variant 'b' has result |
| compare | compare-evaluation | Both prompts + both results | Relative comparison | 2-column mode with both results |
- Single evaluation (original/optimized): available for any column with a test result
- Compare evaluation: requires testColumnCountModel === 2 (2-column layout) and hasVariantResult('a') && hasVariantResult('b')

Compare mode uses different template variables to analyze the improvement delta rather than absolute quality.
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue250-272 packages/ui/src/components/context-mode/ContextSystemWorkspace.vue205-227
```
ContextUserWorkspace.handleEvaluate(type: EvaluationType)
└─> evaluationHandler.handleEvaluate(type)
    └─> evaluation.evaluatePrompt(request: EvaluationRequest)
        └─> services.promptService.evaluatePrompt(request)
            ├─> templateManager.getTemplate(templateId)
            ├─> processTemplate(template, variables)
            └─> llmService.sendMessage(messages, modelKey)
```
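The call chain above can be sketched as a single function. This is a simplified illustration, not the actual code in packages/core: the dependency shapes and the single-string message are assumptions.

```typescript
// Sketch of the evaluatePrompt pipeline from the call chain above.
// All dependency signatures here are simplified assumptions.
async function evaluatePrompt(
  request: {
    type: 'original' | 'optimized' | 'compare';
    targetPrompt: string;
    optimizationMode: 'system' | 'user';
    modelKey: string;
    context?: Record<string, string>;
  },
  deps: {
    getTemplate: (id: string) => { content: string };
    processTemplate: (tpl: string, vars: Record<string, string>) => string;
    sendMessage: (prompt: string, modelKey: string) => Promise<string>;
  },
): Promise<string> {
  // Compare evaluations use a dedicated template ID.
  const templateId = request.type === 'compare' ? 'compare-evaluation' : 'evaluation';
  const template = deps.getTemplate(templateId);
  // Substitute template variables before sending to the LLM.
  const prompt = deps.processTemplate(template.content, {
    targetPrompt: request.targetPrompt,
    optimizationMode: request.optimizationMode,
    ...request.context,
  });
  return deps.sendMessage(prompt, request.modelKey);
}
```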
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue1356-1425 packages/ui/src/composables/prompt/useEvaluationHandler.ts1-150 packages/core/src/services/prompt/service.ts200-350
Input to evaluation service:
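A plausible shape for the request, inferred from fields referenced elsewhere on this page (targetPrompt, optimizationMode, context.originalPrompt, and so on); the authoritative type lives in packages/core/src/services/prompt/types.ts and may differ:

```typescript
// Hypothetical request shape, reconstructed from the variable tables below.
type EvaluationType = 'original' | 'optimized' | 'compare';

interface EvaluationRequest {
  type: EvaluationType;
  targetPrompt: string;                 // the prompt being evaluated
  optimizationMode: 'system' | 'user';
  modelKey: string;                     // evaluation model, may differ from optimization model
  context?: {
    originalPrompt?: string;
    optimizedPrompt?: string;
    originalResult?: string;
    optimizedResult?: string;
    testResult?: string;
  };
}

const sample: EvaluationRequest = {
  type: 'original',
  targetPrompt: 'Summarize the following text…',
  optimizationMode: 'user',
  modelKey: 'gpt-3.5-turbo',
  context: { testResult: 'A concise summary…' },
};
```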
LLM returns structured JSON response:
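An illustrative response, using the field names from the defaults table later on this page; the aspect keys shown here are made up for the example:

```json
{
  "score": 82,
  "level": "good",
  "aspects": { "clarity": 85, "specificity": 78 },
  "strengths": ["Clear task definition"],
  "weaknesses": ["No output format specified"],
  "improvements": ["Specify the desired output format"],
  "summary": "Solid prompt with room for more explicit constraints."
}
```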
UI-side categorical score level:
| ScoreLevel | Badge Color | Icon | Score Range |
|---|---|---|---|
| excellent | Green | ✓ | 90-100 |
| good | Blue | ✓ | 75-89 |
| average | Yellow | ! | 60-74 |
| poor | Red | × | 0-59 |
| unknown | Gray | ? | null |
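The mapping in the table above is straightforward to express in code. A minimal sketch of calculateScoreLevel() (useEvaluation.ts exports a function by this name; the exact implementation may differ):

```typescript
// Score-to-level mapping per the table above.
type ScoreLevel = 'excellent' | 'good' | 'average' | 'poor' | 'unknown';

function calculateScoreLevel(score: number | null): ScoreLevel {
  if (score === null || Number.isNaN(score)) return 'unknown'; // no score available
  if (score >= 90) return 'excellent';
  if (score >= 75) return 'good';
  if (score >= 60) return 'average';
  return 'poor';
}
```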
Sources: packages/core/src/services/prompt/types.ts150-200 packages/ui/src/composables/prompt/useEvaluation.ts40-80 packages/ui/src/components/evaluation/EvaluationScoreBadge.vue1-100
Evaluation uses two template types:
| Template ID | Template Type | Usage |
|---|---|---|
| evaluation | evaluation | Single prompt evaluation |
| compare-evaluation | evaluation | Comparative evaluation |
Templates are retrieved via TemplateManager.getTemplate(id) and processed with variable substitution.
Evaluation templates receive the following variables:
Single Evaluation Variables
| Variable | Source | Example Value |
|---|---|---|
| targetPrompt | request.targetPrompt | User's prompt text |
| optimizationMode | request.optimizationMode | 'system' or 'user' |
| testResult | Test variant result | LLM response text |
Compare Evaluation Variables
| Variable | Source | Example Value |
|---|---|---|
| originalPrompt | request.context.originalPrompt | V0 prompt |
| optimizedPrompt | request.context.optimizedPrompt | V1 prompt |
| originalResult | Variant 'a' result | Original test output |
| optimizedResult | Variant 'b' result | Optimized test output |
| optimizationMode | request.optimizationMode | 'system' or 'user' |
Template processor uses mustache-style syntax: {{variableName}} is replaced with actual values before sending to LLM.
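The substitution described above can be sketched in a few lines; the real processor (template/processor.ts) likely handles more cases, such as sections or escaping:

```typescript
// Minimal mustache-style variable substitution: {{name}} -> variables[name].
function processTemplate(template: string, variables: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) =>
    name in variables ? variables[name] : match, // leave unknown variables untouched
  );
}
```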
Sources: packages/core/src/services/prompt/service.ts200-280 packages/core/src/services/template/processor.ts50-150 packages/core/src/services/template/manager.ts100-200
PromptService.parseEvaluationResponse() handles multiple response formats: raw JSON, and JSON wrapped in fenced code blocks (```json … ```).

Parsing logic:
Missing fields are populated with safe defaults:
| Field | Type | Default if Missing |
|---|---|---|
| score | number | null |
| level | string | 'unknown' |
| aspects | object | {} |
| strengths | string[] | [] |
| weaknesses | string[] | [] |
| improvements | string[] | [] |
| summary | string? | undefined |
The system does not throw errors for missing optional fields, ensuring graceful degradation.
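A sketch of tolerant parsing combining both behaviors described above: strip an optional fenced block, parse, and fill missing fields with the defaults from the table. This is an illustration, not the actual implementation in service.ts:

```typescript
// Tolerant parse: accept raw JSON or ```json fenced JSON; never throw,
// fall back to the safe defaults from the table above instead.
interface EvaluationResponse {
  score: number | null;
  level: string;
  aspects: Record<string, unknown>;
  strengths: string[];
  weaknesses: string[];
  improvements: string[];
  summary?: string;
}

function parseEvaluationResponse(raw: string): EvaluationResponse {
  // Prefer the content of a fenced ```json block if one is present.
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  const body = fenced ? fenced[1] : raw;
  let parsed: Partial<EvaluationResponse> = {};
  try {
    parsed = JSON.parse(body.trim());
  } catch {
    // Graceful degradation: keep defaults instead of throwing.
  }
  return {
    score: typeof parsed.score === 'number' ? parsed.score : null,
    level: parsed.level ?? 'unknown',
    aspects: parsed.aspects ?? {},
    strengths: parsed.strengths ?? [],
    weaknesses: parsed.weaknesses ?? [],
    improvements: parsed.improvements ?? [],
    summary: parsed.summary,
  };
}
```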
Sources: packages/core/src/services/prompt/service.ts320-420
Component Integration Architecture
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue200-280 packages/ui/src/components/TestResultSection.vue1-200 packages/ui/src/components/evaluation/FocusAnalyzeButton.vue1-100
File: packages/ui/src/components/evaluation/EvaluationScoreBadge.vue
Props
Events
Usage in Workspace
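The original usage snippet is not preserved here. A plausible sketch, using the props and events listed in the component table at the top of this page (the binding names like originalEvaluation are assumptions):

```vue
<!-- Hypothetical usage; exact bindings are in ContextUserWorkspace.vue -->
<EvaluationScoreBadge
  :score="originalEvaluation?.score ?? null"
  :level="originalEvaluation?.level ?? 'unknown'"
  :result="originalEvaluation"
  @show-detail="showEvaluationPanel = true"
  @apply-improvement="handleApplyImprovement"
/>
```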
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue250-280 packages/ui/src/components/TestResultSection.vue40-120
File: packages/ui/src/components/evaluation/FocusAnalyzeButton.vue
The FocusAnalyzeButton provides a compact entry point for evaluation, typically used when no evaluation exists yet.
Props
Events
Usage
Sources: packages/ui/src/components/evaluation/FocusAnalyzeButton.vue1-150 packages/ui/src/components/context-mode/ContextUserWorkspace.vue265-275
File: packages/ui/src/components/evaluation/FeedbackAnalyzeButton.vue
The FeedbackAnalyzeButton embeds within EvaluationScoreBadge hover card, allowing users to provide context or feedback before evaluation.
Props
Events
Behavior
Clicking the button emits evaluate-with-feedback with the user's feedback string.

Usage Context
Embedded in EvaluationHoverCard component, which is triggered by hovering over EvaluationScoreBadge.
Sources: packages/ui/src/components/evaluation/FeedbackAnalyzeButton.vue1-134
File: packages/ui/src/components/evaluation/EvaluationPanel.vue
Props
Events
Usage in Workspace
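The original usage snippet is not preserved here. A plausible sketch, using the props and events from the component table at the top of this page (the state names are assumptions):

```vue
<!-- Hypothetical usage; exact bindings are in the workspace components -->
<EvaluationPanel
  v-model:show="showEvaluationPanel"
  :is-evaluating="isEvaluating"
  :result="evaluationResult"
  @apply-improvement="handleApplyImprovement"
/>
```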
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue366-438 packages/ui/src/components/evaluation/EvaluationScoreBadge.vue1-200 packages/ui/src/components/evaluation/EvaluationPanel.vue1-300
Each workspace maintains evaluation state for active test variants:
State Refs (per variant type)
Computed Properties
Evaluation state is provided to child components via composable context:
Handler composable injects this context:
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue1110-1210 packages/ui/src/composables/prompt/provideEvaluation.ts1-100 packages/ui/src/composables/prompt/provideProContext.ts1-100
Users apply evaluation suggestions via the detail panel:
Flow
1. EvaluationPanel renders the improvements array
2. Clicking an improvement emits @apply-improvement="handleApplyImprovement"

Handler Implementation
Note: Improvement text replaces entire prompt content. For partial edits, use patch operations.
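A minimal sketch of the full-replacement behavior described in the note above; the state shape and function name are assumptions, not the actual workspace code:

```typescript
// Whole-content replacement: the improvement text becomes the new prompt.
// For partial edits, use patch operations instead (next section).
interface PromptState {
  optimizedPrompt: string;
}

function handleApplyImprovement(state: PromptState, improvement: string): PromptState {
  return { ...state, optimizedPrompt: improvement };
}
```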
For granular modifications, the system supports PatchOperation:
Handler Implementation
Utility Function: applyPatchOperationsToText() from @prompt-optimizer/core applies operations to text content.
Example Patch
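The original example is not preserved here. A hypothetical PatchOperation shape and a toy applier, standing in for applyPatchOperationsToText() from @prompt-optimizer/core — the real operation schema is not documented on this page:

```typescript
// Hypothetical patch shape: find-and-replace operations applied in order.
interface PatchOperation {
  type: 'replace';
  target: string;      // substring to find
  replacement: string; // text to substitute
}

function applyPatches(text: string, ops: PatchOperation[]): string {
  return ops.reduce((acc, op) => acc.split(op.target).join(op.replacement), text);
}

const patched = applyPatches('Summarize the text.', [
  { type: 'replace', target: 'Summarize', replacement: 'Summarize, in three bullet points,' },
]);
// patched === 'Summarize, in three bullet points, the text.'
```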
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue1356-1387 packages/core/src/services/prompt/utils.ts1-100
| Characteristic | Single Evaluation | Compare Evaluation |
|---|---|---|
| Template | evaluation | compare-evaluation |
| Input | One prompt + result | Two prompts + two results |
| Score Meaning | Absolute quality (0-100) | Relative improvement or preference |
| Output Focus | Individual quality metrics | Improvement delta and comparison |
| Availability | Any variant with result | 2-column mode with both results |
| Button Location | In OutputDisplay toolbar | In control bar above variants |
Compare evaluation requires both variants' data:
Compare evaluation button only appears when:
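The gating conditions (listed under Conditional Rendering later on this page) can be expressed as a predicate. A sketch; in the workspace components this is a computed property:

```typescript
// Compare evaluation is available only in 2-column mode with both results.
interface CompareState {
  testColumnCount: number;
  hasVariantResult: (variant: 'a' | 'b') => boolean;
}

function canCompareEvaluate(state: CompareState): boolean {
  return (
    state.testColumnCount === 2 &&
    state.hasVariantResult('a') &&
    state.hasVariantResult('b')
  );
}
```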
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue250-272 packages/ui/src/composables/prompt/useEvaluationHandler.ts100-180
Evaluation can use a different model than optimization for cost/performance optimization.
Model Selection Hierarchy
Model Requirements
| Capability | Optimization Model | Evaluation Model |
|---|---|---|
| JSON output | Nice to have | Required |
| Creativity | High | Low |
| Reasoning | High | Medium |
| Speed | Slow acceptable | Prefer fast |
| Cost | High acceptable | Prefer low |
Configuration Location
Evaluation model is configured in:
- FunctionModelManager.evaluationModelKey
- PromptOptimizerApp → workspace components → evaluationModelKey prop

Example Usage
User can set GPT-4 for optimization but use GPT-3.5-turbo for evaluation to reduce API costs while maintaining evaluation quality.
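Resolution with fallback can be sketched as follows (the fallback to the optimization model is mentioned in the error-handling table below; the function name and signature here are assumptions, not the FunctionModelManager API):

```typescript
// Prefer the configured evaluation model; fall back to the optimization
// model when the evaluation model is unset or unavailable.
function resolveEvaluationModel(
  evaluationModelKey: string | undefined,
  optimizationModelKey: string,
  isAvailable: (key: string) => boolean,
): string {
  if (evaluationModelKey && isAvailable(evaluationModelKey)) return evaluationModelKey;
  return optimizationModelKey;
}
```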
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue830-832 packages/ui/src/composables/model/useFunctionModelManager.ts50-150
| Error Type | Cause | UI Behavior | Recovery |
|---|---|---|---|
| Template Not Found | Missing evaluation or compare-evaluation template | Error toast | Prompt user to restore default templates |
| Invalid JSON | LLM returned non-JSON or malformed JSON | Error toast with parse details | Retry button in panel |
| Model Unavailable | Evaluation model deleted/disabled | Error toast | Fallback to optimization model or prompt configuration |
| API Failure | Network error or LLM API error | Error toast with API message | Retry button |
| Empty Response | LLM returned empty string | Error toast | Retry with different model |
Handler Response
- If score is missing but other fields exist, show the "Unknown" score level
- If the improvements array is empty, hide "Apply" buttons but show the other fields
- EvaluationPanel includes a "Retry" button that re-invokes handleReEvaluate()

Sources: packages/ui/src/composables/prompt/useEvaluation.ts150-250 packages/ui/src/composables/prompt/useEvaluationHandler.ts80-150
| Aspect | Implementation | Rationale |
|---|---|---|
| User-triggered | No auto-evaluation; requires explicit click | Avoid unnecessary API calls |
| Concurrent evaluations | Allows simultaneous original + optimized + compare | Non-blocking UI |
| Model selection | Separate evaluation model config | Use faster/cheaper model for evaluation |
| No debouncing | Immediate evaluation on click | User expects instant feedback |
In-Memory Only
Evaluation results are not persisted:
- Held only in ref() objects
- Lost on mode switch (/basic/* → /pro/*)
- Not written to PreferenceService or session stores

Rationale: Evaluation is specific to the current test context. Prompts change frequently, making stale evaluation results misleading.
Evaluation results are NOT automatically invalidated when:
User must manually re-evaluate to get updated scores. This is intentional to avoid surprise API costs.
Cache Key: Results are stored by EvaluationType only, not by prompt content hash. If prompt changes, old evaluation remains visible until user triggers new evaluation.
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue1110-1150 packages/ui/src/composables/prompt/useEvaluation.ts50-120
2-Column Mode
- TestResultSection header (original result card)
- TestResultSection header (optimized result card)
- TestControlBar (top-right, next to "Run All")

3-4 Column Mode
Conditional Rendering
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue220-280 packages/ui/src/components/TestResultSection.vue30-120
Button State Transitions
| Initial State | User Action | New State | Component |
|---|---|---|---|
| No result | - | (hidden) | - |
| Has result, no eval | Click "Evaluate" | Evaluating | FocusAnalyzeButton → loading |
| Evaluating | (wait) | Has evaluation | Loading indicator |
| Has evaluation | - | Show score | EvaluationScoreBadge |
| Has evaluation | Click badge | Detail panel | EvaluationPanel opens |
Sources: packages/ui/src/components/TestResultSection.vue40-120 packages/ui/src/components/context-mode/ContextUserWorkspace.vue366-415
Compare evaluation requires:
- testColumnCountModel.value === 2
- hasVariantResult('a') === true
- hasVariantResult('b') === true

If any condition is false, the compare button/badge is not rendered.
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue220-272 packages/ui/src/components/context-mode/ContextUserWorkspace.vue366-415
Evaluation UI text uses the evaluation.* namespace:
| Translation Key | English | Chinese (Simplified) |
|---|---|---|
| evaluation.title | "Prompt Evaluation" | "提示词评估" |
| evaluation.evaluate | "Evaluate" | "评估" |
| evaluation.score | "Score" | "分数" |
| evaluation.level.excellent | "Excellent" | "优秀" |
| evaluation.level.good | "Good" | "良好" |
| evaluation.level.average | "Average" | "一般" |
| evaluation.level.poor | "Poor" | "较差" |
| evaluation.strengths | "Strengths" | "优点" |
| evaluation.weaknesses | "Weaknesses" | "缺点" |
| evaluation.improvements | "Improvements" | "改进建议" |
| evaluation.compareEvaluate | "Compare Evaluation" | "对比评估" |
| evaluation.applyImprovement | "Apply Improvement" | "应用改进" |
| evaluation.appliedImprovement | "Improvement applied" | "已应用改进" |
| evaluation.feedbackAnalyze | "Analyze with Feedback" | "带反馈分析" |
| evaluation.feedbackTitle | "Provide Context" | "提供上下文" |
| evaluation.focusAnalyze | "Focus Analyze" | "聚焦分析" |
Sources: packages/ui/src/i18n/locales/en-US.ts1270-1350 packages/ui/src/i18n/locales/zh-CN.ts1270-1350
Note: Evaluation response content (from LLM) uses the template's language, not the UI locale. If evaluation template is in English, LLM returns English strengths, weaknesses, and improvements regardless of UI language setting.
To get localized evaluation responses:
- Use locale-specific evaluation templates (e.g., evaluation-en, evaluation-zh)

Sources: packages/ui/src/i18n/locales/en-US.ts1270-1320 packages/ui/src/i18n/locales/zh-CN.ts1270-1320
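Selecting a locale-specific template could look like the following sketch; the evaluation-en / evaluation-zh IDs appear above, but the suffix scheme and function name here are assumptions:

```typescript
// Map a UI locale ('en-US', 'zh-CN', …) to a locale-specific template ID.
function evaluationTemplateId(locale: string): string {
  const lang = locale.split('-')[0]; // 'zh-CN' -> 'zh'
  return `evaluation-${lang}`;
}
```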
Basic Evaluation Workflow
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue366-438 packages/ui/src/components/context-mode/ContextUserWorkspace.vue1356-1387
Evaluation with User Feedback Workflow
Sources: packages/ui/src/components/evaluation/FeedbackAnalyzeButton.vue1-134 packages/ui/src/components/evaluation/FocusAnalyzeButton.vue1-150
The Evaluation System provides AI-powered scoring, comparative analysis between prompt versions, and structured improvement suggestions.
The system enhances the optimization workflow by providing objective quality metrics and concrete improvement paths, helping users iteratively refine their prompts beyond the initial optimization.
Sources: packages/ui/src/components/context-mode/ContextUserWorkspace.vue1-900 packages/ui/src/components/context-mode/ContextSystemWorkspace.vue1-800 packages/core/src/services/prompt/service.ts1-700