
advanced-evaluation
by 5dlabs
Cognitive Task Orchestrator - GitOps on Bare Metal or Cloud for AI Agents
⭐ 2 · 🍴 1 · 📅 Jan 24, 2026
SKILL.md
name: advanced-evaluation
description: LLM-as-Judge techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation.
agents: [cleo, tess, morgan, atlas]
triggers: [LLM-as-judge, compare outputs, evaluation rubrics, mitigate bias, direct scoring, pairwise comparison]
Advanced Evaluation
Production-grade techniques for evaluating LLM outputs using LLMs as judges.
Evaluation Taxonomy
Direct Scoring
A single LLM rates one response on a defined scale.
- Best for: Objective criteria (factual accuracy, instruction following)
- Reliability: Moderate to high for well-defined criteria
- Failure mode: Score calibration drift
Pairwise Comparison
An LLM compares two responses and selects the better one.
- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
- Failure mode: Position bias, length bias
The Bias Landscape
| Bias | Description | Mitigation |
|---|---|---|
| Position | First-position responses favored | Swap positions, majority vote |
| Length | Longer responses rated higher | Explicit prompting to ignore length |
| Self-Enhancement | Models rate own outputs higher | Use different model for evaluation |
| Verbosity | Detailed explanations favored | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
Direct Scoring Implementation
You are an expert evaluator assessing response quality.
## Task
Evaluate the following response against each criterion.
## Original Prompt
{prompt}
## Response to Evaluate
{response}
## Criteria
{criteria with descriptions and weights}
## Instructions
For each criterion:
1. Find specific evidence in the response
2. Justify your assessment with that evidence
3. Score according to the rubric (1-{max} scale)
4. Suggest one specific improvement
## Output Format
Respond with structured JSON containing scores, justifications, and summary.
Critical: Always require the justification BEFORE the score; this improves reliability by 15-25%.
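A minimal sketch of how the judge's JSON reply might be consumed. The reply schema (`criteria` mapping each criterion to a `justification` and a `score`) and the function name `aggregate_scores` are assumptions made for illustration; the template above only asks for "structured JSON containing scores, justifications, and summary". The sketch rejects replies with an empty justification and folds per-criterion scores into one weighted value normalized to the range 0 to 1.

```python
import json


def aggregate_scores(judge_json: str, weights: dict[str, float], max_score: int = 5) -> float:
    """Parse a judge reply (assumed schema) and return a weighted score in [0, 1]."""
    data = json.loads(judge_json)
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    weighted = 0.0
    for name, weight in weights.items():
        entry = data["criteria"][name]
        # A score without reasoning is not trusted: the prompt asks the
        # judge to justify first, so an empty justification is an error.
        if not str(entry.get("justification", "")).strip():
            raise ValueError(f"missing justification for {name!r}")
        score = entry["score"]
        if not 1 <= score <= max_score:
            raise ValueError(f"score for {name!r} is outside 1-{max_score}")
        # Map the 1..max_score scale onto 0..1 before weighting.
        weighted += weight * (score - 1) / (max_score - 1)
    return weighted / total_weight


if __name__ == "__main__":
    reply = json.dumps({
        "criteria": {
            "accuracy": {"justification": "All dates match the prompt.", "score": 5},
            "clarity": {"justification": "Two sentences are ambiguous.", "score": 3},
        },
        "summary": "Accurate but uneven.",
    })
    print(aggregate_scores(reply, {"accuracy": 2.0, "clarity": 1.0}))
```

Normalizing before weighting keeps the result comparable across rubrics that use different scale lengths.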
Pairwise Comparison Implementation
Position Bias Mitigation Protocol:
- First pass: A in first position, B in second
- Second pass: B in first position, A in second
- Consistency check: If passes disagree, return TIE
- Final verdict: Consistent winner with averaged confidence
## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to specified criteria
- Ties are acceptable when genuinely equivalent
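The position-swap protocol above can be sketched as follows. The `Judge` callable is hypothetical: it stands in for whatever LLM call is used and is assumed to return a positional verdict (`"first"`, `"second"`, or `"tie"`) plus a confidence in the range 0 to 1. Returning a confidence of 0.0 when the passes disagree is also an assumption; the protocol only says to return TIE.

```python
from typing import Callable, Tuple

# Hypothetical judge: (first_response, second_response) -> (verdict, confidence)
Judge = Callable[[str, str], Tuple[str, float]]


def compare_with_swap(judge: Judge, a: str, b: str) -> Tuple[str, float]:
    """Run both orderings and accept a winner only if the passes agree."""
    verdict_1, conf_1 = judge(a, b)  # first pass: A in first position
    verdict_2, conf_2 = judge(b, a)  # second pass: B in first position

    # Translate positional verdicts back to the labels A and B.
    label_1 = {"first": "A", "second": "B", "tie": "TIE"}[verdict_1]
    label_2 = {"first": "B", "second": "A", "tie": "TIE"}[verdict_2]

    if label_1 != label_2:
        # Disagreement suggests the verdict depended on position.
        return "TIE", 0.0
    return label_1, (conf_1 + conf_2) / 2


if __name__ == "__main__":
    # A judge that always picks the first slot is purely position-biased.
    biased = lambda first, second: ("first", 0.9)
    print(compare_with_swap(biased, "answer one", "answer two"))
```

A judge that always favors the first slot produces contradictory labels across the two passes, so the sketch reports a tie instead of a spurious winner.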
Rubric Generation
Components:
- Level descriptions with clear boundaries
- Observable characteristics for each level
- Examples for each level
- Edge case guidance
- General scoring principles
Strictness levels:
- Lenient: Lower bar, encourages iteration
- Balanced: Typical production use
- Strict: High-stakes or safety-critical
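One possible way to hold these components in code, shown as a sketch. The class names `Level` and `Rubric`, the field names, and the wording attached to each strictness level are illustrative assumptions, not part of the skill. The contiguous-score check is one simple way to enforce "clear boundaries" between levels.

```python
from dataclasses import dataclass, field

# Hypothetical instruction text for each strictness level named above.
STRICTNESS = {
    "lenient": "Apply a lower bar to encourage iteration.",
    "balanced": "Apply the bar expected in typical production use.",
    "strict": "Apply a high bar suited to high-stakes or safety-critical work.",
}


@dataclass
class Level:
    score: int
    description: str            # what the level means
    characteristics: list[str]  # observable traits at this level
    example: str                # a short example response


@dataclass
class Rubric:
    criterion: str
    levels: list[Level]
    strictness: str = "balanced"
    edge_cases: list[str] = field(default_factory=list)
    principles: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.strictness not in STRICTNESS:
            raise ValueError(f"unknown strictness: {self.strictness!r}")
        scores = sorted(level.score for level in self.levels)
        # Contiguous integer scores leave no gap between level boundaries.
        if not scores or scores != list(range(scores[0], scores[-1] + 1)):
            raise ValueError("level scores must be contiguous integers")

    def render(self) -> str:
        """Render the rubric as plain text for inclusion in a judge prompt."""
        lines = [f"Criterion: {self.criterion}",
                 f"Strictness: {STRICTNESS[self.strictness]}"]
        for level in sorted(self.levels, key=lambda lv: lv.score):
            lines.append(f"{level.score}: {level.description}")
            lines.extend(f"  - {trait}" for trait in level.characteristics)
            lines.append(f"  Example: {level.example}")
        if self.edge_cases:
            lines.append("Edge cases:")
            lines.extend(f"  - {case}" for case in self.edge_cases)
        if self.principles:
            lines.append("Scoring principles:")
            lines.extend(f"  - {rule}" for rule in self.principles)
        return "\n".join(lines)
```

The rendered text can be substituted into the `{criteria with descriptions and weights}` slot of a scoring prompt.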
Decision Framework
Is there objective ground truth?
├── Yes → Direct Scoring
│   (factual accuracy, instruction following)
└── No → Is it a preference judgment?
    ├── Yes → Pairwise Comparison
    │   (tone, style, persuasiveness)
    └── No → Reference-based evaluation
        (summarization, translation)
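The decision tree above maps directly onto two boolean questions. A minimal sketch, where the function name and the returned string labels are assumptions chosen for illustration:

```python
def choose_method(has_ground_truth: bool, is_preference: bool) -> str:
    """Pick an evaluation method by walking the decision tree top-down."""
    if has_ground_truth:
        # Objective criteria such as factual accuracy or instruction following.
        return "direct_scoring"
    if is_preference:
        # Subjective criteria such as tone, style, or persuasiveness.
        return "pairwise_comparison"
    # Neither: compare against a reference (summarization, translation).
    return "reference_based"


if __name__ == "__main__":
    print(choose_method(has_ground_truth=False, is_preference=True))
```

The ground-truth question is checked first, so a task with objective ground truth is routed to direct scoring even when it also has a stylistic component.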
Scaling Evaluation
| Approach | Use Case | Trade-off |
|---|---|---|
| Panel of LLMs | High-stakes decisions | More expensive, more reliable |
| Hierarchical | Large volumes | Fast screening + careful edge cases |
| Human-in-loop | Critical applications | Best reliability, feedback loop |
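Two of the approaches in the table can be sketched briefly. The names `panel_verdict` and `hierarchical_score`, the strict-majority rule, and the 0.3 and 0.7 thresholds are all assumptions for illustration; the scorers are hypothetical callables standing in for a cheap and an expensive judge, each returning a score in the range 0 to 1.

```python
from collections import Counter
from typing import Callable, Tuple

Scorer = Callable[[str], float]  # hypothetical judge returning a score in [0, 1]


def panel_verdict(votes: list[str]) -> str:
    """Panel of LLMs: return the label with a strict majority, else TIE."""
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "TIE"


def hierarchical_score(response: str, fast: Scorer, careful: Scorer,
                       low: float = 0.3, high: float = 0.7) -> Tuple[float, str]:
    """Hierarchical: screen with the fast judge, escalate only unclear cases."""
    score = fast(response)
    if low <= score <= high:
        # The fast judge is unsure, so spend the expensive call here.
        return careful(response), "careful"
    return score, "fast"


if __name__ == "__main__":
    print(panel_verdict(["A", "A", "B"]))
    print(hierarchical_score("some response", lambda r: 0.5, lambda r: 0.8))
```

Escalating only the middle band keeps the expensive judge's share of traffic proportional to how often the fast judge is uncertain.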
Guidelines
- Always require justification before scores
- Always swap positions in pairwise comparison
- Match scale granularity to rubric specificity
- Separate objective and subjective criteria
- Include confidence scores calibrated to consistency
- Define edge cases explicitly
- Validate against human judgments
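For the last guideline, one common way to validate a judge against human judgments is to compute chance-corrected agreement on a shared set of items. The sketch below uses Cohen's kappa; the choice of metric is an assumption, since the guideline does not name one.

```python
from collections import Counter


def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Agreement between judge and human labels, corrected for chance."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)

    # Fraction of items where the judge and the human gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Agreement expected if both labeled at random with their own label rates.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[lb] * human_counts[lb] for lb in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    judge = ["A", "A", "B", "B"]
    human = ["A", "A", "B", "A"]
    print(cohens_kappa(judge, human))  # 0.5
```

A value near 1 indicates the judge tracks human labels closely, while a value near 0 indicates agreement no better than chance.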


