
judge
by aiskillstore
name: judge
description: Scoring framework for test-kitchen cookoff and omakase-off. Invoked at Phase 4 to evaluate implementations using 5-criteria scoring. Do not invoke directly - called by cookoff/omakase-off.
Test Kitchen Judge
Score implementations using the 5-criteria framework. Fill out ALL sections exactly as shown.
Terminology: This skill uses "impl" but works for both:
- Cookoff: impl-1, impl-2, impl-3 (same design, different implementations)
- Omakase: variant-a, variant-b (different approaches/designs)
REQUIRED OUTPUT FORMAT
You MUST produce this exact structure. Do not summarize or abbreviate.
## Gate Check
| Impl | Tests Pass | Design Adherence |
|------|------------|------------------|
| impl-1 | X/X ✓ or ✗ | Yes/No |
| impl-2 | X/X ✓ or ✗ | Yes/No |
## Feasibility Check
| Impl | Status | Notes |
|------|--------|-------|
| impl-1 | ✓ OK / ⚠️ Flag | Details |
| impl-2 | ✓ OK / ⚠️ Flag | Details |
## Scoring Worksheet
### impl-1
**Fitness for Purpose** (Does it solve the actual problem?)
*Functional requirements:*
- [ ] Primary use case works end-to-end?
- [ ] All explicitly stated requirements implemented?
- [ ] Handles realistic scenarios, not just happy path?
*User needs (beyond literal requirements):*
- [ ] Would the user actually use this, or just demo it?
- [ ] Does it solve the real problem, not just the literal request?
- [ ] Does deployment/distribution match stated needs?
*Future considerations (if relevant):*
- [ ] If growth/scaling mentioned, does architecture support it?
- [ ] If team/collaboration mentioned, is it maintainable by others?
Checklist: _/8 YES → **Score: _/5** (7-8=5, 5-6=4, 4=3, 2-3=2, 0-1=1)
*Note: Not all items apply to every project. Score based on relevant items.*
**Justified Complexity** (Every line earning its keep?)
- Unnecessary abstractions: ___
- Dead code: ___
- Bloat estimate: ___%
*Line count comparison (if multiple impls):*
- This impl: ___ lines
- Smallest impl: ___ lines
- Extra lines justified by: ___
→ **Score: _/5** (5=minimal, 4=slight bloat <10%, 3=10-25% bloat, 2=25-50%, 1=>50%)
**Readability** (Understand core flow in 5 min?)
Violations:
- [ ] Single-letter vars (not loop index): +1 each = __
- [ ] Functions >50 lines: +1 each = __
- [ ] Nesting >3 levels: +1 each = __
- [ ] Magic numbers: +1 each = __
- [ ] Bad function names: +1 each = __
Total violations: __ → **Score: _/5** (0=5, 1-2=4, 3-4=3, 5-7=2, 8+=1)
**Robustness & Scale** (Handles unexpected + growth?)
- [ ] Input validation?
- [ ] External call error handling?
- [ ] Useful error messages?
- [ ] Null/empty handling?
- [ ] Async timeouts?
- [ ] No unbounded loops?
- [ ] O(n log n) or better?
- [ ] Bounded memory?
- [ ] Queries paginated?
- [ ] No blocking I/O in hot path?
- [ ] Backoff/retry logic?
- [ ] Handles 10x load?
Checklist: _/12 YES + feasibility flags → **Score: _/5**
(11-12 + no flags=5, 9-10 or minor flag=4, 7-8=3, 5-6 or major flag=2, <5 or critical flag=1)
**Maintainability** (Pain of next change?)
- [ ] Single responsibility per function?
- [ ] Explicit dependencies (no globals)?
- [ ] Business logic separated from infra?
- [ ] New feature = ≤3 files changed?
- [ ] Config externalized?
- [ ] Tests catch regressions?
Checklist: _/6 YES → **Score: _/5** (6=5, 5=4, 4=3, 2-3=2, 0-1=1)
### impl-2
[REPEAT SAME FORMAT]
### impl-3 (if applicable)
[REPEAT SAME FORMAT]
## Judge Scorecard
| Criterion | impl-1 | impl-2 | impl-3 | Best |
|-----------|--------|--------|--------|------|
| Fitness for Purpose | | | | |
| Justified Complexity | | | | |
| Readability | | | | |
| Robustness & Scale | | | | |
| Maintainability | | | | |
| **TOTAL** | /25 | /25 | /25 | |
## Hard Gates
| Gate | Result |
|------|--------|
| Fitness Gate (Δ ≥ 2) | Triggered/Not triggered |
| Critical Flaw (any = 1) | Triggered/Not triggered |
## Winner Selection
**Winner: impl-X** (Score: __/25)
**Selection rationale:**
[2-3 sentences explaining WHY this implementation won]
**Trade-offs acknowledged:**
[What the other implementations did better]
Scoring Reference
Score Meanings
| Score | Meaning |
|---|---|
| 5 | Excellent - exceeds expectations |
| 4 | Good - fully meets requirements |
| 3 | Adequate - core works, some gaps |
| 2 | Poor - significant issues |
| 1 | Critical flaw - disqualifying |
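The worksheet turns checklist counts into these 1-5 scores using the bands printed next to each checklist. The bands for Fitness, Readability, and Maintainability can be sketched as plain lookups. This is a minimal illustration only: the function names are hypothetical, and the skill itself expects the judge to fill the worksheet by hand.

```python
def fitness_score(yes_count: int) -> int:
    """Fitness for Purpose: _/8 YES -> 7-8=5, 5-6=4, 4=3, 2-3=2, 0-1=1."""
    if not 0 <= yes_count <= 8:
        raise ValueError("Fitness checklist has 8 items")
    if yes_count >= 7:
        return 5
    if yes_count >= 5:
        return 4
    if yes_count == 4:
        return 3
    if yes_count >= 2:
        return 2
    return 1


def readability_score(violations: int) -> int:
    """Readability: total violations -> 0=5, 1-2=4, 3-4=3, 5-7=2, 8+=1."""
    if violations < 0:
        raise ValueError("violation count cannot be negative")
    if violations == 0:
        return 5
    if violations <= 2:
        return 4
    if violations <= 4:
        return 3
    if violations <= 7:
        return 2
    return 1


def maintainability_score(yes_count: int) -> int:
    """Maintainability: _/6 YES -> 6=5, 5=4, 4=3, 2-3=2, 0-1=1."""
    if not 0 <= yes_count <= 6:
        raise ValueError("Maintainability checklist has 6 items")
    if yes_count == 6:
        return 5
    if yes_count == 5:
        return 4
    if yes_count == 4:
        return 3
    if yes_count >= 2:
        return 2
    return 1
```

Every band returns an integer, which matches the rule that half points are not allowed.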
Hard Gates (Automatic)
- Fitness Gate: If Fitness Δ ≥ 2 between impls → Higher fitness WINS immediately
- Critical Flaw: If ANY criterion = 1 → That impl is ELIMINATED
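One way to read the two gates together is sketched below. The skill does not state the order in which the gates run, how the Fitness Gate applies with three impls, or how ties are broken, so this sketch assumes: eliminations happen first, the Fitness gap is measured between the top two surviving impls, and otherwise the highest total wins. The criterion keys and the `select_winner` name are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical short keys for the five criteria on the scorecard.
CRITERIA = ("fitness", "complexity", "readability", "robustness", "maintainability")


def select_winner(scores: Dict[str, Dict[str, int]]) -> Dict[str, object]:
    """Apply the hard gates, then fall back to the highest total (assumed order)."""
    # Critical Flaw gate: any criterion scored 1 eliminates that impl.
    eliminated: List[str] = sorted(
        name for name, s in scores.items() if any(s[c] == 1 for c in CRITERIA)
    )
    survivors = {n: s for n, s in scores.items() if n not in eliminated}
    winner: Optional[str] = None
    if not survivors:
        return {"winner": winner, "reason": "all eliminated", "eliminated": eliminated}

    # Fitness Gate: a gap of 2 or more between the top two Fitness scores
    # (an assumption for the 3-impl case) decides the winner immediately.
    ranked = sorted(survivors, key=lambda n: survivors[n]["fitness"], reverse=True)
    if len(ranked) >= 2:
        gap = survivors[ranked[0]]["fitness"] - survivors[ranked[1]]["fitness"]
        if gap >= 2:
            return {"winner": ranked[0], "reason": "fitness gate", "eliminated": eliminated}

    # No gate decided it: highest total out of 25 wins (ties go to the first seen).
    winner = max(survivors, key=lambda n: sum(survivors[n][c] for c in CRITERIA))
    return {"winner": winner, "reason": "highest total", "eliminated": eliminated}
```

For example, an impl scoring 5 on Fitness and 3 elsewhere (17/25) would beat one scoring 3 on Fitness and 5 elsewhere (23/25) under this reading, because the Fitness gap of 2 triggers the gate before totals are compared.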
Fitness Gate Interpretation
The Fitness Gate triggers the same way in both contexts, but means different things:
| Context | What Fitness Δ ≥ 2 Means |
|---|---|
| Cookoff | One implementation deviated from or misunderstood the design. All impls should have similar Fitness since they're implementing the same spec. A large gap is a red flag. |
| Omakase | One approach genuinely solves the problem better. Different approaches can legitimately have different Fitness. A large gap means one approach is clearly superior. |
In both cases, higher Fitness wins. The interpretation just explains why the gap exists.
Feasibility Red Flags
Check before scoring:
- O(n²) or worse on unbounded data
- Unbounded memory growth
- Self-DDoS patterns (polling, no backoff)
- Missing pagination
- Blocking I/O in hot path
- No error recovery
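These flags feed into the Robustness & Scale score alongside the 12-item checklist. The worksheet's bands ("9-10 or minor flag=4", "5-6 or major flag=2", and so on) do not say how a checklist count and a flag combine, so the sketch below assumes the lower of the two results applies. The `robustness_score` name and the flag labels as string values are illustrative choices.

```python
from typing import Optional

# Assumed cap each flag severity places on the score, read off the worksheet bands.
FLAG_CAP = {None: 5, "minor": 4, "major": 2, "critical": 1}


def robustness_score(yes_count: int, worst_flag: Optional[str] = None) -> int:
    """Combine the _/12 checklist count with the worst feasibility flag."""
    if not 0 <= yes_count <= 12:
        raise ValueError("Robustness checklist has 12 items")
    if worst_flag not in FLAG_CAP:
        raise ValueError("flag must be None, 'minor', 'major', or 'critical'")

    # Checklist bands: 11-12=5, 9-10=4, 7-8=3, 5-6=2, <5=1.
    if yes_count >= 11:
        base = 5
    elif yes_count >= 9:
        base = 4
    elif yes_count >= 7:
        base = 3
    elif yes_count >= 5:
        base = 2
    else:
        base = 1

    # Assumption: a flag can only lower the score, never raise it.
    return min(base, FLAG_CAP[worst_flag])
```

Under this reading a critical flag always yields a score of 1, which in turn triggers the Critical Flaw gate.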
Process
1. Read all implementation code (should already be in context)
2. Fill out the worksheet for EACH implementation - do not skip sections
3. Check hard gates
4. Announce winner with rationale
CRITICAL: Use integer scores only (1-5). Do not use half points like 4.5.
CRITICAL: Fill out every checkbox. Do not summarize or abbreviate the worksheet.
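The integer-only rule and the /25 total can be checked mechanically before the scorecard is reported. The following is a small sketch under assumed names (`CRITERIA`, `total_score`); it rejects half points, out-of-range values, and missing criteria.

```python
from typing import Dict

# Hypothetical short keys for the five criteria on the scorecard.
CRITERIA = ("fitness", "complexity", "readability", "robustness", "maintainability")


def total_score(scorecard: Dict[str, object]) -> int:
    """Validate one impl's scorecard and return its total out of 25."""
    missing = [c for c in CRITERIA if c not in scorecard]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for criterion in CRITERIA:
        value = scorecard[criterion]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{criterion}: score must be an integer, got {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{criterion}: score must be between 1 and 5, got {value}")
    return sum(scorecard[c] for c in CRITERIA)
```

A scorecard of all 4s totals 20, while a 4.5 on any criterion raises `ValueError`.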
