
llm-judge

by existential-birds

Claude Code plugin for code review skills and verification workflows. Python, Go, React, FastAPI, BubbleTea, and AI frameworks (Pydantic AI, LangGraph, Vercel AI SDK).

⭐ 11 · 🍴 2 · 📅 Jan 24, 2026

SKILL.md


name: llm-judge
description: LLM-as-judge methodology for comparing code implementations across repositories. Scores implementations on functionality, security, test quality, overengineering, and dead code using weighted rubrics. Used by /beagle:llm-judge command.

LLM Judge Skill

Compare code implementations across 2+ repositories using structured evaluation.

Overview

This skill implements a two-phase LLM-as-judge evaluation:

  1. Phase 1: Fact Gathering - Parallel agents explore each repo and extract structured facts
  2. Phase 2: Judging - Parallel judges score each dimension using consistent rubrics
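
The overall shape of the run can be sketched as below. This is a minimal illustration, not the skill's implementation: `build_repo_prompt`, `build_judge_prompt`, and `spawn_task` are hypothetical callables standing in for the Task tool and the prompt templates described later in this document.

```python
from concurrent.futures import ThreadPoolExecutor

def run_evaluation(repos, dimensions, spec,
                   build_repo_prompt, build_judge_prompt, spawn_task):
    """Two-phase flow: parallel fact gathering per repo, then parallel judging per dimension."""
    # Phase 1: one fact-gathering agent per repository.
    with ThreadPoolExecutor() as pool:
        facts = dict(zip(repos, pool.map(
            lambda repo: spawn_task(build_repo_prompt(repo, spec)), repos)))
    # Phase 2: one judge per scoring dimension; every judge sees all the facts.
    with ThreadPoolExecutor() as pool:
        scores = dict(zip(dimensions, pool.map(
            lambda dim: spawn_task(build_judge_prompt(dim, spec, facts)), dimensions)))
    return facts, scores
```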

Reference Files

| File | Purpose |
| --- | --- |
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |

Scoring Dimensions

| Dimension | Default Weight | Evaluates |
| --- | --- | --- |
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
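
A minimal sketch of how the default weights can be represented; the dictionary keys and this particular data structure are assumptions for illustration, not part of the skill's schema.

```python
# Default dimension weights (percentages). Assumed representation for illustration;
# the skill's actual data shapes live in references/fact-schema.md.
DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "test_quality": 20,
    "overengineering": 15,
    "dead_code": 10,
}

assert sum(DEFAULT_WEIGHTS.values()) == 100  # weights are percentages summing to 100
```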

Scoring Scale

| Score | Meaning |
| --- | --- |
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |

Phase 1: Spawning Repo Agents

For each repository, spawn a Task agent with:

You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT

**Instructions:** Read @beagle:llm-judge references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.

Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.

Return ONLY valid JSON, no markdown or explanations.
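
The sketch below shows one way the placeholders in the template above could be filled per repository. It is an illustration under assumptions: `spawn_task` is a hypothetical stand-in for launching a Task agent, the prompt text is abridged, and `repos` is assumed to map labels to local paths.

```python
import json
from string import Template

# Abridged version of the Phase 1 prompt above; only the placeholders matter here.
REPO_PROMPT = Template(
    "You are a Phase 1 Repo Agent for the LLM Judge evaluation.\n\n"
    "**Your Repo:** $REPO_LABEL at $REPO_PATH\n"
    "**Spec Document:**\n$SPEC_CONTENT\n\n"
    "Gather facts and return ONLY valid JSON following references/fact-schema.md."
)

def gather_facts(repos, spec_content, spawn_task):
    """Spawn one fact-gathering agent per repository and parse its JSON reply."""
    facts = {}
    for label, path in repos.items():
        prompt = REPO_PROMPT.substitute(
            REPO_LABEL=label, REPO_PATH=path, SPEC_CONTENT=spec_content)
        facts[label] = json.loads(spawn_task(prompt))  # agent must return valid JSON
    return facts
```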

Phase 2: Spawning Judge Agents

After all Phase 1 agents complete, spawn 5 judge agents (one per dimension):

You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:** Read @beagle:llm-judge references/judge-agents.md

Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.

Return ONLY valid JSON following the judge output schema.
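
A corresponding sketch for Phase 2, again under assumptions: `spawn_task` is hypothetical, the prompt is abridged, and the judge reply is assumed to map repo labels to 1-5 scores.

```python
import json
from string import Template

# Abridged version of the judge prompt above; only the placeholders matter here.
JUDGE_PROMPT = Template(
    "You are the $DIMENSION Judge for the LLM Judge evaluation.\n\n"
    "**Spec Document:**\n$SPEC_CONTENT\n\n"
    "**Facts from all repos:**\n$ALL_FACTS_JSON\n\n"
    "Score each repo on $DIMENSION and return ONLY valid JSON."
)

def run_judges(dimensions, spec_content, all_facts, spawn_task):
    """Spawn one judge per scoring dimension; every judge sees the facts for all repos."""
    scores = {}
    for dim in dimensions:
        reply = spawn_task(JUDGE_PROMPT.substitute(
            DIMENSION=dim,
            SPEC_CONTENT=spec_content,
            ALL_FACTS_JSON=json.dumps(all_facts, indent=2),
        ))
        scores[dim] = json.loads(reply)  # assumed shape: {repo_label: 1-5 score}
    return scores
```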

Aggregation

After Phase 2 completes:

  1. Collect scores from all 5 judges
  2. For each repo, compute the weighted total (see the sketch after this list):
    weighted_total = sum(score[dim] * weight[dim]) / 100
  3. Rank repos by weighted total (descending)
  4. Generate verdict explaining the ranking
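
A minimal sketch of the aggregation formula above, assuming `scores[dim][repo]` holds a 1-5 judge score and `weights[dim]` a percentage; the function name and data shapes are illustrative, not the skill's API.

```python
def aggregate(scores, weights):
    """Compute weighted totals per repo and rank them, highest first."""
    repos = next(iter(scores.values())).keys()
    totals = {
        repo: sum(scores[dim][repo] * weights[dim] for dim in weights) / 100
        for repo in repos
    }
    # Rank by weighted total, descending; the verdict explains this ordering.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

With 1-5 scores and weights summing to 100, the weighted total stays on the same 1-5 scale, which keeps the ranking directly comparable to the per-dimension rubric.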

Output

Write results to .beagle/llm-judge-report.json and display a markdown summary.
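
A sketch of the report-writing step; only the .beagle/llm-judge-report.json path comes from this document, while the field names (`ranking`, `weighted_total`, `scores`) are assumptions for illustration.

```python
import json
from pathlib import Path

def write_report(ranking, scores, path=".beagle/llm-judge-report.json"):
    """Persist the evaluation results as JSON; field names here are assumed, not the skill's schema."""
    report = {
        "ranking": [{"repo": repo, "weighted_total": total} for repo, total in ranking],
        "scores": scores,
    }
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(report, indent=2))
    return report
```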

Dependencies

  • @beagle:llm-artifacts-detection - Reused by repo agents for dead code/overengineering

Score

Total Score: 75/100

Based on repository quality metrics

| Criterion | Requirement | Points |
| --- | --- | --- |
| SKILL.md | Includes a SKILL.md file | +20 |
| LICENSE | A license is set | +10 |
| Description | Has a description of 100+ characters | +10 |
| Popularity | 100 or more GitHub stars | 0/15 |
| Recent activity | Updated within the last month | +10 |
| Forks | Forked 10 or more times | 0/5 |
| Issue management | Fewer than 50 open issues | +5 |
| Language | A programming language is set | +5 |
| Tags | At least one tag is set | +5 |
