
pop-benchmark-runner
by jrc1883
AI-powered development workflow automation for Claude Code. Modular plugin suite with 23 commands, 38 skills, and 22 specialized agents for professional software development.
SKILL.md
---
name: pop-benchmark-runner
description: "Orchestrates benchmark execution comparing PopKit vs baseline Claude Code"
version: "1.0.0"
category: "operations"
context: "fork"
---
Benchmark Runner Skill
Overview
Automates the execution of benchmark tasks that quantitatively measure PopKit's value by comparing AI-assisted development with PopKit enabled vs without PopKit (baseline Claude Code).
Core Functions
- Orchestrated Execution - Spawns trials in separate windows for side-by-side viewing
- Task Execution - Runs benchmark tasks in isolated git worktrees
- Recording Collection - Captures all tool calls, durations, and results
- Statistical Analysis - Calculates t-tests, effect sizes, confidence intervals
- Report Generation - Creates markdown and HTML reports with visualizations
Orchestrated Execution (Option B Architecture)
The benchmark runner uses an orchestrator pattern where the current Claude session becomes the orchestrator and spawns trial sessions in separate terminal windows.
Architecture
┌─────────────────────────────────────────────────────────┐
│ Current Claude Session (Orchestrator) │
│ │
│ 1. Load task definition and responses │
│ 2. For each trial: │
│ ├─ Spawn WITH PopKit window → New terminal │
│ ├─ Spawn BASELINE window → New terminal │
│ └─ Monitor via recording files (poll every 3s) │
│ 3. Collect all recordings when complete │
│ 4. Run statistical analysis │
│ 5. Generate and open HTML report │
└─────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ WITH PopKit │ │ BASELINE │
│ Terminal Window │ │ Terminal Window │
│ │ │ │
│ Claude Code │ │ Claude Code │
│ + PopKit enabled │ │ PopKit disabled │
│ │ │ │
│ Recording: JSON │ │ Recording: JSONL │
└──────────────────┘ └──────────────────┘
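The numbered steps above map onto a run loop like the one below. This is an illustrative sketch of the orchestrator flow, not the actual BenchmarkOrchestrator source; apart from _monitor_trials (shown later in this document), the helper functions are hypothetical names.
from pathlib import Path

def run_benchmark(task_id: str, trials: int) -> Path:
    """Illustrative orchestrator flow; helper functions are hypothetical."""
    task_def = load_task_definition(task_id)             # step 1: task definition + responses
    session_ids = []
    for trial in range(1, trials + 1):                   # step 2: spawn both configurations
        session_ids.append(spawn_trial(task_def, trial, with_popkit=True))
        session_ids.append(spawn_trial(task_def, trial, with_popkit=False))
    _monitor_trials(session_ids)                         # poll recording files (see below)
    recordings = collect_recordings(session_ids)         # step 3: gather recordings
    results = analyze(recordings)                        # step 4: statistical analysis
    return generate_report(results)                      # step 5: path to HTML report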
User Experience Flow
# User runs in current Claude session
/popkit-ops:benchmark run jwt-authentication --trials 3
# Orchestrator takes over:
🚀 PopKit Benchmark Suite
▶ Trial 1/3 WITH PopKit - Launching window...
[New terminal window opens → user sees Claude working with PopKit]
▶ Trial 1/3 BASELINE - Launching window...
[New terminal window opens → user sees vanilla Claude working]
⏳ Monitoring trials... (watch the windows work)
✓ Trial 1 WITH PopKit completed (45s)
✓ Trial 1 BASELINE completed (68s)
...
📊 Analyzing results...
📈 Generating HTML report...
🎉 Opening report in browser...
Cross-Platform Window Spawning
Windows:
cmd = f'start "Claude Benchmark - {session_id}" cmd /k "cd /d {worktree_path} && claude "{prompt}""'
subprocess.Popen(cmd, env=env, shell=True)
Mac:
script = f'''
tell application "Terminal"
do script "cd '{worktree_path}' && claude '{prompt}'"
activate
end tell
'''
subprocess.Popen(["osascript", "-e", script], env=env)
Linux:
cmd = ["gnome-terminal", "--", "bash", "-c", f"cd '{worktree_path}' && claude '{prompt}'; exec bash"]
subprocess.Popen(cmd, env=env)
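A single dispatcher can pick the right spawning strategy at runtime. The sketch below simply combines the three snippets above behind one function; spawn_trial_window is an illustrative name, not part of the PopKit API.
import platform
import subprocess

def spawn_trial_window(session_id: str, worktree_path: str, prompt: str, env: dict) -> None:
    """Open a new terminal window running Claude Code inside the trial worktree."""
    system = platform.system()
    if system == "Windows":
        cmd = f'start "Claude Benchmark - {session_id}" cmd /k "cd /d {worktree_path} && claude "{prompt}""'
        subprocess.Popen(cmd, env=env, shell=True)
    elif system == "Darwin":  # macOS
        script = f'''
        tell application "Terminal"
            do script "cd '{worktree_path}' && claude '{prompt}'"
            activate
        end tell
        '''
        subprocess.Popen(["osascript", "-e", script], env=env)
    else:  # Linux; assumes gnome-terminal is installed
        cmd = ["gnome-terminal", "--", "bash", "-c",
               f"cd '{worktree_path}' && claude '{prompt}'; exec bash"]
        subprocess.Popen(cmd, env=env)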
Monitoring Trials
The orchestrator monitors trial completion by polling recording files:
import time
from typing import List

def _monitor_trials(session_ids: List[str], timeout: int = 3600):
    """Wait for all trials to complete, polling their recording files."""
    completed = set()
    deadline = time.time() + timeout
    while len(completed) < len(session_ids):
        if time.time() > deadline:
            raise TimeoutError(f"Trials still incomplete after {timeout}s")
        for session_id in session_ids:
            # Check for a finished recording for this trial
            recording_path = _find_recording(session_id)
            if recording_path and _is_complete(recording_path):
                completed.add(session_id)
        time.sleep(3)  # Poll every 3 seconds
Recording Locations:
- WITH PopKit: ~/.claude/popkit/recordings/<session_id>.json
- BASELINE: ~/.claude/projects/*/<session_id>.jsonl
Completion Detection:
- JSON: Check for a session_end event
- JSONL: Check that the file has content and the last line is valid JSON
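A completion check along these lines is a small file parse. The sketch below is illustrative; the exact structure of the PopKit JSON recording (a list of events with a type field, or a dict with an events key) is an assumption.
import json
from pathlib import Path

def _is_complete(recording_path: Path) -> bool:
    """Return True once a trial's recording indicates the session has finished."""
    text = recording_path.read_text().strip()
    if not text:
        return False
    if recording_path.suffix == ".json":
        # PopKit recording: look for a session_end event (event shape is assumed)
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return False
        events = data if isinstance(data, list) else data.get("events", [])
        return any(isinstance(e, dict) and e.get("type") == "session_end" for e in events)
    # Baseline JSONL: non-empty file whose last line parses as JSON
    try:
        json.loads(text.splitlines()[-1])
        return True
    except json.JSONDecodeError:
        return False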
Implementation
Main orchestrator entry point:
from benchmark_orchestrator import BenchmarkOrchestrator
orchestrator = BenchmarkOrchestrator(
task_id="jwt-authentication",
trials=3,
verbose=True
)
report_path = orchestrator.run()
# Returns: Path to HTML report, automatically opened in browser
Workflow
Phase 1: Setup
import os
from pathlib import Path
import yaml
from popkit_shared.utils.benchmark_responses import is_benchmark_mode, load_responses_file
# Load task definition
task_file = Path(f"packages/popkit-ops/tasks/{category}/{task_id}.yml")
task_def = yaml.safe_load(task_file.read_text())
# Enable benchmark responses so trials run without user interaction
os.environ["POPKIT_BENCHMARK_MODE"] = "true"
os.environ["POPKIT_BENCHMARK_RESPONSES"] = f"packages/popkit-ops/tasks/{category}/{task_id}-responses.json"
Phase 2: Worktree Creation
from pop_benchmark_runner.scripts.codebase_manager import CodebaseManager
manager = CodebaseManager()
# Create isolated worktree for each trial
for trial in range(1, trials + 1):
worktree_path = manager.create_worktree(
task_id=task_id,
trial_num=trial,
baseline_ref=task_def["initial_state"]
)
# Returns: benchmark-worktrees/<task-id>-<trial>-<timestamp>/
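Internally, worktree creation presumably wraps git worktree add. A minimal sketch (not the actual CodebaseManager), assuming initial_state resolves to a plain git ref such as baseline-v1.0:
import subprocess
import time
from pathlib import Path

def create_worktree(task_id: str, trial_num: int, baseline_ref: str) -> Path:
    """Create an isolated git worktree pinned to the task's baseline ref."""
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    worktree_path = Path("benchmark-worktrees") / f"{task_id}-{trial_num}-{timestamp}"
    subprocess.run(
        ["git", "worktree", "add", str(worktree_path), baseline_ref],
        check=True,
        capture_output=True,
    )
    return worktree_path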
Phase 3: Benchmark Execution
from pop_benchmark_runner.scripts.benchmark_runner import BenchmarkRunner
runner = BenchmarkRunner(task_def, trials=5)
# Run with PopKit (recording enabled)
with_popkit_recordings = runner.run_with_popkit()
# Run without PopKit (baseline, recording enabled)
without_popkit_recordings = runner.run_baseline()
# Returns: List[Path] - paths to recording JSON files
Phase 4: Analysis
from pop_benchmark_runner.scripts.benchmark_analyzer import BenchmarkAnalyzer
analyzer = BenchmarkAnalyzer(
with_popkit_recordings=with_popkit_recordings,
baseline_recordings=without_popkit_recordings
)
results = analyzer.analyze()
# Returns: {
# "metrics": { metric_name: { "with_popkit": ..., "baseline": ..., "improvement": ..., "p_value": ..., "effect_size": ... } },
# "summary": { "significant_metrics": ..., "overall_effect_size": ... }
# }
Phase 5: Reporting
from pop_benchmark_runner.scripts.report_generator import ReportGenerator
generator = ReportGenerator(results, task_def)
# Generate markdown report
md_report = generator.generate_markdown()
# Generate HTML dashboard
html_report = generator.generate_html()
# Save reports
from datetime import datetime
timestamp = datetime.now().strftime("%Y-%m-%d")  # matches the file names under Output Structure
output_dir = Path("docs/benchmark/results")
output_dir.mkdir(parents=True, exist_ok=True)
(output_dir / f"{task_id}-{timestamp}.md").write_text(md_report)
(output_dir / f"{task_id}-{timestamp}.html").write_text(html_report)
Metrics Collected
1. Context Usage (Tokens)
Source: routine_measurement.py token estimation
from popkit_shared.utils.routine_measurement import RoutineMeasurement
measurement = RoutineMeasurement.from_recording(recording_path)
context_usage = measurement.total_tokens
2. Tool Calls
Source: recording_analyzer.py tool usage breakdown
from popkit_shared.utils.recording_analyzer import RecordingAnalyzer
analyzer = RecordingAnalyzer(recording_path)
tool_usage = analyzer.get_tool_usage_breakdown()
total_tool_calls = sum(tool["count"] for tool in tool_usage.values())
3. Backtracking (Code Reverts)
Source: transcript_parser.py file edit detection
from popkit_shared.utils.transcript_parser import parse_transcript
transcript = parse_transcript(recording_path)
file_edits = {}
for event in transcript:
if event["type"] in ["Write", "Edit"]:
file_path = event["parameters"].get("file_path")
if file_path not in file_edits:
file_edits[file_path] = []
file_edits[file_path].append(event)
# Count files edited multiple times (backtracking)
backtracking_count = sum(1 for edits in file_edits.values() if len(edits) > 1)
4. Error Recovery
Source: recording_analyzer.py error summary
analyzer = RecordingAnalyzer(recording_path)
error_summary = analyzer.get_error_summary()
error_count = error_summary["total_errors"]
error_rate = error_summary["error_rate"]
5. Time to Complete
Source: recording_analyzer.py performance metrics
analyzer = RecordingAnalyzer(recording_path)
performance = analyzer.get_performance_metrics()
duration_ms = performance["total_duration_ms"]
6. Code Quality
Source: Verification command exit codes
import subprocess
verification_commands = task_def["verification"]
# E.g., ["npm test", "npm run lint", "npx tsc --noEmit"]
quality_results = []
for cmd in verification_commands:
    result = subprocess.run(cmd.split(), capture_output=True)
    quality_results.append(result.returncode == 0)
pass_rate = sum(quality_results) / len(quality_results)
Statistical Analysis
T-Test (Statistical Significance)
from scipy import stats
import numpy as np
# Extract metric values from each trial
with_popkit_values = [trial["context_usage"] for trial in with_popkit_trials]
baseline_values = [trial["context_usage"] for trial in baseline_trials]
# Perform independent samples t-test
t_statistic, p_value = stats.ttest_ind(with_popkit_values, baseline_values)
# Interpret: p < 0.05 means statistically significant difference
is_significant = p_value < 0.05
Cohen's d (Effect Size)
def cohens_d(group1, group2):
"""Calculate Cohen's d effect size."""
mean1, mean2 = np.mean(group1), np.mean(group2)
std1, std2 = np.std(group1, ddof=1), np.std(group2, ddof=1)
# Pooled standard deviation
n1, n2 = len(group1), len(group2)
pooled_std = np.sqrt(((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2))
# Effect size
d = (mean1 - mean2) / pooled_std
return abs(d)
effect_size = cohens_d(with_popkit_values, baseline_values)
# Interpret (conventional thresholds):
# d ≈ 0.2: small effect
# d ≈ 0.5: medium effect
# d ≈ 0.8: large effect
Confidence Intervals
def confidence_interval(data, confidence=0.95):
"""Calculate confidence interval for mean."""
mean = np.mean(data)
sem = stats.sem(data) # Standard error of mean
ci = sem * stats.t.ppf((1 + confidence) / 2, len(data) - 1)
return (mean - ci, mean + ci)
with_popkit_ci = confidence_interval(with_popkit_values)
baseline_ci = confidence_interval(baseline_values)
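Putting the pieces together, one per-metric entry of the Phase 4 results structure can be assembled from these calculations. The sketch below reuses stats, np, and cohens_d from the snippets above; the improvement formula (treating lower values as better, as for tokens, time, and errors) is an assumption.
def analyze_metric(with_popkit_values, baseline_values):
    """Combine significance, effect size, and means into one metric entry."""
    _, p_value = stats.ttest_ind(with_popkit_values, baseline_values)
    with_mean = float(np.mean(with_popkit_values))
    base_mean = float(np.mean(baseline_values))
    return {
        "with_popkit": with_mean,
        "baseline": base_mean,
        # Relative improvement, assuming lower values are better for this metric
        "improvement": (base_mean - with_mean) / base_mean if base_mean else 0.0,
        "p_value": float(p_value),
        "effect_size": cohens_d(with_popkit_values, baseline_values),
    }

metrics = {"context_usage": analyze_metric(with_popkit_values, baseline_values)}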
Task Definition Format
Task definitions are YAML files in packages/popkit-ops/tasks/<category>/<task-id>.yml:
id: jwt-authentication
category: feature-addition
description: Add JWT-based user authentication to Express API
codebase: demo-app-express
initial_state: git checkout baseline-v1.0
user_prompt: |
Implement JWT authentication with:
- POST /auth/login endpoint (username/password)
- JWT token generation with 1-hour expiry
- Protected middleware for authenticated routes
- Error handling for invalid credentials
verification:
- npm test
- npm run lint
- npx tsc --noEmit
expected_outcomes:
- "/auth/login endpoint exists"
- "Tests pass for authentication flow"
- "Protected routes return 401 without token"
Benchmark Response Files
Response files enable automation without user interaction (<task-id>-responses.json):
{
"responses": {
"Auth method": "JWT (jsonwebtoken library)",
"Token storage": "HTTP-only cookies (security best practice)",
"Token expiry": "1 hour (3600 seconds)",
"Error handling": "Standard HTTP status codes (401, 403, 500)"
},
"standardAutoApprove": [
"install.*dependencies",
"run.*tests",
"commit.*changes"
]
}
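During an automated trial, questions from the assistant can be answered by matching against the responses map, with standardAutoApprove entries treated as regexes for approval prompts. The resolver below is an illustrative sketch, not the actual popkit_shared.utils.benchmark_responses implementation.
import json
import re
from pathlib import Path
from typing import Optional

def resolve_response(prompt: str, responses_file: Path) -> Optional[str]:
    """Answer a benchmark prompt from the response file, or auto-approve it."""
    config = json.loads(responses_file.read_text())
    # Substring match against the configured question keys
    for question, answer in config.get("responses", {}).items():
        if question.lower() in prompt.lower():
            return answer
    # Regex patterns that should simply be approved (e.g. "install.*dependencies")
    for pattern in config.get("standardAutoApprove", []):
        if re.search(pattern, prompt, re.IGNORECASE):
            return "yes"
    return None  # no automated answer available; caller decides how to proceed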
Environment Variables
POPKIT_RECORD=true # Enable session recording
POPKIT_BENCHMARK_MODE=true # Enable benchmark automation
POPKIT_BENCHMARK_RESPONSES=<path-to-responses.json> # Response file
POPKIT_COMMAND=benchmark-<task-id> # Command name for recording
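These variables are passed to the spawned trial through the env argument of subprocess.Popen. A sketch of building that environment (whether baseline trials also receive the benchmark-mode variables is an assumption here):
import os

def trial_env(task_id: str, responses_path: str, with_popkit: bool) -> dict:
    """Build the environment for a spawned trial window."""
    env = os.environ.copy()
    env["POPKIT_RECORD"] = "true"
    env["POPKIT_COMMAND"] = f"benchmark-{task_id}"
    if with_popkit:
        env["POPKIT_BENCHMARK_MODE"] = "true"
        env["POPKIT_BENCHMARK_RESPONSES"] = responses_path
    return env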
Output Structure
docs/benchmark/results/
├── jwt-authentication-2025-01-15.md # Markdown report
├── jwt-authentication-2025-01-15.html # HTML dashboard
├── jwt-authentication-2025-01-15.json # Raw data
└── index.html # Dashboard landing page
Error Handling
Worktree Creation Failures
try:
worktree_path = manager.create_worktree(task_id, trial, baseline_ref)
except WorktreeExistsError:
# Cleanup and retry
manager.cleanup_worktree(task_id, trial)
worktree_path = manager.create_worktree(task_id, trial, baseline_ref)
except GitError as e:
print(f"[ERROR] Git operation failed: {e}")
# Skip this trial, continue with others
Verification Command Failures
for cmd in verification_commands:
result = subprocess.run(cmd.split(), capture_output=True, timeout=300)
if result.returncode != 0:
print(f"[WARN] Verification command failed: {cmd}")
print(f" stdout: {result.stdout.decode()}")
print(f" stderr: {result.stderr.decode()}")
# Record failure but continue analysis
Recording Collection Failures
recordings = []
for recording_file in expected_recordings:
if not recording_file.exists():
print(f"[WARN] Recording not found: {recording_file}")
continue
try:
recordings.append(RecordingAnalyzer(recording_file))
except Exception as e:
print(f"[ERROR] Failed to load recording {recording_file}: {e}")
if len(recordings) < trials * 0.5: # Less than 50% success
raise InsufficientDataError("Too many recording failures for valid analysis")
Success Criteria
For a benchmark to be considered valid:
- ✅ At least 3 successful trials per configuration (with/without PopKit)
- ✅ All verification commands pass in with-PopKit configuration
- ✅ Statistical significance (p < 0.05) for at least 4/6 metrics
- ✅ Large effect size (Cohen's d > 0.8) for at least 3/6 metrics
- ✅ Consistent results (standard deviation < 20% of mean)
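Most of these criteria can be checked mechanically against the Phase 4 results structure. The sketch below is illustrative; it takes the verification outcome as a flag and leaves the consistency check (standard deviation vs mean) to the raw per-trial values.
def is_valid_benchmark(results: dict, successful_trials: int, verification_passed: bool) -> bool:
    """Apply the validity criteria to an analysis results dict (illustrative)."""
    metrics = results["metrics"]
    significant = sum(1 for m in metrics.values() if m["p_value"] < 0.05)
    large_effect = sum(1 for m in metrics.values() if m["effect_size"] > 0.8)
    return (
        successful_trials >= 3      # per configuration
        and verification_passed     # all verification commands pass with PopKit
        and significant >= 4        # at least 4/6 metrics statistically significant
        and large_effect >= 3       # at least 3/6 metrics with a large effect size
    )
# The consistency criterion (std dev < 20% of mean) requires the raw per-trial values.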
Related Files
- benchmark_orchestrator.py - NEW: Orchestrates parallel trials in separate windows
- benchmark_runner.py - Single trial execution (called by orchestrator)
- benchmark_analyzer.py - Statistical analysis
- codebase_manager.py - Git worktree management
- report_generator.py - Markdown/HTML reports
- ../../../shared-py/popkit_shared/utils/recording_analyzer.py - Metrics extraction
- ../../../shared-py/popkit_shared/utils/routine_measurement.py - Token tracking
- ../../../shared-py/popkit_shared/utils/benchmark_responses.py - Automation
Testing
Unit tests verify each component:
python -m pytest packages/popkit-ops/skills/pop-benchmark-runner/tests/ -v
Integration test runs a simple benchmark:
/popkit-ops:benchmark run simple-feature --trials 1