
---
name: Benchmark Manager
description: Create and manage AILANG eval benchmarks. Use when user asks to create benchmarks, fix benchmark issues, debug failing benchmarks, or analyze benchmark results.
---
# Benchmark Manager
Manage AILANG evaluation benchmarks with correct prompt integration, debugging workflows, and best practices learned from real benchmark failures.
## Quick Start

**Debugging a failing benchmark:**

```bash
# 1. Show the full prompt that models see
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse

# 2. Test a benchmark with a specific model
ailang eval-suite --models claude-haiku-4-5 --benchmarks json_parse

# 3. Check benchmark YAML for common issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/json_parse.yml
```
## When to Use This Skill
Invoke this skill when:
- User asks to create a new benchmark
- User asks to debug/fix a failing benchmark
- User wants to understand why models generate wrong code
- User asks about benchmark YAML format
- Benchmarks show 0% pass rate despite language support
## CRITICAL: `prompt:` vs `task_prompt:`
This is the most important concept for benchmark management.
### The Problem (v0.4.8 Discovery)
Benchmarks have TWO different prompt fields with VERY different behavior:
| Field | Behavior | Use When |
|---|---|---|
| `prompt:` | REPLACES the teaching prompt entirely | Testing raw model capability (rare) |
| `task_prompt:` | APPENDS to the teaching prompt | Normal benchmarks (99% of cases) |
### Why This Matters

```yaml
# BAD - the model never sees AILANG syntax!
prompt: |
  Write a program that prints "Hello"

# GOOD - the model sees the teaching prompt + task
task_prompt: |
  Write a program that prints "Hello"
```

With `prompt:`, models generate Python or pseudo-code because they never learn AILANG syntax.
### How Prompts Combine

From `internal/eval_harness/spec.go` (lines 91-93):

```go
fullPrompt := basePrompt // Teaching prompt from prompts/v0.4.x.md
if s.TaskPrompt != "" {
    fullPrompt = fullPrompt + "\n\n## Task\n\n" + s.TaskPrompt
}
```

The teaching prompt teaches AILANG syntax; `task_prompt` adds the specific task.
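As a rough illustration of the same concatenation outside the harness, the shell sketch below builds a combined prompt by hand. The teaching prompt path (`prompts/v0.4.0.md`) and the task file (`/tmp/task.txt`) are hypothetical placeholders; in practice, `show_full_prompt.sh` does this for you.

```bash
# Minimal sketch of the basePrompt + "## Task" + TaskPrompt concatenation.
# Paths are placeholders - substitute the real teaching prompt version.
cat prompts/v0.4.0.md    >  /tmp/full_prompt.txt
printf '\n\n## Task\n\n' >> /tmp/full_prompt.txt
cat /tmp/task.txt        >> /tmp/full_prompt.txt   # your task_prompt text
less /tmp/full_prompt.txt
```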
## Available Scripts

### `scripts/show_full_prompt.sh`

Shows the complete prompt that models receive for a benchmark.

Usage:

```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh <benchmark_id>

# Example:
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse
```
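For a quick sanity check that the task was appended rather than replacing the teaching prompt, one option is to count the `## Task` separator in the script's output. This assumes the script prints the combined prompt to stdout, as the usage above suggests.

```bash
# Should print 1 when task_prompt: is used; 0 suggests prompt: replaced everything.
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse | grep -c '^## Task'
```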
### `scripts/check_benchmark.sh`

Validates a benchmark YAML file for common issues.

Usage:

```bash
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/<name>.yml
```

Checks for:

- Using `prompt:` instead of `task_prompt:` (warning)
- Missing required fields
- Invalid capability names
- Syntax errors in YAML
### `scripts/test_benchmark.sh`

Runs a quick single-model test of a benchmark.

Usage:

```bash
.claude/skills/benchmark-manager/scripts/test_benchmark.sh <benchmark_id> [model]

# Examples:
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse claude-haiku-4-5
```
## Benchmark YAML Format

### Required Fields

```yaml
id: my_benchmark            # Unique identifier (snake_case)
description: "Short description of what this tests"
languages: ["python", "ailang"]
entrypoint: "main"          # Function to call
caps: ["IO"]                # Required capabilities
difficulty: "easy|medium|hard"
expected_gain: "low|medium|high"
task_prompt: |              # ALWAYS use task_prompt, not prompt!
  Write a program in <LANG> that:
  1. Does something
  2. Prints the result
  Output only the code, no explanations.
expected_stdout: |          # Exact expected output
  expected output here
```
### Capability Names

Valid capabilities: `IO`, `FS`, `Clock`, `Net`

```yaml
# File I/O
caps: ["IO"]

# HTTP requests
caps: ["Net", "IO"]

# File system operations
caps: ["FS", "IO"]
```
## Creating New Benchmarks

### Step 1: Determine Requirements
- What language feature/capability is being tested?
- Can models solve this with just the teaching prompt?
- What's the expected output?
### Step 2: Write the Benchmark

```yaml
id: my_new_benchmark
description: "Test feature X capability"
languages: ["python", "ailang"]
entrypoint: "main"
caps: ["IO"]
difficulty: "medium"
expected_gain: "medium"
task_prompt: |
  Write a program in <LANG> that:
  1. Clear description of task
  2. Another step
  3. Print the result
  Output only the code, no explanations.
expected_stdout: |
  exact expected output
```
### Step 3: Validate and Test

```bash
# Check for issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/my_new_benchmark.yml

# Test with a cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_new_benchmark
```
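Once the test run finishes, its artifacts land under `eval_results/` (see Notes). A quick, format-agnostic way to find the newest run:

```bash
# Newest files first; inspect whichever artifact the run just produced.
ls -lt eval_results/ | head
```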
## Debugging Failing Benchmarks

### Symptom: 0% Pass Rate Despite Language Support

**Check 1: Is it using `task_prompt:`?**

```bash
grep -E "^prompt:" benchmarks/failing_benchmark.yml
# If this returns a match, change it to task_prompt:
```
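If the grep matches, rename the field and re-validate. The sketch below assumes you edit the file by hand; the grep just confirms that only `task_prompt:` remains afterwards.

```bash
# After editing: expect exactly one task_prompt: and no bare prompt: lines.
grep -nE '^(task_)?prompt:' benchmarks/failing_benchmark.yml
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/failing_benchmark.yml
```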
**Check 2: What prompt do models see?**

```bash
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh failing_benchmark
```
**Check 3: Is the teaching prompt up to date?**

```bash
# After editing prompts/v0.x.x.md, you MUST rebuild:
make quick-install
```
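After rebuilding, re-run just the failing benchmark against one cheap model to confirm the updated teaching prompt is actually being used. This is a sketch of that loop, using only the commands shown elsewhere in this skill.

```bash
make quick-install
ailang eval-suite --models claude-haiku-4-5 --benchmarks failing_benchmark
```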
### Symptom: Models Copy the Template Instead of Solving the Task

The teaching prompt includes a template structure. If models copy it verbatim:

- Make sure the task is clearly different from the examples in the teaching prompt
- Check that `task_prompt` explicitly describes what to do
- Consider whether the task description is ambiguous
### Symptom: `compile_error` on Valid Syntax

Common AILANG-specific issues models get wrong:

| Wrong | Correct | Notes |
|---|---|---|
| `print(42)` | `print(show(42))` | `print` expects a string |
| `a % b` | `mod_Int(a, b)` | No `%` operator |
| `def main()` | `export func main()` | Wrong keyword |
| `for x in xs` | `match xs { ... }` | No `for` loops |
If models consistently make these mistakes, the teaching prompt needs improvement (use prompt-manager skill).
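A rough way to check whether the current teaching prompt even mentions the correct forms is to grep the combined prompt for them. This is only a heuristic, and it assumes the teaching prompt names these helpers literally.

```bash
# Does the prompt the model sees mention show(...) and mod_Int at all?
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse | grep -nE 'show\(|mod_Int'
```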
## Common Mistakes

### 1. Using `prompt:` Instead of `task_prompt:`

```yaml
# WRONG - models never see AILANG syntax
prompt: |
  Write code that...

# CORRECT - teaching prompt + task
task_prompt: |
  Write code that...
```
### 2. Forgetting to Rebuild After Prompt Changes

```bash
# After editing prompts/v0.x.x.md:
make quick-install  # REQUIRED!
```
### 3. Putting Hints in Benchmarks

```yaml
# WRONG - hints in the benchmark
task_prompt: |
  Write code that prints 42.
  Hint: Use print(show(42)) in AILANG.

# CORRECT - no hints; if models fail, fix the teaching prompt
task_prompt: |
  Write code that prints 42.
```

If models need AILANG-specific hints, the teaching prompt is incomplete. Use the prompt-manager skill to fix it.
### 4. Testing Too Many Models at Once

```bash
# WRONG - expensive and slow for debugging
ailang eval-suite --full --benchmarks my_test

# CORRECT - use one cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_test
```
## Resources

### Reference Guide

See `resources/reference.md` for:
- Complete list of valid benchmark fields
- Capability reference
- Example benchmarks
### Related Skills
- prompt-manager: When benchmark failures indicate teaching prompt issues
- eval-analyzer: For analyzing results across many benchmarks
- use-ailang: For writing correct AILANG code
## Notes

- Benchmarks live in the `benchmarks/` directory
- Eval results go to the `eval_results/` directory
- The teaching prompt is embedded in the binary - rebuild after changes
- Use the `<LANG>` placeholder in `task_prompt` - it's replaced with "AILANG" or "Python"
