
benchmark-manager

by sunholo-data

For humans, a language is a tool for expression. For AIs, it's a substrate for reasoning.

Last updated: Jan 22, 2026

SKILL.md


name: Benchmark Manager
description: Create and manage AILANG eval benchmarks. Use when user asks to create benchmarks, fix benchmark issues, debug failing benchmarks, or analyze benchmark results.

Benchmark Manager

Manage AILANG evaluation benchmarks with correct prompt integration, debugging workflows, and best practices learned from real benchmark failures.

Quick Start

Debugging a failing benchmark:

# 1. Show the full prompt that models see
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse

# 2. Test a benchmark with a specific model
ailang eval-suite --models claude-haiku-4-5 --benchmarks json_parse

# 3. Check benchmark YAML for common issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/json_parse.yml

When to Use This Skill

Invoke this skill when:

  • User asks to create a new benchmark
  • User asks to debug/fix a failing benchmark
  • User wants to understand why models generate wrong code
  • User asks about benchmark YAML format
  • Benchmarks show 0% pass rate despite language support

CRITICAL: prompt vs task_prompt

This is the most important concept for benchmark management.

The Problem (v0.4.8 Discovery)

Benchmarks have TWO different prompt fields with VERY different behavior:

Field        | Behavior                               | Use when
prompt:      | REPLACES the teaching prompt entirely  | Testing raw model capability (rare)
task_prompt: | APPENDS to the teaching prompt         | Normal benchmarks (99% of cases)

Why This Matters

# BAD - Model never sees AILANG syntax!
prompt: |
  Write a program that prints "Hello"

# GOOD - Model sees teaching prompt + task
task_prompt: |
  Write a program that prints "Hello"

With prompt:, models generate Python/pseudo-code because they never learn AILANG syntax.

How Prompts Combine

From internal/eval_harness/spec.go (lines 91-93):

fullPrompt := basePrompt  // Teaching prompt from prompts/v0.4.x.md
if s.TaskPrompt != "" {
    fullPrompt = fullPrompt + "\n\n## Task\n\n" + s.TaskPrompt
}

The teaching prompt teaches AILANG syntax; task_prompt adds the specific task.
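
For example, with task_prompt set to Write a program that prints "Hello", the combined prompt a model receives is shaped roughly like this (illustrative layout based on the snippet above, not literal harness output):

<entire teaching prompt from prompts/v0.4.x.md: AILANG syntax, examples, idioms>

## Task

Write a program that prints "Hello"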

Available Scripts

scripts/show_full_prompt.sh

Shows the complete prompt that models receive for a benchmark.

Usage:

.claude/skills/benchmark-manager/scripts/show_full_prompt.sh <benchmark_id>

# Example:
.claude/skills/benchmark-manager/scripts/show_full_prompt.sh json_parse

scripts/check_benchmark.sh

Validates a benchmark YAML file for common issues.

Usage:

.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/<name>.yml

Checks for:

  • Using prompt: instead of task_prompt: (warning)
  • Missing required fields
  • Invalid capability names
  • Syntax errors in YAML
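
If the bundled script isn't available, a rough hand-rolled approximation of the first two checks (a grep sketch; the field list is taken from the Benchmark YAML Format section below) looks like:

# Warn if the benchmark replaces the teaching prompt
grep -nE "^prompt:" benchmarks/json_parse.yml && echo "WARNING: use task_prompt:, not prompt:"

# Flag missing required fields
for field in id description languages entrypoint caps difficulty expected_gain task_prompt expected_stdout; do
  grep -q "^${field}:" benchmarks/json_parse.yml || echo "missing field: ${field}"
done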

scripts/test_benchmark.sh

Runs a quick single-model test of a benchmark.

Usage:

.claude/skills/benchmark-manager/scripts/test_benchmark.sh <benchmark_id> [model]

# Examples:
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse
.claude/skills/benchmark-manager/scripts/test_benchmark.sh json_parse claude-haiku-4-5

Benchmark YAML Format

Required Fields

id: my_benchmark              # Unique identifier (snake_case)
description: "Short description of what this tests"
languages: ["python", "ailang"]
entrypoint: "main"            # Function to call
caps: ["IO"]                  # Required capabilities
difficulty: "easy|medium|hard"
expected_gain: "low|medium|high"
task_prompt: |                # ALWAYS use task_prompt, not prompt!
  Write a program in <LANG> that:
  1. Does something
  2. Prints the result

  Output only the code, no explanations.
expected_stdout: |            # Exact expected output
  expected output here

Capability Names

Valid capabilities: IO, FS, Clock, Net

# Console I/O (printing)
caps: ["IO"]

# HTTP requests
caps: ["Net", "IO"]

# File system operations
caps: ["FS", "IO"]

Creating New Benchmarks

Step 1: Determine Requirements

  • What language feature/capability is being tested?
  • Can models solve this with just the teaching prompt?
  • What's the expected output?

Step 2: Write the Benchmark

id: my_new_benchmark
description: "Test feature X capability"
languages: ["python", "ailang"]
entrypoint: "main"
caps: ["IO"]
difficulty: "medium"
expected_gain: "medium"
task_prompt: |
  Write a program in <LANG> that:
  1. Clear description of task
  2. Another step
  3. Print the result

  Output only the code, no explanations.
expected_stdout: |
  exact expected output

Step 3: Validate and Test

# Check for issues
.claude/skills/benchmark-manager/scripts/check_benchmark.sh benchmarks/my_new_benchmark.yml

# Test with cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_new_benchmark

Debugging Failing Benchmarks

Symptom: 0% Pass Rate Despite Language Support

Check 1: Is it using task_prompt:?

grep -E "^prompt:" benchmarks/failing_benchmark.yml
# If this returns a match, change to task_prompt:
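
If it does match, renaming the field is usually enough. A minimal sketch (review the diff before re-running the suite):

# GNU sed shown; on macOS use: sed -i '' 's/^prompt:/task_prompt:/' <file>
sed -i 's/^prompt:/task_prompt:/' benchmarks/failing_benchmark.yml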

Check 2: What prompt do models see?

.claude/skills/benchmark-manager/scripts/show_full_prompt.sh failing_benchmark

Check 3: Is the teaching prompt up to date?

# After editing prompts/v0.x.x.md, you MUST rebuild:
make quick-install

Symptom: Models Copy Template Instead of Solving Task

The teaching prompt includes a template structure. If models copy it verbatim, work through this checklist (a contrasting example follows it):

  1. Make sure the task is clearly different from the examples in the teaching prompt
  2. Check that task_prompt explicitly describes what to do
  3. Consider whether the task description is ambiguous
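
A contrasting pair of task prompts (hypothetical task; the wording is illustrative, not from the suite):

# Too close to the teaching-prompt template - easy to satisfy by copying it
task_prompt: |
  Write a program in <LANG> that prints a greeting.

# Specific enough that the model must actually solve the task
task_prompt: |
  Write a program in <LANG> that:
  1. Computes the sum of the integers 1 through 10
  2. Prints only that sum

  Output only the code, no explanations.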

Symptom: compile_error on Valid Syntax

Common AILANG-specific issues models get wrong:

Wrong        | Correct            | Notes
print(42)    | print(show(42))    | print expects a string
a % b        | mod_Int(a, b)      | No % operator
def main()   | export func main() | Wrong keyword
for x in xs  | match xs { ... }   | No for loops

If models consistently make these mistakes, the teaching prompt needs improvement (use the prompt-manager skill).

Common Mistakes

1. Using prompt: Instead of task_prompt:

# WRONG - Models never see AILANG syntax
prompt: |
  Write code that...

# CORRECT - Teaching prompt + task
task_prompt: |
  Write code that...

2. Forgetting to Rebuild After Prompt Changes

# After editing prompts/v0.x.x.md:
make quick-install  # REQUIRED!

3. Putting Hints in Benchmarks

# WRONG - Hints in benchmark
task_prompt: |
  Write code that prints 42.
  Hint: Use print(show(42)) in AILANG.

# CORRECT - No hints; if models fail, fix the teaching prompt
task_prompt: |
  Write code that prints 42.

If models need AILANG-specific hints, the teaching prompt is incomplete. Use the prompt-manager skill to fix it.

4. Testing Too Many Models at Once

# WRONG - Expensive and slow for debugging
ailang eval-suite --full --benchmarks my_test

# CORRECT - Use one cheap model first
ailang eval-suite --models claude-haiku-4-5 --benchmarks my_test

Resources

Reference Guide

See resources/reference.md for:

  • Complete list of valid benchmark fields
  • Capability reference
  • Example benchmarks

Related Skills

  • prompt-manager: When benchmark failures indicate teaching prompt issues
  • eval-analyzer: For analyzing results across many benchmarks
  • use-ailang: For writing correct AILANG code

Notes

  • Benchmarks live in the benchmarks/ directory
  • Eval results go to the eval_results/ directory
  • The teaching prompt is embedded in the binary - rebuild (make quick-install) after changes
  • Use the <LANG> placeholder in task_prompt - it's replaced with "AILANG" or "Python"
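
For example, a benchmark fragment like the following (hypothetical) is expanded per language under test:

task_prompt: |
  Write a program in <LANG> that prints 42
# Sent to models as "Write a program in AILANG that prints 42"
# or "Write a program in Python that prints 42", depending on the languages entry being evaluated.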
