
eval-analyzer

by sunholo-data

For humans, a language is a tool for expression. For AIs, it's a substrate for reasoning.


SKILL.md


name: Eval Analyzer
description: Identify AILANG language gaps from agent struggles, analyze eval baselines, and generate actionable insights. PRIMARY PURPOSE is finding what stdlib/prompt improvements would help agents succeed. Use when analyzing eval results, checking benchmarks, or investigating failures.

Eval Analyzer

Primary goal: Identify AILANG language gaps from what agents struggle with → drives stdlib additions and prompt improvements.

Secondary: Analyze eval baseline results, compare model performance, track success rates.

Quick Start

Language gap analysis (most valuable):

# Find what stdlib/prompt improvements would help agents
.claude/skills/eval-analyzer/scripts/find_language_gaps.sh eval_results/baselines/v0.6.2

# Output shows:
# - Functions agents searched for but couldn't find
# - Undefined variable errors (hallucinated functions)
# - Type confusion patterns
# - Benchmarks with stuck loops (high turn count)
# - Mapping of hallucinated names to actual builtins

Standard analysis:

# User says: "Analyze the v0.3.24 eval results"
# This skill will:
# 1. Run find_language_gaps.sh to identify AILANG improvements needed
# 2. Run eval-analyze to categorize failures
# 3. Run agent KPIs to analyze efficiency
# 4. Identify top failing benchmarks
# 5. Generate actionable recommendations

For agent evaluation analysis (NEW - optimization focus):

# Step 1: Get efficiency metrics (turns, tokens, cost)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/v0.3.24

# Step 2: Investigate expensive benchmarks
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/baselines/v0.3.24 simple_print

# Step 3: Compare Python vs AILANG
./tools/compare_agents.sh eval_results/baselines/v0.3.24

See resources/agent_optimization_guide.md for complete optimization strategies.

Language Gap Analysis (PRIMARY GOAL)

The most valuable output of eval analysis is identifying AILANG language gaps: what agents struggle with reveals missing stdlib functions, undocumented features, or prompt gaps.

Why This Matters

When an agent fails after many turns, the transcript reveals what it was trying to do:

[config_file_parser - 22 turns]
Turn 14: "Perfect! There's floatToInt in std/prelude" -- HALLUCINATED!
Turn 15: "It looks like floatToInt is a builtin, not from a module"
Turn 18: "Let me add a floatToInt using basic arithmetic" -- broken workaround

Insight: Agent knows what it needs (floatToInt) but can't find it → stdlib gap.

Language Gap Workflow

# Step 1: Find stuck loops - agents searching for functions
cat eval_results/baselines/v0.6.2/agent/*_ailang_*.json | \
  jq -r 'select(.stdout_ok == false) | .stderr' | \
  grep -i "let me check\|what function\|is available\|undefined variable" | head -20

# Step 2: Check if builtin exists for hallucinated function
ailang builtins list | grep -i "float\|int"

# Step 3: Check if stdlib wrapper exists
grep -i "floatToInt" std/*.ail

# Step 4: If builtin exists but wrapper doesn't → Add stdlib wrapper
# Step 5: If wrapper exists but agent didn't know → Update prompt
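Steps 2-3 can be batched when a run surfaces several suspect names. A minimal sketch using the same checks as above (the function names below are illustrative, not taken from a real run):

# For each hallucinated function name, check whether a similarly-named builtin
# exists and whether a stdlib wrapper is already defined in std/
for fn in floatToInt intToFloat stringSlice; do
  echo "== $fn =="
  ailang builtins list | grep -i "$fn" || echo "  no matching builtin"
  grep -il "$fn" std/*.ail || echo "  no stdlib wrapper found"
done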

Gap Pattern Categories

| Agent Behavior | Gap Type | Fix |
|----------------|----------|-----|
| "Let me check what X is available" then fails | Missing stdlib wrapper | Add wrapper to std/ |
| Uses function that exists but wrong name | Undocumented | Document in prompt |
| Tries Python syntax in AILANG | Prompt gap | Add AILANG examples |
| 10+ turns on same type error | Type confusion | Add type examples to prompt |
| undefined variable: floatToInt | Missing wrapper | Add floatToInt = _float_to_int |

Example Gap Report

After analysis, produce actionable output:

## Missing Wrappers (builtin exists, wrapper doesn't)
| Function | Builtin | Add to | Impact |
|----------|---------|--------|--------|
| floatToInt | _float_to_int | std/math | 3 benchmarks |
| intToFloat | _int_to_float | std/math | 2 benchmarks |

## Undocumented (exists but agents don't know)
| Function | Module | Agents looked for |
|----------|--------|-------------------|
| substring | std/string | stringSlice, slice |
| contains | std/string | includes, has |

## Prompt Gaps (syntax confusion)
| Issue | Example | Add to prompt |
|-------|---------|---------------|
| Json vs string | `get(str, key)` fails | Json type examples |
| String not list | `match s { [a,b] => }` | String handling section |

See design_docs/planned/v0_6_5/m-eval-gap-analysis.md for full analysis example.

When to Use This Skill

Invoke this skill when:

  • User asks to "analyze eval results", "check benchmarks", "what's failing"
  • After running an eval baseline
  • When investigating why benchmark performance changed
  • User wants to understand failure patterns or model performance
  • Comparing two versions of AILANG
  • Identifying AILANG language gaps from what agents struggle with

Key Eval Commands

All commands work on baseline directories like eval_results/baselines/v0.3.16/.

1. Quick Overview - eval-matrix

Shows comprehensive statistics with model/language breakdowns.

ailang eval-matrix eval_results/baselines/v0.3.16 0.3.16 | head -60

Shows: Overall stats, per-model performance, per-language breakdown, top error codes.

2. Detailed Analysis - eval-analyze

Categorizes failures and can generate design docs for issues.

# Dry run (no design docs, just analysis)
ailang eval-analyze -results eval_results/baselines/v0.3.16 -dry-run

# Full analysis with design doc generation
ailang eval-analyze -results eval_results/baselines/v0.3.16

⚠️ CRITICAL: Must use -results flag, NOT positional argument!

Output: Categorized failures (compile_error, logic_error, runtime_error) with frequency, affected benchmarks, models, and sample errors.

3. Query-Friendly Summary - eval-summary

Generates JSONL for easy querying with jq.

ailang eval-summary eval_results/baselines/v0.3.16

Output: eval_results/baselines/v0.3.16/summary.jsonl
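If you are unsure which fields each record carries, peek at the first record (a quick check; assumes the summary has just been generated as above):

# List the field names in the first summary record
head -1 eval_results/baselines/v0.3.16/summary.jsonl | jq 'keys'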

4. Compare Versions - eval-compare

Shows what changed between two versions.

ailang eval-compare eval_results/baselines/v0.3.15 eval_results/baselines/v0.3.16

5. Fair Comparison - fair_comparison.py

Use this for accurate version comparisons: eval-compare may include duplicate runs or different model sets, while this script normalizes the data for an apples-to-apples comparison.

.claude/skills/eval-analyzer/scripts/fair_comparison.py

What it does:

  • Deduplicates runs (keeps last run per benchmark+model)
  • Filters to dev models only (gpt5-mini, claude-haiku-4-5, gemini-2-5-flash)
  • AILANG only (ignores Python results)
  • Shows net fixes vs regressions
  • Per-model breakdown

Output:

v0.4.0: 56/123 = 45.5%
v0.4.2: 59/123 = 48.0%
Delta:  +3 (+2.4pp)

✅ Fixed:   11 benchmarks
❌ Broken:  8 benchmarks
NET:       +3 benchmarks

When to use: Before making decisions based on eval results (e.g., reverting changes, merging PRs).

6. Validate Results - validate_eval_results.py

Check for output corruption and race conditions in eval results.

python3 tools/validate_eval_results.py eval_results/baselines/v0.4.2

Checks:

  • Output corruption (fibonacci outputting "All results equal", etc.)
  • Duplicate runs for same benchmark+model
  • Code hash validation (if available)
  • Success rate statistics

When to use: After running eval baselines, especially if results look suspicious.
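For a quick manual duplicate check without the script, the summary JSONL can be grouped by benchmark, model, and language (a sketch that assumes eval-summary has been run for this baseline):

# List benchmark+model+language combinations that appear more than once
jq -s 'group_by(.benchmark + "|" + .model + "|" + .lang)
  | map(select(length > 1)
        | {benchmark: .[0].benchmark, model: .[0].model, lang: .[0].lang, runs: length})' \
  eval_results/baselines/v0.4.2/summary.jsonl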

Agent Analysis Scripts (NEW!)

For agent-based evaluation results (Python vs AILANG comparisons with Claude Code):

1. Agent KPIs - Minimize Tokens & Turns

Shows efficiency metrics for agent runs - key for optimizing language and prompts.

.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/WITH_ALL_FIXES

Output:

  • Average turns, tokens, cost by language (Python vs AILANG)
  • Most expensive benchmarks (by turns) - candidates for optimization
  • Most efficient benchmarks - learn from these
  • Success rates and performance comparison

Goal: Minimize agent turns and tokens → indicates clearer prompts and simpler language.

2. Agent Transcripts - View AILANG Conversations

View full agent conversation logs to understand what happened.

# View all transcripts
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/WITH_ALL_FIXES

# View only failures
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/WITH_ALL_FIXES --failed-only

# View specific benchmark
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/WITH_ALL_FIXES fizzbuzz

Output:

  • Turn-by-turn conversation showing agent's thought process
  • Metrics: turns, tokens, duration
  • Success/failure status with error category
  • First 100 lines of transcript (with hint to view full)

Use for: Understanding why AILANG solutions fail or take many turns.
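To decide which benchmark to inspect, it can help to list the failed AILANG runs first. A minimal sketch, assuming each agent result file is a single JSON object with a top-level stdout_ok field (as in the gap-analysis queries above):

# Print the filenames of failed AILANG agent runs
jq -r 'select(.stdout_ok == false) | input_filename' \
  eval_results/WITH_ALL_FIXES/agent/*_ailang_*.json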

3. Python vs AILANG Comparison

Use the existing tools/compare_agents.sh script for side-by-side comparison:

./tools/compare_agents.sh eval_results/WITH_ALL_FIXES

Output:

  • Side-by-side metrics table
  • Solution code comparison
  • Transcripts for failed solutions (automatic)
  • Winner indicators for each metric

Standard Eval Workflow (Non-Agent)

Step 1: Get High-Level Overview

# Show overall statistics
ailang eval-matrix eval_results/baselines/v0.3.16 0.3.16 | head -60

Look for:

  • Overall success rate (target: >60%)
  • AILANG vs Python gap (current: ~54%)
  • Model performance variance
  • Top error codes

Step 2: Identify Problem Areas

# Categorize all failures
ailang eval-analyze -results eval_results/baselines/v0.3.16 -dry-run

Key metrics:

  • compile_error frequency (parse/syntax issues)
  • logic_error frequency (wrong output)
  • runtime_error frequency (crashes)
  • Which benchmarks fail most
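These counts can also be pulled straight from summary.jsonl (a sketch that assumes eval-summary has been run and that the error_category field holds the categories listed above):

# Failure counts by error category (AILANG only)
jq -s 'map(select(.lang == "ailang" and .stdout_ok == false))
  | group_by(.error_category)
  | map({category: .[0].error_category, count: length})
  | sort_by(-.count)' \
  eval_results/baselines/v0.3.16/summary.jsonl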

Step 3: Query with jq (Custom Analysis)

Use jq queries on summary.jsonl for custom analysis:

# Ensure summary exists
ailang eval-summary eval_results/baselines/v0.3.20

# AILANG-only success rate (all models)
jq -s 'map(select(.lang == "ailang")) |
  {total: length, success: (map(select(.stdout_ok == true)) | length),
   rate: ((map(select(.stdout_ok == true)) | length) * 100.0 / length)}' \
  eval_results/baselines/v0.3.20/summary.jsonl

# Dev models only (useful for prompt testing)
jq -s 'map(select(.lang == "ailang" and
  (.model == "gpt5-mini" or .model == "claude-haiku-4-5" or .model == "gemini-2-5-flash"))) |
  {total: length, success: (map(select(.stdout_ok == true)) | length),
   rate: ((map(select(.stdout_ok == true)) | length) * 100.0 / length)}' \
  eval_results/baselines/v0.3.20/summary.jsonl

# Check specific benchmark across all models
jq -s 'map(select(.benchmark == "explicit_state_threading" and .lang == "ailang")) |
  map({model, success: .stdout_ok, error: .error_category})' \
  eval_results/baselines/v0.3.20/summary.jsonl

# Compare two versions (dev models, AILANG-only) - run the query once per file;
# passing both files to a single jq -s would merge them into one aggregate
for f in eval_results/baselines/v0.3.20/summary.jsonl \
         eval_results/baselines/v0.3.21/summary.jsonl; do
  echo "== $f =="
  jq -s 'map(select(.lang == "ailang" and
    (.model == "gpt5-mini" or .model == "claude-haiku-4-5" or .model == "gemini-2-5-flash"))) |
    {total: length, success: (map(select(.stdout_ok == true)) | length),
     rate: ((map(select(.stdout_ok == true)) | length) * 100.0 / length)}' "$f"
done

For more jq patterns, see resources/jq_queries.md

Step 4: Deep Dive with Helper Scripts

Use the provided helper scripts for detailed code inspection:

# Failure analysis with error categorization
.claude/skills/eval-analyzer/scripts/analyze_failures.sh eval_results/baselines/v0.3.16

# Model performance comparison
.claude/skills/eval-analyzer/scripts/compare_models.sh eval_results/baselines/v0.3.16

# Examine specific benchmark failures
.claude/skills/eval-analyzer/scripts/examine_code.sh eval_results/baselines/v0.3.16 api_call_json

Step 5: Compare with Previous Version

# Show regressions and improvements
ailang eval-compare eval_results/baselines/v0.3.15 eval_results/baselines/v0.3.16

Step 6: Generate Insights

Based on the data, identify:

  1. Systemic Issues: Categories with >50 failures
  2. Model Patterns: Which models struggle with which features
  3. Benchmark Hotspots: Benchmarks with 100% failure rate (see the query after this list)
  4. Cost Efficiency: Which models give best success/cost ratio
  5. Trends: Improvements or regressions vs previous version
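For point 3, benchmark hotspots can be listed directly from summary.jsonl (a sketch assuming eval-summary has been run):

# AILANG benchmarks where every run failed
jq -s 'map(select(.lang == "ailang"))
  | group_by(.benchmark)
  | map({benchmark: .[0].benchmark, runs: length,
         failed: (map(select(.stdout_ok == false)) | length)})
  | map(select(.failed == .runs))' \
  eval_results/baselines/v0.3.16/summary.jsonl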

Key Metrics to Track

  1. Overall Success Rate: AILANG vs Python gap (target: reduce below 50%)
  2. Error Code Distribution:
    • PAR_001 (parse errors) - indicates prompt/syntax issues
    • WRONG_LANG - models writing Python instead of AILANG
    • IMPERATIVE - models using imperative patterns
  3. Model Performance: Which models work best with AILANG
  4. Benchmark-Level: Which benchmarks consistently fail
  5. Cost Efficiency: Success rate per dollar spent
  6. Repair Success: Is self-repair helping? (currently low)

Common Issues

Issue 1: "Total Runs: 6" instead of 408

Symptom: eval-analyze only finds 6 results

Cause: Used positional argument instead of -results flag

Solution:

# ❌ WRONG
ailang eval-analyze eval_results/baselines/v0.3.16

# ✅ CORRECT
ailang eval-analyze -results eval_results/baselines/v0.3.16

Issue 2: Summary file not found

Symptom: jq queries fail with "file not found"

Cause: Need to run eval-summary first

Solution:

ailang eval-summary eval_results/baselines/v0.3.16

Issue 3: Design docs not generated

Symptom: eval-analyze shows issues but doesn't create docs

Cause: Using -dry-run flag

Solution: Run without -dry-run to generate design docs

Helper Scripts

The skill includes helper scripts in the scripts/ directory:

quick_summary.sh

Fast overview using eval-matrix.

.claude/skills/eval-analyzer/scripts/quick_summary.sh eval_results/baselines/v0.3.16

Output: Overall stats, model performance, language breakdown, top error codes.

analyze_failures.sh

Detailed failure analysis with error categorization.

.claude/skills/eval-analyzer/scripts/analyze_failures.sh eval_results/baselines/v0.3.16 ailang

Output: Overall statistics, error categories, top failing benchmarks, model performance, error codes.

compare_models.sh

Model-by-model performance comparison.

.claude/skills/eval-analyzer/scripts/compare_models.sh eval_results/baselines/v0.3.16

Output: Success rates, first-attempt vs final, cost analysis, token usage, best model per benchmark.

examine_code.sh

Inspect generated code from specific benchmarks.

.claude/skills/eval-analyzer/scripts/examine_code.sh eval_results/baselines/v0.3.16 api_call_json
.claude/skills/eval-analyzer/scripts/examine_code.sh eval_results/baselines/v0.3.16 api_call_json gpt5

Output: Generated code, compiler errors, success status, error codes for each model run.

examine_prompts.sh

View prompts used for specific benchmarks.

.claude/skills/eval-analyzer/scripts/examine_prompts.sh eval_results/baselines/v0.3.16 api_call_json

Output: System prompt, user prompt, success status for benchmark runs.

verify_prompt_accuracy.sh

Check if prompt documentation matches actual implementation.

.claude/skills/eval-analyzer/scripts/verify_prompt_accuracy.sh v0.3.16

Output: Reports false limitations, undocumented features, and prompt-code mismatches.

Use this: After creating new prompt versions to catch documentation bugs!

Resources

Analysis Documents

Common jq Patterns

See resources/jq_queries.md for more query examples and patterns.

Progressive Disclosure

This skill loads information progressively:

  1. Always loaded: This SKILL.md file (workflow + commands + scripts)
  2. Execute as needed: ailang eval-* commands and helper scripts
  3. Load on demand: resources/jq_queries.md, analysis documents

Notes

  • All eval commands work offline (no API calls for analysis)
  • eval-analyze generates design docs using LLM (default: gpt5)
  • Summary JSONL format is stable and queryable
  • Use -dry-run to preview before generating design docs
  • baseline directories typically at eval_results/baselines/vX.X.X/
  • This skill complements post-release skill (which runs baselines)
