
eval-gap-finder

by sunholo-data

For humans, a language is a tool for expression. For AIs, it's a substrate for reasoning.

⭐ 16 · 🍴 2 · 📅 Jan 22, 2026

SKILL.md


---
name: Eval Gap Finder
description: Find AILANG vs Python eval gaps and improve prompts/language. Use when user says 'find eval gaps', 'analyze benchmark failures', 'close Python-AILANG gap', or after running evals.
---

Eval Gap Finder

Automates the process of finding and closing the gap between Python and AILANG benchmark success rates. Identifies language limitations, prompt gaps, and missing stdlib functions.

Quick Start

Most common usage:

# User says: "Find eval gaps" or "Analyze benchmark failures"
# This skill will:
# 1. Run evals with dev models (gemini-3-flash, claude-haiku-4-5)
# 2. Compare Python vs AILANG success rates
# 3. Identify benchmarks where Python passes but AILANG fails
# 4. Analyze error patterns and categorize them
# 5. Check if gaps are documented in prompt
# 6. Test proposed examples and add to prompt
# 7. Create design docs for language limitations

When to Use This Skill

Invoke this skill when:

  • User asks to "find eval gaps" or "close the Python-AILANG gap"
  • User wants to analyze benchmark failures
  • After running evals and seeing a lower AILANG success rate
  • User says "why is AILANG failing?" or "improve AILANG benchmarks"
  • User wants to identify language limitations

Available Scripts

scripts/run_gap_analysis.sh [eval_dir]

Run full gap analysis on eval results.

.claude/skills/eval-gap-finder/scripts/run_gap_analysis.sh eval_results/v0.6.5

scripts/identify_python_only.sh <eval_dir>

List benchmarks where Python passes but AILANG fails.

.claude/skills/eval-gap-finder/scripts/identify_python_only.sh eval_results/v0.6.5
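
A minimal sketch of the comparison this script performs, assuming one JSON result file per benchmark/language pair with a status field (a hypothetical schema; the real script's layout may differ):

# Hypothetical layout: <eval_dir>/<benchmark>_<lang>.json with a "status" field
for f in eval_results/v0.6.5/*_python.json; do
  bench=$(basename "$f" _python.json)
  py=$(jq -r '.status' "$f")
  ail=$(jq -r '.status' "eval_results/v0.6.5/${bench}_ailang.json" 2>/dev/null)
  [ "$py" = "pass" ] && [ "$ail" != "pass" ] && echo "$bench"
done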

scripts/categorize_errors.sh <eval_dir>

Categorize AILANG failures by error type.

.claude/skills/eval-gap-finder/scripts/categorize_errors.sh eval_results/v0.6.5
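
A rough sketch of the kind of counting involved, using the error codes from the tables below; the result-file layout here is an assumption:

# Count result files mentioning each known error category
for code in WRONG_LANG PAR_001 PAR_UNEXPECTED_TOKEN TC_ logic_error; do
  echo "$code: $(grep -rl "$code" eval_results/v0.6.5/ 2>/dev/null | wc -l)"
done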

scripts/test_example.sh <file.ail>

Test if an AILANG code example compiles and runs correctly.

.claude/skills/eval-gap-finder/scripts/test_example.sh /tmp/test.ail

Workflow

1. Run Evals with Dev Models

ailang eval-suite --models gemini-3-flash,claude-haiku-4-5 --output eval_results/gap-analysis

Tip: Run with cheap/fast models first; if they succeed, larger models should too.

2. Generate Summary and Identify Gaps

ailang eval-summary eval_results/gap-analysis
.claude/skills/eval-gap-finder/scripts/identify_python_only.sh eval_results/gap-analysis

Key metrics:

  • AILANG success rate (target: >70%)
  • Python success rate (baseline)
  • Gap (python_only benchmarks)
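
Assuming identify_python_only.sh prints one benchmark per line, the gap can be counted directly:

# Number of benchmarks where only Python passes (the gap)
.claude/skills/eval-gap-finder/scripts/identify_python_only.sh eval_results/gap-analysis | wc -l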

3. Analyze Error Patterns

For each Python-only pass, categorize the error:

| Category | Pattern | Fix Approach |
|----------|---------|--------------|
| WRONG_LANG | Model wrote Python syntax | Stronger "NOT Python" in prompt |
| PAR_001 | Parse errors (syntax) | Add more examples to prompt |
| Type errors | Type unification failures | May be a language limitation |
| Logic errors | Compiles but wrong output | Better examples or algorithm |
| EOF errors | Incomplete code generation | Model limitation, not prompt |

4. Check Prompt Coverage

For each gap, check if the pattern is documented:

grep -n "pattern" prompts/v0.6.5.md

If not documented, add:

  • Working example to Quick Reference section
  • Entry in "What AILANG Does NOT Have" table, if it's a limitation (see the sketch after this list)
  • New section if pattern is complex
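
An illustrative limitations-table entry; the feature and the column layout here are hypothetical, so match whatever the actual table in the prompt uses:

| Feature | Status | Workaround |
|---------|--------|------------|
| List comprehensions (hypothetical) | NOT supported | Use recursion or stdlib map/filter |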

5. Test Examples Before Adding

CRITICAL: Always test examples before adding to prompt!

cat > /tmp/test.ail << 'EOF'
module benchmark/solution
-- Your example code here
EOF
ailang run --caps IO --entry main /tmp/test.ail

If the example fails, it reveals a language gap; create a design doc instead.

6. Create Design Docs for Language Gaps

If testing reveals a language limitation:

  1. Create design doc: design_docs/planned/vX_Y_Z/m-<feature>.md
  2. Document:
    • Minimal reproduction
    • Error message
    • Workaround (for prompt)
    • Proposed fix
  3. Add workaround to prompt with note
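
A minimal skeleton for such a design doc, following the structure above (section names are illustrative; fill in the version and feature placeholders):

cat > design_docs/planned/vX_Y_Z/m-feature.md << 'EOF'
# <feature>: one-line summary

## Minimal Reproduction
-- smallest AILANG snippet that triggers the failure

## Error Message
(paste the compiler/runtime output here)

## Workaround (for prompt)
(pattern models can use today)

## Proposed Fix
(language or stdlib change)
EOF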

7. Track Improvement

After updates, re-run evals:

ailang eval-suite --models gemini-3-flash,claude-haiku-4-5 --output eval_results/gap-analysis-v2

Compare:

  • Success rate improvement
  • Which benchmarks fixed
  • Any regressions
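
One simple comparison, using only the commands above (assumes ailang eval-summary writes its report to stdout):

ailang eval-summary eval_results/gap-analysis > /tmp/summary-v1.txt
ailang eval-summary eval_results/gap-analysis-v2 > /tmp/summary-v2.txt
diff /tmp/summary-v1.txt /tmp/summary-v2.txt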

Error Categories Reference

| Error | Meaning | Fix |
|-------|---------|-----|
| WRONG_LANG | Wrote Python instead | Prompt emphasis |
| PAR_001 | Parser error | Syntax examples |
| PAR_UNEXPECTED_TOKEN | Wrong token | Syntax examples |
| TC_* | Type check error | Type examples or design doc |
| "undefined variable" | Missing import/letrec | Document pattern |
| EOF errors | Incomplete code | Model limitation |
| logic_error | Wrong output | Algorithm examples |

Resources

Gap Analysis Template

See resources/gap_analysis_template.md for structured analysis format.

Common Patterns

See resources/common_patterns.md for frequently encountered gaps.

Progressive Disclosure

This skill loads information progressively:

  1. Always loaded: This SKILL.md file (workflow overview)
  2. Execute as needed: Scripts in scripts/ directory
  3. Load on demand: Resources for templates and patterns

Notes

  • Always test examples before adding to prompt
  • Prefer fixing language over prompt workarounds
  • Track improvements with before/after eval runs
  • Create design docs for language limitations
  • Update prompt hash in versions.json after changes (see the sketch below)
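
If the hash is a SHA-256 digest of the prompt file (an assumption; check the repo's actual convention in versions.json), it can be regenerated with:

# Recompute the prompt hash (assumed SHA-256; verify against versions.json)
shasum -a 256 prompts/v0.6.5.md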

Score

Total Score: 65/100

Based on repository quality metrics:

| Criterion | Requirement | Points |
|-----------|-------------|--------|
| SKILL.md | Contains a SKILL.md file | +20 |
| LICENSE | A license is set | +10 |
| Description | Description of 100+ characters | 0/10 |
| Popularity | 100+ GitHub stars | 0/15 |
| Recent activity | Updated within the last month | +10 |
| Forks | Forked 10+ times | 0/5 |
| Issue management | Fewer than 50 open issues | +5 |
| Language | A programming language is set | +5 |
| Tags | At least one tag is set | +5 |
