---
name: prompt-tuner
description: Improve embedded LLM system prompt based on evaluation test failures
---

Prompt Tuner

Iteratively improve the embedded backend's system prompt to increase command generation accuracy.

When to Use

  • "tune the prompt"
  • "improve embedded accuracy"
  • "fix LLM command generation"
  • "run prompt tuning cycle"
  • After evaluation tests show low accuracy

Workflow

Phase 1: Baseline Measurement

Run evaluation tests with embedded backend:

./target/release/caro test --backend embedded

Record (one way to track these across iterations is sketched after the list):

  • Overall accuracy percentage
  • Category breakdown (Website Claim, Natural Variant, Edge Case)
  • List of failed test cases with expected vs actual commands
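
A small record per run keeps these numbers comparable across iterations. A minimal sketch; caro's actual `test` output format may differ, and the type and field names here are hypothetical:

```rust
// Hypothetical bookkeeping types for one tuning iteration; caro's
// actual `test` output format may differ.
struct Baseline {
    overall_accuracy: f64,            // e.g. 0.62 for 62%
    per_category: Vec<(String, f64)>, // ("Website Claim", 0.70), ...
    failures: Vec<Failure>,
}

struct Failure {
    task: String,     // the natural-language input
    expected: String, // command the test expects
    actual: String,   // command the model produced
}

fn main() {
    let run = Baseline {
        overall_accuracy: 0.62, // 62% on the baseline run
        per_category: vec![("Edge Case".to_string(), 0.40)],
        failures: Vec::new(),
    };
    println!("baseline accuracy: {:.0}%", run.overall_accuracy * 100.0);
}
```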

Phase 2: Failure Analysis

For each failed test case, identify the pattern:

| Pattern | Example | Fix |
| --- | --- | --- |
| Wrong path | `find /` instead of `find .` | Add rule: "ALWAYS use current directory '.'" |
| GNU flags | `--max-depth` on macOS | Add rule: "Use BSD-compatible flags" |
| Missing filters | No `-name "*.py"` | Add rule: "Include ALL relevant filters" |
| Time semantics | `-mtime -1` vs `-mtime 1` | Add clear mtime documentation |
| Quote style | Single vs double quotes | Usually equivalent, low priority |
| Flag order | `-type f -name` vs `-name -type f` | Usually equivalent, low priority |

Phase 3: Prompt Improvement

Edit the system prompt in:

src/backends/embedded/embedded_backend.rs

Function: create_system_prompt()

Improvement strategies (a prompt sketch follows this list):

  1. Add explicit rules for common failure patterns
  2. Add concrete examples matching test cases
  3. Use numbered rules with clear priorities
  4. Keep the prompt concise (LLMs perform worse with verbose prompts)
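
For shape, here is a minimal sketch of what such a function can look like, with numbered rules targeting the failure patterns from Phase 2. The rule text is illustrative, not the repository's actual prompt:

```rust
// Illustrative only; the real create_system_prompt() in
// src/backends/embedded/embedded_backend.rs will differ.
fn create_system_prompt() -> String {
    r#"You translate a natural-language task into one safe POSIX command.
Rules (highest priority first):
1. ALWAYS use the current directory '.' unless a path is given.
2. Use BSD-compatible flags; macOS does not ship GNU coreutils.
3. Include ALL filters the task implies (e.g. -name "*.py").
4. -mtime -1 means "within the last 24 hours"; -mtime 1 means
   "between 24 and 48 hours ago".
Example:
Task: find Python files modified today
Command: find . -name "*.py" -mtime -1
Respond with JSON only."#
        .to_string()
}
```

Note how the mtime rule doubles as the "clear mtime documentation" fix from Phase 2, and the single short example follows strategy 4.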

Phase 4: Verification

Build and re-run tests:

cargo build --release
./target/release/caro test --backend embedded

Compare results:

  • Did overall accuracy improve?
  • Which categories improved/regressed?
  • Are remaining failures semantic equivalents (flag order, quotes)? A rough mechanical check is sketched below.
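
The last check can be done by eye or mechanically. A deliberately crude sketch, not part of caro, that treats quote style and flag order as insignificant; note that sorting tokens also discards flag-argument pairing, so anything it accepts still deserves a quick manual look:

```rust
// Crude heuristic, not part of caro: two commands compare equal if
// their tokens match after unifying quotes and ignoring order.
fn normalized(cmd: &str) -> Vec<String> {
    let mut tokens: Vec<String> = cmd
        .split_whitespace()
        .map(|t| t.replace('\'', "\"")) // unify quote style
        .collect();
    tokens.sort(); // ignore flag order (also loses flag-arg pairing)
    tokens
}

fn semantically_equal(expected: &str, actual: &str) -> bool {
    normalized(expected) == normalized(actual)
}

fn main() {
    assert!(semantically_equal(
        r#"find . -type f -name "*.py""#,
        r#"find . -name '*.py' -type f"#,
    ));
    println!("equivalent");
}
```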

Phase 5: Iterate or Commit

If accuracy improved significantly:

git add src/backends/embedded/embedded_backend.rs
git commit -m "feat(prompt): Improve embedded backend accuracy from X% to Y%"

If not improved or regressed:

  • Revert the changes (e.g. git restore src/backends/embedded/embedded_backend.rs)
  • Try a different approach
  • Consider whether the test expectations need adjustment

Success Metrics

| Level | Accuracy | Action |
| --- | --- | --- |
| Poor | < 50% | Major prompt rewrite needed |
| Acceptable | 50-70% | Targeted improvements |
| Good | 70-85% | Minor tuning |
| Excellent | > 85% | Consider semantic equivalence in remaining failures |

Tips

  1. One change at a time: Make small, focused improvements
  2. Check equivalents: Some "failures" are functionally equivalent commands
  3. Test edge cases: Simple prompts often break on edge cases
  4. BSD vs GNU: macOS uses BSD utilities, not GNU coreutils (e.g. GNU's du --max-depth=1 is du -d 1 on macOS)
  5. Keep examples minimal: Too many examples can confuse small models

Example Session

User: /prompt-tuner
