
---
name: prompt-tuner
description: Improve embedded LLM system prompt based on evaluation test failures
---
# Prompt Tuner
Iteratively improve the embedded backend's system prompt to increase command generation accuracy.
## When to Use
- "tune the prompt"
- "improve embedded accuracy"
- "fix LLM command generation"
- "run prompt tuning cycle"
- After evaluation tests show low accuracy
## Workflow
### Phase 1: Baseline Measurement
Run evaluation tests with embedded backend:
```bash
./target/release/caro test --backend embedded
```
Record:
- Overall accuracy percentage
- Category breakdown (Website Claim, Natural Variant, Edge Case)
- List of failed test cases with expected vs actual commands
### Phase 2: Failure Analysis
For each failed test case, identify which pattern it matches (a tallying sketch follows the table):
| Pattern | Example | Fix |
|---|---|---|
| Wrong path | `find /` instead of `find .` | Add rule: "ALWAYS use current directory `.`" |
| GNU flags | `--max-depth` on macOS | Add rule: "Use BSD-compatible flags" |
| Missing filters | No `-name "*.py"` | Add rule: "Include ALL relevant filters" |
| Time semantics | `-mtime -1` vs `-mtime 1` | Add clear mtime documentation |
| Quote style | Single vs double quotes | Usually equivalent; low priority |
| Flag order | `-type f -name` vs `-name -type f` | Usually equivalent; low priority |
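To keep the analysis systematic, tally failures by pattern before touching the prompt. Below is a minimal sketch; the `FailurePattern` enum and the string heuristics are hypothetical analysis aids, not part of caro's test harness:

```rust
use std::collections::HashMap;

/// Hypothetical analysis aid; caro's test harness does not expose this type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum FailurePattern {
    WrongPath,      // searched `/` instead of `.`
    GnuFlags,       // GNU-only option on macOS
    MissingFilters, // dropped a filter such as -name
    TimeSemantics,  // -mtime -1 vs -mtime 1 confusion
    CosmeticOnly,   // quote style or flag order
}

/// Crude heuristics mapping a failed (expected, actual) pair to a pattern.
fn classify(expected: &str, actual: &str) -> FailurePattern {
    if expected.contains("find .") && actual.contains("find /") {
        FailurePattern::WrongPath
    } else if actual.contains("--max-depth") {
        FailurePattern::GnuFlags
    } else if expected.contains("-name") && !actual.contains("-name") {
        FailurePattern::MissingFilters
    } else if expected.contains("-mtime") && actual.contains("-mtime") {
        FailurePattern::TimeSemantics
    } else {
        FailurePattern::CosmeticOnly
    }
}

fn main() {
    let failures = [
        ("find . -name \"*.py\"", "find / -name \"*.py\""),
        ("find . -type f -name \"*.py\"", "find . -name '*.py' -type f"),
    ];
    let mut tally: HashMap<FailurePattern, usize> = HashMap::new();
    for (expected, actual) in failures {
        *tally.entry(classify(expected, actual)).or_default() += 1;
    }
    println!("{tally:?}");
}
```

Fixing the most frequent pattern first usually buys the largest accuracy gain per prompt change.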
### Phase 3: Prompt Improvement
Edit the system prompt in `src/backends/embedded/embedded_backend.rs`, in the `create_system_prompt()` function.
Improvement strategies (see the sketch after this list):
- Add explicit rules for common failure patterns
- Add concrete examples matching test cases
- Use numbered rules with clear priorities
- Keep the prompt concise (small models perform worse with verbose prompts)
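For shape only, here is a minimal sketch of the numbered-rule style these strategies suggest. It assumes `create_system_prompt()` returns a `String`; the rule text and the `{"cmd": ...}` JSON key are illustrative assumptions, not caro's actual prompt or output contract:

```rust
// Illustrative sketch only; the real create_system_prompt() in
// src/backends/embedded/embedded_backend.rs will differ.
fn create_system_prompt() -> String {
    r#"You translate one task into one safe POSIX command. Output JSON only.
Rules, highest priority first:
1. ALWAYS search from the current directory '.', never '/'.
2. Use BSD-compatible flags (macOS); avoid GNU-only options like --max-depth.
3. Include ALL filters the task implies, e.g. -name "*.py".
4. -mtime -1 means "within the last 24 hours"; -mtime 1 means
   "between 24 and 48 hours ago".
Example:
Task: find python files changed today
{"cmd": "find . -name \"*.py\" -mtime -1"}"#
        .to_string()
}
```

Note that it stops at a single example: as the Tips below point out, too many examples can confuse small models.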
### Phase 4: Verification
Build and re-run tests:
```bash
cargo build --release
./target/release/caro test --backend embedded
```
Compare results:
- Did overall accuracy improve?
- Which categories improved/regressed?
- Are remaining failures semantic equivalents (flag order, quotes)? See the triage sketch below.
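A rough normalizer can separate cosmetic diffs (quotes, flag order) from semantic ones before you count something as a real failure. This is a hypothetical triage helper, not part of caro, and deliberately blunt; it also erases orderings that sometimes matter, so use it to triage, never to auto-pass tests:

```rust
/// Hypothetical triage helper, not part of caro: normalizes a command so
/// that quote style and flag order compare equal.
fn normalize(cmd: &str) -> String {
    // Treat single and double quotes as interchangeable.
    let unified = cmd.replace('\'', "\"");
    // Compare the multiset of tokens after the program name, so that
    // `-type f -name x` and `-name x -type f` come out equal.
    let mut tokens: Vec<&str> = unified.split_whitespace().collect();
    if tokens.len() > 1 {
        tokens[1..].sort_unstable();
    }
    tokens.join(" ")
}

fn main() {
    let expected = r#"find . -type f -name "*.py""#;
    let actual = "find . -name '*.py' -type f";
    assert_eq!(normalize(expected), normalize(actual)); // cosmetic only
}
```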
### Phase 5: Iterate or Commit
If accuracy improved significantly:
```bash
git add src/backends/embedded/embedded_backend.rs
git commit -m "feat(prompt): Improve embedded backend accuracy from X% to Y%"
```
If accuracy did not improve, or regressed:
- Revert the changes
- Try a different approach
- Consider whether the test expectations need adjustment
## Success Metrics
| Level | Accuracy | Action |
|---|---|---|
| Poor | < 50% | Major prompt rewrite needed |
| Acceptable | 50-70% | Targeted improvements |
| Good | 70-85% | Minor tuning |
| Excellent | > 85% | Consider semantic equivalence in remaining failures |
## Tips
- **One change at a time**: Make small, focused improvements
- **Check equivalents**: Some "failures" are functionally equivalent commands
- **Test edge cases**: Simple prompts often break on edge cases
- **BSD vs GNU**: macOS uses BSD utilities, not GNU coreutils (see the flag pairs after this list)
- **Keep examples minimal**: Too many examples can confuse small models
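The flag pairs below are standard BSD/GNU utility differences worth encoding as prompt rules; the lookup table itself is only an illustration, not something caro ships:

```rust
// Illustrative lookup of GNU-only spellings and their BSD/macOS
// equivalents, useful when turning failed cases into prompt rules.
const GNU_TO_BSD: &[(&str, &str)] = &[
    ("du --max-depth=1", "du -d 1"),
    ("sed -i 's/a/b/' file", "sed -i '' 's/a/b/' file"),
    ("stat -c %s file", "stat -f %z file"),
    ("date -d yesterday", "date -v-1d"),
];

fn main() {
    for (gnu, bsd) in GNU_TO_BSD {
        println!("GNU: {gnu:<24} BSD/macOS: {bsd}");
    }
}
```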
## Example Session
User: /prompt-tuner


