---
name: prompt-tuner
description: Improve embedded LLM system prompt based on evaluation test failures
---

Prompt Tuner

Iteratively improve the embedded backend's system prompt to increase command generation accuracy.

When to Use

  • "tune the prompt"
  • "improve embedded accuracy"
  • "fix LLM command generation"
  • "run prompt tuning cycle"
  • After evaluation tests show low accuracy

Workflow

Phase 1: Baseline Measurement

Run evaluation tests with embedded backend:

./target/release/caro test --backend embedded

Record (one way to track these across iterations is sketched after the list):

  • Overall accuracy percentage
  • Category breakdown (Website Claim, Natural Variant, Edge Case)
  • List of failed test cases with expected vs actual commands
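
A small record per run keeps these numbers comparable across iterations. A minimal sketch; caro's actual `test` output format may differ, and the type and field names here are hypothetical:

```rust
// Hypothetical bookkeeping types for one tuning iteration; caro's
// actual `test` output format may differ.
struct Baseline {
    overall_accuracy: f64,            // e.g. 0.62 for 62%
    per_category: Vec<(String, f64)>, // ("Website Claim", 0.70), ...
    failures: Vec<Failure>,
}

struct Failure {
    task: String,     // the natural-language input
    expected: String, // command the test expects
    actual: String,   // command the model produced
}

fn main() {
    let run = Baseline {
        overall_accuracy: 0.62, // 62% on the baseline run
        per_category: vec![("Edge Case".to_string(), 0.40)],
        failures: Vec::new(),
    };
    println!("baseline accuracy: {:.0}%", run.overall_accuracy * 100.0);
}
```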

Phase 2: Failure Analysis

For each failed test case, identify the pattern:

| Pattern | Example | Fix |
| --- | --- | --- |
| Wrong path | `find /` instead of `find .` | Add rule: "ALWAYS use current directory '.'" |
| GNU flags | `--max-depth` on macOS | Add rule: "Use BSD-compatible flags" |
| Missing filters | No `-name "*.py"` | Add rule: "Include ALL relevant filters" |
| Time semantics | `-mtime -1` vs `-mtime 1` | Add clear mtime documentation |
| Quote style | Single vs double quotes | Usually equivalent, low priority |
| Flag order | `-type f -name` vs `-name -type f` | Usually equivalent, low priority |

Phase 3: Prompt Improvement

Edit the system prompt in:

src/backends/embedded/embedded_backend.rs

Function: create_system_prompt()

Improvement strategies (a prompt sketch follows this list):

  1. Add explicit rules for common failure patterns
  2. Add concrete examples matching test cases
  3. Use numbered rules with clear priorities
  4. Keep the prompt concise (LLMs perform worse with verbose prompts)
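
For shape, here is a minimal sketch of what such a function can look like, with numbered rules targeting the failure patterns from Phase 2. The rule text is illustrative, not the repository's actual prompt:

```rust
// Illustrative only; the real create_system_prompt() in
// src/backends/embedded/embedded_backend.rs will differ.
fn create_system_prompt() -> String {
    r#"You translate a natural-language task into one safe POSIX command.
Rules (highest priority first):
1. ALWAYS use the current directory '.' unless a path is given.
2. Use BSD-compatible flags; macOS does not ship GNU coreutils.
3. Include ALL filters the task implies (e.g. -name "*.py").
4. -mtime -1 means "within the last 24 hours"; -mtime 1 means
   "between 24 and 48 hours ago".
Example:
Task: find Python files modified today
Command: find . -name "*.py" -mtime -1
Respond with JSON only."#
        .to_string()
}
```

Note how the mtime rule doubles as the "clear mtime documentation" fix from Phase 2, and the single short example follows strategy 4.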

Phase 4: Verification

Build and re-run tests:

cargo build --release
./target/release/caro test --backend embedded

Compare results:

  • Did overall accuracy improve?
  • Which categories improved/regressed?
  • Are remaining failures semantic equivalents (flag order, quotes)? A rough mechanical check is sketched below.
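
The last check can be done by eye or mechanically. A deliberately crude sketch, not part of caro, that treats quote style and flag order as insignificant; note that sorting tokens also discards flag-argument pairing, so anything it accepts still deserves a quick manual look:

```rust
// Crude heuristic, not part of caro: two commands compare equal if
// their tokens match after unifying quotes and ignoring order.
fn normalized(cmd: &str) -> Vec<String> {
    let mut tokens: Vec<String> = cmd
        .split_whitespace()
        .map(|t| t.replace('\'', "\"")) // unify quote style
        .collect();
    tokens.sort(); // ignore flag order (also loses flag-arg pairing)
    tokens
}

fn semantically_equal(expected: &str, actual: &str) -> bool {
    normalized(expected) == normalized(actual)
}

fn main() {
    assert!(semantically_equal(
        r#"find . -type f -name "*.py""#,
        r#"find . -name '*.py' -type f"#,
    ));
    println!("equivalent");
}
```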

Phase 5: Iterate or Commit

If accuracy improved significantly:

git add src/backends/embedded/embedded_backend.rs
git commit -m "feat(prompt): Improve embedded backend accuracy from X% to Y%"

If not improved or regressed:

  • Revert the changes (e.g. git restore src/backends/embedded/embedded_backend.rs)
  • Try a different approach
  • Consider whether the test expectations need adjustment

Success Metrics

| Level | Accuracy | Action |
| --- | --- | --- |
| Poor | < 50% | Major prompt rewrite needed |
| Acceptable | 50-70% | Targeted improvements |
| Good | 70-85% | Minor tuning |
| Excellent | > 85% | Consider semantic equivalence in remaining failures |

Tips

  1. One change at a time: Make small, focused improvements
  2. Check equivalents: Some "failures" are functionally equivalent commands
  3. Test edge cases: Simple prompts often break on edge cases
  4. BSD vs GNU: macOS uses BSD utilities, not GNU coreutils (e.g. GNU's du --max-depth=1 is du -d 1 on macOS)
  5. Keep examples minimal: Too many examples can confuse small models

Example Session

User: /prompt-tuner
