---
name: agent-eval-harness
description: CLI tool for capturing agent trajectories. Execute prompts against headless CLI agents via schema-driven adapters, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring.
compatibility: Bun >= 1.2.9
---

Agent Eval Harness

Purpose

CLI tool for capturing trajectories from headless CLI agents, optimized for TypeScript/JavaScript projects using Bun.

The harness captures. You score.

| Harness Provides | You Provide |
| --- | --- |
| Prompt execution via headless adapters | Scoring logic (Braintrust, custom scripts) |
| Full trajectory capture (thoughts, tools, plans) | Pass/fail determination via graders |
| Structured JSONL output | LLM-as-judge prompts |
| Reproducible execution environment | CI integration, golden file comparison |

Use this when:

  • Capturing trajectories for downstream evaluation
  • Generating training data (SFT/DPO) with full context
  • Building regression test fixtures for agent behavior
  • Comparing agent responses across configurations

Installation

# Run without installing (recommended)
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl

# Or install as project dependency
bun add @plaited/agent-eval-harness

Core Principle: Capture Once, Derive Many Views

flowchart LR
    Prompts["prompts.jsonl"] --> Capture["capture/trials"]
    Schema["headless schema"] --> Capture
    Capture --> Results["results.jsonl (full trajectory)"]
    Results --> Summarize["summarize"]
    Results --> Calibrate["calibrate"]
    Results --> Custom["(your tools)"]
    Summarize --> Views["summary.jsonl / .md"]
    Calibrate --> Report["calibration.md"]
    Custom --> Pipeline["any scoring platform"]

  • Single output format: full trajectory JSONL (always)
  • No --format flag: derive views with separate commands
  • Schema exports: Zod schemas + JSON Schema for any tooling

Commands

Core Commands

| Command | Input | Output | Purpose |
| --- | --- | --- | --- |
| capture | prompts.jsonl + schema | results.jsonl | Trajectory capture (full) |
| trials | prompts.jsonl + schema | trials.jsonl | Multi-run + optional metrics |
| summarize | results.jsonl | summary.jsonl or .md | Derive compact views |
| calibrate | results.jsonl | calibration.md | Sample failures for review |
| validate-refs | prompts.jsonl | validation.jsonl | Check reference solutions |
| balance | prompts.jsonl | balance.json | Analyze test set coverage |
| schemas | (none) | JSON Schema | Export schemas for non-TS users |

Pipeline Commands (Unix-style)

| Command | Input | Output | Purpose |
| --- | --- | --- | --- |
| run | prompts.jsonl + schema | raw.jsonl | Execute prompts, raw output |
| extract | raw.jsonl + schema | extracted.jsonl | Parse trajectories |
| grade | extracted.jsonl + grader | graded.jsonl | Apply grader scoring |
| format | results.jsonl | jsonl/markdown/csv | Convert output format |
| compare | multiple results.jsonl | comparison.json | Compare runs (aggregate report) |

All commands support optional --grader ./grader.ts for scoring.

Capture Command

Basic Usage

bunx @plaited/agent-eval-harness capture <prompts.jsonl> --schema <schema.json> [options]

Arguments

| Argument/Flag | Description | Default |
| --- | --- | --- |
| prompts.jsonl | Input file with prompts to execute | Required |
| -s, --schema | Path to headless adapter schema | Required |
| -o, --output | Output file/path | stdout |
| -c, --cwd | Working directory for agent | current |
| -t, --timeout | Request timeout in ms | 60000 |
| --progress | Show progress to stderr | false |
| --append | Append to output file | false |
| -g, --grader | Path to grader module | none |
| --debug | Show detailed CLI output for debugging | false |

Examples

# Basic capture
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl

# Using a local adapter script
bunx @plaited/agent-eval-harness capture prompts.jsonl bun ./my-adapter.ts -o results.jsonl

# With grader (adds score to each result)
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.ts -o results.jsonl

Trials Command

Run each prompt multiple times for pass@k/pass^k analysis.

# Capture only (no grader)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -o trials.jsonl

# With grader (computes pass@k, pass^k)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 --grader ./grader.ts -o trials.jsonl

Output

Without grader:

{"id":"search-001","input":"Find the CEO","k":5,"trials":[{"trialNum":1,"output":"...","trajectory":[...],"duration":1234},...]}

With grader:

{"id":"search-001","input":"Find the CEO","k":5,"passRate":0.8,"passAtK":0.99,"passExpK":0.33,"trials":[{"trialNum":1,"output":"...","pass":true,"score":1.0},...]}

Summarize Command

Derive compact views from full trajectory results.

# Summary JSONL (for jq analysis)
bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl

# Markdown (for LLM-as-judge)
bunx @plaited/agent-eval-harness summarize results.jsonl --markdown -o results.md

Calibrate Command

Sample failures for grader review. Calibration helps you distinguish agent failures (the agent did the wrong thing) from grader bugs (the agent was correct but the grader was too strict).

# Sample failures for human review
bunx @plaited/agent-eval-harness calibrate results.jsonl --sample 10 -o calibration.md

# Re-score with different grader to compare
bunx @plaited/agent-eval-harness calibrate results.jsonl --grader ./loose-grader.ts --sample 10 -o comparison.md

See eval-concepts.md for why calibration matters.

Validate-Refs Command

Check that reference solutions pass your grader before evaluating agents.

# Validate reference solutions
bunx @plaited/agent-eval-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl

# Check for failures
cat validation.jsonl | jq 'select(.pass == false)'

Why Use This?

If your reference solution fails your own grader:

  • The task definition is ambiguous
  • The grader is too strict
  • The hint is wrong

Fix the eval before evaluating the agent.

Input Format

Prompts must include a reference field:

{"id":"test-001","input":"Create a button component","hint":"<button>","reference":"export const Button = () => <button>Click</button>"}

Output Format

{"id":"test-001","input":"Create a button component","reference":"export const Button = () => <button>Click</button>","pass":true,"score":1.0,"reasoning":"Contains hint content"}

Balance Command

Analyze test set coverage to ensure balanced evaluation.

# Analyze prompt distribution
bunx @plaited/agent-eval-harness balance prompts.jsonl -o balance.json

# Pretty print
bunx @plaited/agent-eval-harness balance prompts.jsonl | jq .

Why Use This?

An eval that only contains "make X work" cases misses "don't break Y" cases. Balance analysis shows:

  • Category distribution (from metadata.category)
  • Positive/negative case ratio
  • Coverage gaps

Output Format

{
  "totalCases": 50,
  "categories": [
    { "name": "ui", "count": 20, "percentage": 40 },
    { "name": "logic", "count": 15, "percentage": 30 },
    { "name": "api", "count": 10, "percentage": 20 },
    { "name": "edge-case", "count": 5, "percentage": 10 }
  ],
  "underrepresented": ["edge-case"],
  "suggestions": ["Consider adding more test cases for: edge-case"]
}
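
Because the report is plain JSON, it is easy to gate CI on coverage. A minimal Bun sketch (field names taken from the sample output above):

// check-balance.ts — fail the build when the test set has coverage gaps
const report = await Bun.file('balance.json').json()

const gaps: string[] = report.underrepresented ?? []
if (gaps.length > 0) {
  console.error(`Underrepresented categories: ${gaps.join(', ')}`)
  for (const suggestion of report.suggestions ?? []) console.error(`  ${suggestion}`)
  process.exit(1)
}

console.log(`Balanced: ${report.totalCases} cases across ${report.categories.length} categories`)

Run it with bun ./check-balance.ts after generating balance.json.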

Balanced Eval Design

Include both positive and negative cases:

| Type | Example | Purpose |
| --- | --- | --- |
| Positive | "Add a login button" | Agent should succeed |
| Negative | "Add a button without breaking tests" | Agent should not break things |
| Edge case | "Handle empty input gracefully" | Agent should be robust |

See eval-concepts.md for more on balanced test sets.

Pipeline Workflow

The pipeline commands enable Unix-style composition for flexible evaluation workflows.

Full Pipeline Example

# Execute → Extract → Grade → Format in one pipeline
cat prompts.jsonl | \
  bunx @plaited/agent-eval-harness run -s claude.json | \
  bunx @plaited/agent-eval-harness extract -s claude.json | \
  bunx @plaited/agent-eval-harness grade -g ./grader.ts | \
  bunx @plaited/agent-eval-harness format -f markdown > report.md

Run Command

Execute prompts and output raw results. Three modes are available:

# Schema mode (recommended)
bunx @plaited/agent-eval-harness run prompts.jsonl --schema claude.json

# Simple mode: {} placeholder substitution
bunx @plaited/agent-eval-harness run prompts.jsonl --simple "claude -p {} --output-format stream-json"

# Shell mode: $PROMPT environment variable
bunx @plaited/agent-eval-harness run prompts.jsonl --shell 'claude -p "$PROMPT" --output-format stream-json'

⚠️ Security Warning: The --simple and --shell modes execute prompts via shell commands. Prompts are escaped, but do not use untrusted prompt content with these modes: malicious prompt text could potentially escape the quoting and execute arbitrary commands. Use --schema mode (headless adapter) for untrusted inputs.

Extract Command

Parse raw output into structured trajectories:

# From file
bunx @plaited/agent-eval-harness extract raw.jsonl --schema claude.json -o extracted.jsonl

# Piped from run
bunx @plaited/agent-eval-harness run prompts.jsonl -s claude.json | \
  bunx @plaited/agent-eval-harness extract -s claude.json

Grade Command

Apply grader to extracted results:

bunx @plaited/agent-eval-harness grade extracted.jsonl --grader ./grader.ts -o graded.jsonl

Format Command

Convert results to different output formats:

# Markdown report
bunx @plaited/agent-eval-harness format results.jsonl --style markdown -o report.md

# CSV for spreadsheets
bunx @plaited/agent-eval-harness format results.jsonl --style csv -o results.csv

# JSONL (pass-through, default)
bunx @plaited/agent-eval-harness format results.jsonl --style jsonl

Compare Command

Compare multiple runs of the same prompts and generate aggregate reports:

# Default: weighted strategy with JSON output
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

# Statistical significance strategy
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl --strategy statistical -o comparison.json

# Custom weights via environment variables
COMPARE_QUALITY=0.7 COMPARE_LATENCY=0.2 COMPARE_RELIABILITY=0.1 \
  bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

# Markdown report format
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl --format markdown -o report.md

# Custom grader (LLM-as-Judge)
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl \
  --strategy custom --grader ./my-llm-judge.ts -o comparison.json

# With explicit labels
bunx @plaited/agent-eval-harness compare \
  --run "with-mcp:results-mcp.jsonl" \
  --run "vanilla:results-vanilla.jsonl" \
  -o comparison.json

Use cases for compare:

  • Same agent, different MCP servers
  • Same agent, different skills enabled
  • Same agent, different model versions
  • Different agents entirely

Built-in Comparison Strategies

| Strategy | Description | When to Use |
| --- | --- | --- |
| weighted (default) | Configurable weights for quality, latency, reliability | Quick comparisons |
| statistical | Bootstrap sampling for confidence intervals | A/B testing with significance |
| custom | Your own LLM-as-Judge or logic-based grader | Semantic evaluation |

Comparison Report Output

The compare command outputs a holistic ComparisonReport JSON:

{
  "meta": { "generatedAt": "...", "runs": ["baseline", "variant"], "promptCount": 100 },
  "quality": { "baseline": { "avgScore": 0.85, "passRate": 0.82 }, "variant": { ... } },
  "performance": { "baseline": { "latency": { "p50": 1200, "p90": 3400 } }, ... },
  "reliability": { "baseline": { "toolErrors": 5, "completionRate": 0.99 }, ... },
  "headToHead": { "pairwise": [{ "runA": "baseline", "runB": "variant", "aWins": 35, "bWins": 55 }] }
}

See comparison-graders.md for complete comparison grader documentation including LLM-as-Judge patterns.

Comparison Grader Interface

import type { ComparisonGrader } from '@plaited/agent-eval-harness/pipeline'

export const grade: ComparisonGrader = async ({ id, input, hint, runs }) => {
  // runs is Record<string, { output, trajectory?, score?, duration?, toolErrors? }>
  // Return rankings from best to worst
  return {
    rankings: [
      { run: 'with-mcp', rank: 1, score: 0.9 },
      { run: 'vanilla', rank: 2, score: 0.7 },
    ],
    reasoning: 'MCP run produced more accurate output'
  }
}

Pipeline Workflow Diagram

flowchart LR
    Prompts["prompts.jsonl"] --> Run["run"]
    Schema["headless schema"] --> Run
    Run --> Raw["raw.jsonl"]
    Raw --> Extract["extract"]
    Schema --> Extract
    Extract --> Extracted["extracted.jsonl"]
    Extracted --> Grade["grade"]
    Grader["grader.ts"] --> Grade
    Grade --> Graded["graded.jsonl"]
    Graded --> Format["format"]
    Format --> Output["report.md / .csv / .jsonl"]

    Graded --> Compare["compare"]
    Results2["other runs..."] --> Compare
    CompareGrader["compare-grader.ts"] --> Compare
    Compare --> Comparison["comparison.jsonl"]

Schemas Command

Export JSON schemas for non-TypeScript tools.

# List available schemas
bunx @plaited/agent-eval-harness schemas

# Export all schemas as JSON
bunx @plaited/agent-eval-harness schemas --json -o schemas.json

# Export specific schema
bunx @plaited/agent-eval-harness schemas CaptureResult --json
bunx @plaited/agent-eval-harness schemas TrialResult --json
bunx @plaited/agent-eval-harness schemas GraderResult --json

Available Schemas

| Schema | Description |
| --- | --- |
| CaptureResult | Single capture output (id, input, output, trajectory, timing) |
| TrialResult | Multi-run trial output (includes passAtK, passExpK) |
| GraderResult | Grader return value (pass, score, reasoning) |
| PromptInput | Input prompt format |
| TrajectoryStep | Single step in trajectory array |
| SummaryResult | Compact summary format |

Usage in Other Languages

Export schemas for validation in Python, Go, etc.:

# Export all schemas
bunx @plaited/agent-eval-harness schemas --json -o schemas.json

# Use in Python with jsonschema
python -c "
import json
from jsonschema import validate

with open('schemas.json') as f:
    schemas = json.load(f)

with open('results.jsonl') as f:
    for line in f:
        result = json.loads(line)
        validate(result, schemas['CaptureResult'])
        print(f'{result[\"id\"]}: valid')
"

Grader Interface

Graders provide semantic pass/fail scoring for captured trajectories. The harness supports graders written in any language.

TypeScript Grader

// my-grader.ts
import type { Grader } from '@plaited/agent-eval-harness/schemas'

export const grade: Grader = async ({ input, output, hint, trajectory }) => {
  const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? '')
  return {
    pass,
    score: pass ? 1 : 0,
    reasoning: pass ? 'Contains hint content' : 'Missing hint content'
  }
}

Note: input can be string (single turn) or string[] (multi-turn). The hint field provides grader context (renamed from expected).
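
Because graders also receive the trajectory, scoring can go beyond string matching. A hedged sketch that additionally fails when any tool call errored (the step shape is assumed from the Output Format example later in this file):

// strict-grader.ts — sketch: hint matching plus a trajectory check
import type { Grader } from '@plaited/agent-eval-harness/schemas'

export const grade: Grader = async ({ output, hint, trajectory }) => {
  const mentionsHint = hint ? output.toLowerCase().includes(hint.toLowerCase()) : true
  // Assumes tool_call steps carry a status field, as in the sample trajectory below
  const failedTools = (trajectory ?? []).filter(
    (step: any) => step.type === 'tool_call' && step.status && step.status !== 'completed',
  )
  const pass = mentionsHint && failedTools.length === 0
  return {
    pass,
    score: pass ? 1 : 0,
    reasoning: pass
      ? 'Contains hint content and all tool calls completed'
      : `mentionsHint=${mentionsHint}, failedToolCalls=${failedTools.length}`,
  }
}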

Python/Executable Graders

Any executable can act as a grader via a stdin/stdout JSON protocol:

#!/usr/bin/env python3
import json, sys

data = json.load(sys.stdin)
output = data.get("output", "").lower()
hint = (data.get("hint") or "").lower()

pass_result = hint in output if hint else True
print(json.dumps({
    "pass": pass_result,
    "score": 1.0 if pass_result else 0.0,
    "reasoning": "Contains hint" if pass_result else "Missing hint"
}))

chmod +x ./grader.py
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.py -o results.jsonl

See inline-graders.md for complete grader documentation including LLM-as-Judge patterns.

Input Format

Each line in prompts.jsonl:

{"id":"test-001","input":"Create a button","hint":"should contain <button>"}
{"id":"test-002","input":["Create a button","Make it blue"],"metadata":{"category":"ui"}}

| Field | Required | Description |
| --- | --- | --- |
| id | Yes | Unique identifier |
| input | Yes | Single prompt (string) or conversation turns (string[]) |
| hint | No | Grader context - what to look for (not strict match) |
| reference | No | Reference solution (for validate-refs) |
| metadata | No | Tags, category, difficulty for filtering |
| timeout | No | Override default timeout for this prompt |

Session behavior: Each JSONL entry = 1 fresh session

  • input: string → 1 session, 1 prompt
  • input: string[] → 1 session, N prompts (sequential turns)
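
Prompt files are plain JSONL, so they can also be generated programmatically. A minimal Bun sketch using the fields from the table above:

// make-prompts.ts — write prompts.jsonl from structured test cases
type Prompt = {
  id: string
  input: string | string[]
  hint?: string
  reference?: string
  metadata?: Record<string, unknown>
  timeout?: number
}

const prompts: Prompt[] = [
  { id: 'ui-001', input: 'Create a button component', hint: '<button>', metadata: { category: 'ui' } },
  { id: 'ctx-001', input: ['Remember this number: 42', 'What number did I ask you to remember?'], hint: '42' },
]

await Bun.write('prompts.jsonl', prompts.map((p) => JSON.stringify(p)).join('\n') + '\n')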

Output Format

Full trajectory JSONL (always):

{
  "id": "test-001",
  "input": "Find the CEO of Anthropic",
  "output": "The CEO of Anthropic is Dario Amodei.",
  "hint": "should mention Dario Amodei",
  "trajectory": [
    {"type": "thought", "content": "I'll search for this...", "timestamp": 100},
    {"type": "tool_call", "name": "WebSearch", "status": "completed", "input": {...}, "output": {...}, "duration": 500},
    {"type": "message", "content": "The CEO of Anthropic is Dario Amodei.", "timestamp": 700}
  ],
  "metadata": {
    "category": "search",
    "agent": "--schema ./claude.json",
    "trajectoryRichness": "full",
    "turnCount": 1
  },
  "timing": {
    "start": 1704067200000,
    "end": 1704067201234,
    "firstResponse": 100,
    "sessionCreation": 234,
    "total": 1234,
    "inputTokens": 150,
    "outputTokens": 85
  },
  "toolErrors": false
}

Output Fields

| Field | Description |
| --- | --- |
| input | Original prompt (string or string[] for multi-turn) |
| hint | Grader context hint (if provided) |
| metadata.trajectoryRichness | "full" \| "messages-only" \| "minimal" |
| metadata.turnCount | Number of conversation turns (1 for string, N for array) |
| timing.sessionCreation | Time to create session (ms) |
| timing.total | Total duration (end - start) |
| timing.inputTokens | Input tokens consumed (if available from adapter) |
| timing.outputTokens | Output tokens generated (if available from adapter) |
| toolErrors | Whether any tool calls failed |

Note: toolErrors replaces the misleading status: 'passed'|'failed' field. Real pass/fail comes from your grader.
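
Because each result is one JSON object per line, deriving your own views takes only a few lines. A sketch that computes aggregate stats with Bun (field names from the example above; pass and score are only present when a grader was used):

// stats.ts — derive a quick aggregate view from results.jsonl
const text = await Bun.file('results.jsonl').text()
const results = text.trim().split('\n').map((line) => JSON.parse(line))

const total = results.length
const toolErrorRuns = results.filter((r) => r.toolErrors).length
const avgDurationMs = results.reduce((sum, r) => sum + (r.timing?.total ?? 0), 0) / total
const graded = results.filter((r) => typeof r.pass === 'boolean')
const passRate = graded.length ? graded.filter((r) => r.pass).length / graded.length : null

console.log({ total, toolErrorRuns, avgDurationMs: Math.round(avgDurationMs), passRate })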

Schema Exports

Consumers can import Zod schemas directly:

import { CaptureResultSchema, TrialResultSchema } from '@plaited/agent-eval-harness/schemas'

// Validate external data
const result = CaptureResultSchema.parse(jsonData)

// Generate JSON Schema (Zod 4 native)
import { z } from 'zod'
const jsonSchema = z.toJSONSchema(CaptureResultSchema)

Or export JSON schemas for non-TypeScript tools:

bunx @plaited/agent-eval-harness schemas --json -o schemas.json
bunx @plaited/agent-eval-harness schemas CaptureResult --json

Execution Environment

Recommendation: Run the harness in Docker containers for consistent, isolated execution.

# Run integration tests via Docker
docker compose -f docker-compose.test.yml run --rm test

# Or with explicit API keys
ANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... docker compose -f docker-compose.test.yml run --rm test

Docker Requirements

| Requirement | Reason |
| --- | --- |
| Node.js 24+ | Gemini CLI uses modern JS features (optional chaining) |
| Non-root user | Claude CLI blocks --dangerously-skip-permissions as root |
| Gemini API key | Pass GEMINI_API_KEY for Gemini CLI |

See docker-evals.md for complete Docker setup guide, debugging tips, and CI integration patterns.

Multi-turn Conversations

Use input: string[] to execute multi-turn conversations within a single session:

{"id":"context-001","input":["Remember this number: 42","What number did I ask you to remember?"],"hint":"42"}
{"id":"context-002","input":["My name is Alice","What is my name?"],"hint":"Alice"}

Run with the headless adapter:

# Using Claude Code via headless adapter
bunx @plaited/agent-eval-harness capture multi-turn.jsonl \
  bunx @plaited/agent-eval-harness headless --schema ./claude-headless.json \
  -o results.jsonl

# Using Gemini CLI via headless adapter
GEMINI_API_KEY=... bunx @plaited/agent-eval-harness capture multi-turn.jsonl \
  bunx @plaited/agent-eval-harness headless --schema ./gemini-headless.json \
  -o results.jsonl

Key points:

  • Each JSONL entry = 1 fresh session
  • input: string[] sends sequential turns to the same session
  • Works with both stream mode (Claude) and iterative mode (Gemini)
  • The adapter handles context preservation automatically

Downstream Integration

The harness outputs standard JSONL that pipes to any tool:

# Filter with jq
cat results.jsonl | jq 'select(.metadata.category == "ui")'

# Count tool usage
cat results.jsonl | jq -s 'map(.trajectory | map(select(.type == "tool_call")) | length) | add'

# Summarize for quick analysis
bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl

# Compare runs with built-in strategies
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

Quick Reference

| Resource | Description |
| --- | --- |
| bunx @plaited/agent-eval-harness | CLI help |
| output-formats.md | JSONL schemas, command details |
| inline-graders.md | Single input/output graders (TypeScript, Python, shell) |
| comparison-graders.md | Comparison strategies (weighted, statistical, LLM-as-Judge) |
| calibration.md | Grader calibration workflow |
| eval-concepts.md | Evaluation concepts (pass@k, pass^k) |
| docker-evals.md | Docker setup, debugging, CI integration |
