
view-results
by METR
Running UK AISI's Inspect in the Cloud
SKILL.md
name: view-results
description: View and analyze Hawk evaluation results. Use when the user wants to see eval-set results, check evaluation status, list samples, view transcripts, or analyze agent behavior from a completed evaluation run.
View Hawk Eval Results
When the user wants to analyze evaluation results, use these hawk CLI commands:
1. List Eval Sets
You can list all eval sets if the user does not know the eval set ID:
hawk list eval-sets
Shows: eval set ID, creation date, creator.
You can increase the number of results returned with --limit N.
hawk list eval-sets --limit 50
Or you can search for a specific eval set with --search QUERY.
hawk list eval-sets --search pico
2. List Evaluations
With an eval set ID, you can list all evaluations in the eval-set:
hawk list evals [EVAL_SET_ID]
Shows: task name, model, status (success/error/cancelled), and sample counts.
3. List Samples
You can also list individual samples and their scores:
hawk list samples [EVAL_SET_ID] [--eval FILE] [--limit N]
4. Download Transcript
To get the full conversation for a specific sample:
hawk transcript <UUID>
The transcript includes full conversation with tool calls, scores, and metadata.
For even more detail, fetch the raw data with --raw:
hawk transcript <UUID> --raw
Batch Transcript Download
You can also download all transcripts for an entire eval set:
# Fetch all samples in an eval set
hawk transcripts <EVAL_SET_ID>
# Write to individual files in a directory
hawk transcripts <EVAL_SET_ID> --output-dir ./transcripts
# Limit number of samples
hawk transcripts <EVAL_SET_ID> --limit 10
# Raw JSON output (one JSON per line to stdout, or .json files with --output-dir)
hawk transcripts <EVAL_SET_ID> --raw
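Because --raw emits one JSON object per line, the output pipes cleanly into standard line-oriented tools. A minimal sketch (the count_samples helper is ours, not part of hawk, and the call is guarded so it is a no-op where hawk is not installed):

```shell
#!/bin/sh
# Count records in a JSONL stream (one transcript per line with --raw).
count_samples() {
  grep -c ''  # counts lines on stdin
}

# Placeholder eval set ID; replace with a real one from `hawk list eval-sets`.
EVAL_SET_ID="${EVAL_SET_ID:-<your-eval-set-id>}"

# Guarded: skip the call entirely when hawk is not on PATH.
if command -v hawk >/dev/null 2>&1; then
  hawk transcripts "$EVAL_SET_ID" --raw | count_samples
fi
```

The same pipeline shape works with jq or awk if you want to pull specific fields out of each record.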
Workflow
1. Run hawk list eval-sets to see available eval sets
2a. Run hawk list evals <EVAL_SET_ID> to see available evaluations
2b. Or run hawk list samples <EVAL_SET_ID> to find samples of interest
3a. Run hawk transcript <UUID> to get full details on a single sample
3b. Or run hawk transcripts <EVAL_SET_ID> --output-dir ./transcripts to download all transcripts
4. Read and analyze the transcript(s) to understand the agent's behavior
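The steps above can be collected into a single helper. This is a sketch, not part of hawk itself: analyze_eval_set is our own name, and the flags used are only those documented in the sections above.

```shell
#!/bin/sh
# Sketch: run the workflow steps for one eval set.
analyze_eval_set() {
  eval_set_id="$1"
  hawk list evals "$eval_set_id"                              # evaluations in the set
  hawk list samples "$eval_set_id" --limit 10                 # a few samples of interest
  hawk transcripts "$eval_set_id" --output-dir ./transcripts  # download all transcripts
  # Then read the files in ./transcripts to analyze the agent's behavior.
}
```

Usage: find an ID with `hawk list eval-sets`, then run `analyze_eval_set <EVAL_SET_ID>`.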
API Environments
Production (https://api.inspect-ai.internal.metr.org) is used by default. Set HAWK_API_URL only when targeting non-production environments:
| Environment | URL |
|---|---|
| Staging | https://api.inspect-ai.staging.metr-dev.org |
| Dev1 | https://api.inspect-ai.dev1.staging.metr-dev.org |
| Dev2 | https://api.inspect-ai.dev2.staging.metr-dev.org |
| Dev3 | https://api.inspect-ai.dev3.staging.metr-dev.org |
| Dev4 | https://api.inspect-ai.dev4.staging.metr-dev.org |
Example:
HAWK_API_URL=https://api.inspect-ai.staging.metr-dev.org hawk list eval-sets
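For a longer session against a non-production environment, you can export the variable once instead of prefixing every command. A sketch using the staging URL from the table above (the hawk call is guarded so the block is a no-op where hawk is not installed):

```shell
#!/bin/sh
# Point every subsequent hawk command in this shell at staging.
export HAWK_API_URL=https://api.inspect-ai.staging.metr-dev.org

if command -v hawk >/dev/null 2>&1; then
  hawk list eval-sets
fi
```

Unset the variable (or start a new shell) to return to production.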


