
SKILL.md
---
name: golden-dataset-curation
description: Use when creating or improving golden datasets for AI evaluation. Defines quality criteria, curation workflows, and multi-agent analysis patterns for test data.
context: fork
agent: data-pipeline-engineer
version: 1.0.0
author: OrchestKit AI Agent Hub
tags: [golden-dataset, curation, quality, multi-agent, langfuse, 2025]
user-invocable: false
---
Golden Dataset Curation
Curate high-quality documents for the golden dataset with multi-agent validation
Overview
This skill provides patterns and workflows for adding new documents to the golden dataset with thorough quality analysis. It complements golden-dataset-management, which handles backup and restore.
When to use this skill:
- Adding new documents to the golden dataset
- Classifying content types and difficulty levels
- Generating test queries for new documents
- Running multi-agent quality analysis
Content Types
| Type | Description | Quality Focus |
|---|---|---|
| article | Technical articles, blog posts | Depth, accuracy, actionability |
| tutorial | Step-by-step guides | Completeness, clarity, code quality |
| research_paper | Academic papers, whitepapers | Rigor, citations, methodology |
| documentation | API docs, reference materials | Accuracy, completeness, examples |
| video_transcript | Transcribed video content | Structure, coherence, key points |
| code_repository | README, code analysis | Code quality, documentation |
Difficulty Levels
| Level | Semantic Complexity | Expected Score | Characteristics |
|---|---|---|---|
| trivial | Direct keyword match | >0.85 | Technical terms, exact phrases |
| easy | Common synonyms | >0.70 | Well-known concepts, slight variations |
| medium | Paraphrased intent | >0.55 | Conceptual queries, multi-topic |
| hard | Multi-hop reasoning | >0.40 | Cross-domain, comparative analysis |
| adversarial | Edge cases | Graceful degradation | Robustness tests, off-domain |
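As a rough sketch (names here are illustrative, not part of the skill's API), the expected-score floors above reduce to a per-difficulty check; adversarial queries are judged on graceful degradation rather than a numeric floor:

```python
# Expected-score floors per difficulty level, mirroring the table above.
EXPECTED_FLOOR = {"trivial": 0.85, "easy": 0.70, "medium": 0.55, "hard": 0.40}

def meets_expectation(difficulty: str, score: float) -> bool:
    """Adversarial entries are judged qualitatively, not by a score floor."""
    if difficulty == "adversarial":
        return True
    return score > EXPECTED_FLOOR[difficulty]
```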
Quality Dimensions
| Dimension | Weight | Perfect | Acceptable | Failing |
|---|---|---|---|---|
| Accuracy | 0.25 | 0.95-1.0 | 0.70-0.94 | <0.70 |
| Coherence | 0.20 | 0.90-1.0 | 0.60-0.89 | <0.60 |
| Depth | 0.25 | 0.90-1.0 | 0.55-0.89 | <0.55 |
| Relevance | 0.30 | 0.95-1.0 | 0.70-0.94 | <0.70 |
Evaluation focuses:
- Accuracy: Technical correctness, code validity, up-to-date info
- Coherence: Logical structure, clear flow, consistent terminology
- Depth: Comprehensive coverage, edge cases, appropriate detail
- Relevance: Alignment with AI/ML, backend, frontend, DevOps domains
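The weights above imply a simple weighted sum. This sketch (function name illustrative) shows the aggregation and cross-checks the example scores logged in the Langfuse section below:

```python
# Weighted aggregation implied by the dimension table above; weights sum to 1.0.
WEIGHTS = {"accuracy": 0.25, "coherence": 0.20, "depth": 0.25, "relevance": 0.30}

def quality_total(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Example: the dimension scores from the Langfuse section (0.85, 0.90,
# 0.78, 0.92) aggregate to ~0.86; the 0.87 logged there suggests the
# production aggregator rounds or weights slightly differently.
```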
Multi-Agent Pipeline
```
INPUT: URL/Content
         |
         v
+------------------+
|   FETCH AGENT    |  Extract structure, detect type
+--------+---------+
         |
         v
+-----------------------------------------------+
|           PARALLEL ANALYSIS AGENTS            |
|   Quality | Difficulty | Domain | Query Gen   |
+-----------------------------------------------+
         |
         v
+------------------+
|    CONSENSUS     |  Weighted score + confidence
|    AGGREGATOR    |  -> include/review/exclude
+--------+---------+
         |
         v
+------------------+
|  USER APPROVAL   |  Show scores, confirm
+--------+---------+
         |
         v
OUTPUT: Curated document entry
```
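A minimal sketch of the fan-out step, assuming each analysis agent is exposed as an async callable (agent names follow the diagram; the signatures are illustrative):

```python
import asyncio

async def run_parallel_analysis(document: str, agents: dict) -> dict:
    """Fan one document out to all analysis agents and collect results.

    `agents` maps a name ("quality", "difficulty", "domain", "query_gen")
    to an async callable that takes the document text.
    """
    names = list(agents)
    results = await asyncio.gather(*(agents[name](document) for name in names))
    return dict(zip(names, results))
```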
Decision Thresholds
| Quality Score | Confidence | Decision |
|---|---|---|
| >= 0.75 | >= 0.70 | include |
| >= 0.55 | any | review |
| < 0.55 | any | exclude |
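The decision table maps directly to a small function (a sketch; the real aggregator may add tie-breaking rules):

```python
def curation_decision(quality_score: float, confidence: float) -> str:
    """Map consensus score and confidence to include/review/exclude."""
    if quality_score >= 0.75 and confidence >= 0.70:
        return "include"
    if quality_score >= 0.55:
        return "review"  # includes high scores with low confidence
    return "exclude"
```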
Quality Thresholds
```yaml
# Recommended thresholds for golden dataset inclusion
minimum_quality_score: 0.70
minimum_confidence: 0.65
required_tags: 2      # At least 2 domain tags
required_queries: 3   # At least 3 test queries
```
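Applied together, the thresholds act as a single inclusion gate. A sketch, assuming the candidate entry carries its consensus scores, tags, and generated queries (field names are illustrative):

```python
def passes_inclusion_gates(doc: dict, cfg: dict) -> bool:
    """Apply the recommended thresholds to a candidate entry."""
    return (
        doc["quality_score"] >= cfg["minimum_quality_score"]
        and doc["confidence"] >= cfg["minimum_confidence"]
        and len(doc["tags"]) >= cfg["required_tags"]
        and len(doc["queries"]) >= cfg["required_queries"]
    )
```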
Coverage Balance Guidelines
Maintain balanced coverage across:
- Content types: Don't over-index on articles
- Difficulty levels: Need trivial AND hard queries
- Domains: Spread across AI/ML, backend, frontend, etc.
Duplicate Prevention Checklist
Before adding:
- Check URL against existing source_url_map.json
- Run semantic similarity against existing document embeddings
- Warn if >80% similar to an existing document
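An in-memory sketch of the similarity check; in practice this would run as a pgvector query (see the pgvector-search skill). The 0.80 default matches the warning threshold above:

```python
import numpy as np

def is_near_duplicate(candidate: np.ndarray, existing: np.ndarray,
                      threshold: float = 0.80) -> bool:
    """Cosine similarity of a candidate embedding against all stored ones.

    `existing` is an (n, d) matrix of document embeddings. This in-memory
    version is illustrative only.
    """
    norms = np.linalg.norm(existing, axis=1) * np.linalg.norm(candidate)
    sims = (existing @ candidate) / norms
    return bool(sims.max() > threshold)
```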
Provenance Tracking
Always record:
- Source URL (canonical)
- Curation date
- Agent scores (for audit trail)
- Langfuse trace ID
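One possible shape for the provenance fields (illustrative; the skill does not prescribe a schema):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProvenanceRecord:
    source_url: str                 # canonical URL
    curation_date: date
    agent_scores: dict[str, float]  # audit trail of per-agent scores
    langfuse_trace_id: str
```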
Langfuse Integration
Trace Structure
```python
from langfuse import Langfuse

langfuse = Langfuse()  # credentials come from LANGFUSE_* env vars

trace = langfuse.trace(
    name="golden-dataset-curation",
    metadata={"source_url": url, "document_id": doc_id},
)

# Log individual dimension scores
trace.score(name="accuracy", value=0.85)
trace.score(name="coherence", value=0.90)
trace.score(name="depth", value=0.78)
trace.score(name="relevance", value=0.92)

# Final aggregated score and the resulting decision
trace.score(name="quality_total", value=0.87)
trace.event(name="curation_decision", metadata={"decision": "include"})
```
Managed Prompts
| Prompt Name | Purpose |
|---|---|
| golden-content-classifier | Classify content_type |
| golden-difficulty-classifier | Assign difficulty |
| golden-domain-tagger | Extract tags |
| golden-query-generator | Generate test queries |
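Managed prompts are fetched at runtime rather than hardcoded. A sketch using the Langfuse Python SDK; the {{content}} template variable is an assumption about how these prompts are parameterized:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* environment variables

def classifier_prompt(document_text: str) -> str:
    """Fetch the managed classifier prompt and fill its template."""
    prompt = langfuse.get_prompt("golden-content-classifier")
    # compile() substitutes template variables; {{content}} is assumed
    return prompt.compile(content=document_text)
```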
References
For detailed implementation patterns, see:
- references/selection-criteria.md - Content type classification, difficulty stratification, quality evaluation dimensions, and best practices
- references/annotation-patterns.md - Multi-agent pipeline architecture, agent specifications, consensus aggregation logic, and Langfuse integration
Related Skills
- golden-dataset-management - Backup/restore operations
- golden-dataset-validation - Validation rules and checks
- langfuse-observability - Tracing patterns
- pgvector-search - Duplicate detection
Version: 1.0.0 (December 2025) Issue: #599
Capability Details
content-classification
Keywords: content type, classification, document type, golden dataset
Solves:
- Classify document content types for golden dataset
- Categorize entries by domain and purpose
- Identify content requiring special handling
difficulty-stratification
Keywords: difficulty, stratification, complexity level, challenge rating
Solves:
- Assign difficulty levels to golden dataset entries
- Ensure balanced difficulty distribution
- Identify edge cases and challenging examples
quality-evaluation
Keywords: quality, evaluation, quality dimensions, quality criteria
Solves:
- Evaluate entry quality against defined criteria
- Score entries on multiple quality dimensions
- Identify entries needing improvement
multi-agent-analysis
Keywords: multi-agent, parallel analysis, consensus, agent evaluation
Solves:
- Run parallel agent evaluations on entries
- Aggregate consensus from multiple analysts
- Resolve disagreements in classifications
