---
name: golden-dataset-curation
description: Use when creating or improving golden datasets for AI evaluation. Defines quality criteria, curation workflows, and multi-agent analysis patterns for test data.
context: fork
agent: data-pipeline-engineer
version: 1.0.0
author: OrchestKit AI Agent Hub
tags: [golden-dataset, curation, quality, multi-agent, langfuse, 2025]
user-invocable: false
---

# Golden Dataset Curation

Curate high-quality documents for the golden dataset with multi-agent validation.

## Overview

This skill provides patterns and workflows for adding new documents to the golden dataset with thorough quality analysis. It complements `golden-dataset-management`, which handles backup/restore.

**When to use this skill:**

- Adding new documents to the golden dataset
- Classifying content types and difficulty levels
- Generating test queries for new documents
- Running multi-agent quality analysis

## Content Types

| Type | Description | Quality Focus |
|------|-------------|---------------|
| `article` | Technical articles, blog posts | Depth, accuracy, actionability |
| `tutorial` | Step-by-step guides | Completeness, clarity, code quality |
| `research_paper` | Academic papers, whitepapers | Rigor, citations, methodology |
| `documentation` | API docs, reference materials | Accuracy, completeness, examples |
| `video_transcript` | Transcribed video content | Structure, coherence, key points |
| `code_repository` | README, code analysis | Code quality, documentation |
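
If classifier output needs to be validated in code, the taxonomy can be pinned down as an enum. This is a minimal sketch; the `ContentType` name and `parse_content_type` helper are illustrative, not part of the skill:

```python
from enum import Enum

class ContentType(str, Enum):
    """The six content types accepted by the golden dataset."""
    ARTICLE = "article"
    TUTORIAL = "tutorial"
    RESEARCH_PAPER = "research_paper"
    DOCUMENTATION = "documentation"
    VIDEO_TRANSCRIPT = "video_transcript"
    CODE_REPOSITORY = "code_repository"

def parse_content_type(raw: str) -> ContentType:
    """Validate classifier output; raises ValueError on unknown types."""
    return ContentType(raw.strip().lower())
```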

## Difficulty Levels

| Level | Semantic Complexity | Expected Score | Characteristics |
|-------|---------------------|----------------|-----------------|
| trivial | Direct keyword match | >0.85 | Technical terms, exact phrases |
| easy | Common synonyms | >0.70 | Well-known concepts, slight variations |
| medium | Paraphrased intent | >0.55 | Conceptual queries, multi-topic |
| hard | Multi-hop reasoning | >0.40 | Cross-domain, comparative analysis |
| adversarial | Edge cases | Graceful degradation | Robustness tests, off-domain |
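
In evaluation code, the Expected Score column translates into per-level floors. A minimal sketch, assuming retrieval scores are normalized to 0.0-1.0 (`adversarial` has no numeric floor and is judged on graceful degradation instead):

```python
# Minimum expected retrieval score per difficulty level, from the table above.
EXPECTED_SCORE_FLOOR: dict[str, float] = {
    "trivial": 0.85,
    "easy": 0.70,
    "medium": 0.55,
    "hard": 0.40,
}

def meets_expectation(difficulty: str, retrieval_score: float) -> bool:
    """True if a test query's retrieval score clears its difficulty floor.

    Adversarial queries have no floor; they pass this numeric check and
    are evaluated separately for graceful degradation.
    """
    floor = EXPECTED_SCORE_FLOOR.get(difficulty)
    return floor is None or retrieval_score > floor
```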

## Quality Dimensions

| Dimension | Weight | Perfect | Acceptable | Failing |
|-----------|--------|---------|------------|---------|
| Accuracy | 0.25 | 0.95-1.0 | 0.70-0.94 | <0.70 |
| Coherence | 0.20 | 0.90-1.0 | 0.60-0.89 | <0.60 |
| Depth | 0.25 | 0.90-1.0 | 0.55-0.89 | <0.55 |
| Relevance | 0.30 | 0.95-1.0 | 0.70-0.94 | <0.70 |

Evaluation focuses:

- **Accuracy**: Technical correctness, code validity, up-to-date info
- **Coherence**: Logical structure, clear flow, consistent terminology
- **Depth**: Comprehensive coverage, edge cases, appropriate detail
- **Relevance**: Alignment with AI/ML, backend, frontend, DevOps domains
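
The weights above sum to 1.0, so aggregation is a plain weighted average. A minimal sketch (the function name is illustrative):

```python
# Dimension weights from the table above; they sum to 1.0.
WEIGHTS = {"accuracy": 0.25, "coherence": 0.20, "depth": 0.25, "relevance": 0.30}

def weighted_quality(scores: dict[str, float]) -> float:
    """Aggregate per-dimension scores (each 0.0-1.0) into one quality score."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Example using the dimension scores from the Langfuse section below:
weighted_quality({"accuracy": 0.85, "coherence": 0.90, "depth": 0.78, "relevance": 0.92})
# -> 0.8635
```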

## Multi-Agent Pipeline

```
INPUT: URL/Content
        |
        v
+------------------+
|   FETCH AGENT    |  Extract structure, detect type
+--------+---------+
         |
         v
+---------------------------------------------+
|  PARALLEL ANALYSIS AGENTS                   |
|  Quality | Difficulty | Domain | Query Gen  |
+---------------------------------------------+
         |
         v
+------------------+
| CONSENSUS        |  Weighted score + confidence
| AGGREGATOR       |  -> include/review/exclude
+--------+---------+
         |
         v
+------------------+
|  USER APPROVAL   |  Show scores, confirm
+--------+---------+
         |
         v
OUTPUT: Curated document entry
```
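
A minimal sketch of the fan-out step using `asyncio.gather`. The four agent functions here are hypothetical stubs; in the real pipeline they would invoke the managed prompts listed under Langfuse Integration:

```python
import asyncio
from typing import Any

# Hypothetical stand-ins for the four analysis agents; the real skill
# calls LLM-backed managed prompts (see Managed Prompts below).
async def quality_agent(content: str) -> dict[str, float]:
    return {"accuracy": 0.85, "coherence": 0.90, "depth": 0.78, "relevance": 0.92}

async def difficulty_agent(content: str) -> str:
    return "medium"

async def domain_agent(content: str) -> list[str]:
    return ["ai-ml", "backend"]

async def query_agent(content: str) -> list[str]:
    return ["How does consensus aggregation decide inclusion?"]

async def analyze(content: str) -> dict[str, Any]:
    """Fan out the four analysis agents concurrently, as in the diagram."""
    quality, difficulty, tags, queries = await asyncio.gather(
        quality_agent(content),
        difficulty_agent(content),
        domain_agent(content),
        query_agent(content),
    )
    return {"quality": quality, "difficulty": difficulty,
            "tags": tags, "queries": queries}

print(asyncio.run(analyze("...document text...")))
```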

## Decision Thresholds

| Quality Score | Confidence | Decision |
|---------------|------------|----------|
| >= 0.75 | >= 0.70 | include |
| >= 0.55 | any | review |
| < 0.55 | any | exclude |
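
As code, the table reads top to bottom: a high-quality document with low consensus confidence falls through to `review`. A minimal sketch:

```python
def curation_decision(quality_score: float, confidence: float) -> str:
    """Map aggregated quality and consensus confidence to a curation decision."""
    if quality_score >= 0.75 and confidence >= 0.70:
        return "include"
    if quality_score >= 0.55:
        return "review"
    return "exclude"

assert curation_decision(0.80, 0.50) == "review"   # good score, shaky consensus
assert curation_decision(0.50, 0.95) == "exclude"
```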

## Quality Thresholds

```yaml
# Recommended thresholds for golden dataset inclusion
minimum_quality_score: 0.70
minimum_confidence: 0.65
required_tags: 2          # At least 2 domain tags
required_queries: 3       # At least 3 test queries
```
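
These thresholds gate inclusion independently of the decision table above. A minimal sketch of the check, assuming a candidate dict with these field names (the names are assumptions):

```python
def passes_inclusion_gate(candidate: dict) -> bool:
    """Check a candidate entry against the recommended thresholds."""
    return (
        candidate["quality_score"] >= 0.70
        and candidate["confidence"] >= 0.65
        and len(candidate["tags"]) >= 2       # at least 2 domain tags
        and len(candidate["queries"]) >= 3    # at least 3 test queries
    )
```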

## Coverage Balance Guidelines

Maintain balanced coverage across:

- **Content types**: Don't over-index on articles
- **Difficulty levels**: Need trivial AND hard queries
- **Domains**: Spread across AI/ML, backend, frontend, etc.
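
Imbalance is easy to spot with a frequency tally. A minimal sketch, assuming each dataset entry carries `content_type`, `tags`, and per-query `difficulty` fields (the field names are assumptions):

```python
from collections import Counter

def coverage_report(dataset: list[dict]) -> dict[str, Counter]:
    """Tally content types, difficulties, and domains to spot imbalance."""
    return {
        "content_types": Counter(doc["content_type"] for doc in dataset),
        "difficulties": Counter(q["difficulty"] for doc in dataset
                                for q in doc["queries"]),
        "domains": Counter(tag for doc in dataset for tag in doc["tags"]),
    }
```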

## Duplicate Prevention Checklist

Before adding:

1. Check the URL against the existing `source_url_map.json`
2. Run semantic similarity against existing document embeddings
3. Warn if >80% similar to an existing document
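
Step 2 reduces to a single matrix product once embeddings are L2-normalized. A minimal numpy sketch; the real skill delegates this to `pgvector-search` (see References):

```python
import numpy as np

DUPLICATE_WARN_THRESHOLD = 0.80  # warn above 80% similarity

def max_similarity(candidate: np.ndarray, existing: np.ndarray) -> float:
    """Highest cosine similarity between a candidate embedding of shape (d,)
    and the existing document embeddings of shape (n, d)."""
    existing_norm = existing / np.linalg.norm(existing, axis=1, keepdims=True)
    candidate_norm = candidate / np.linalg.norm(candidate)
    return float((existing_norm @ candidate_norm).max())

# Usage: warn the curator before inserting a near-duplicate.
# if max_similarity(new_embedding, corpus_embeddings) > DUPLICATE_WARN_THRESHOLD:
#     print("Possible duplicate - review before adding")
```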

## Provenance Tracking

Always record:

- Source URL (canonical)
- Curation date
- Agent scores (for audit trail)
- Langfuse trace ID
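
One way to keep these four facts together is a small record type. The class and field names below are illustrative; the skill only specifies which facts to record:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProvenanceRecord:
    """Audit-trail metadata stored alongside every curated document."""
    source_url: str                 # canonical URL
    curation_date: date
    agent_scores: dict[str, float]  # per-dimension scores for the audit trail
    langfuse_trace_id: str
```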

## Langfuse Integration

### Trace Structure

```python
from langfuse import Langfuse

langfuse = Langfuse()  # credentials read from LANGFUSE_* environment variables

# url and doc_id come from the curation pipeline input
trace = langfuse.trace(
    name="golden-dataset-curation",
    metadata={"source_url": url, "document_id": doc_id},
)

# Log individual dimension scores
trace.score(name="accuracy", value=0.85)
trace.score(name="coherence", value=0.90)
trace.score(name="depth", value=0.78)
trace.score(name="relevance", value=0.92)

# Final aggregated score and the curation decision
trace.score(name="quality_total", value=0.87)
trace.event(name="curation_decision", metadata={"decision": "include"})
```

### Managed Prompts

| Prompt Name | Purpose |
|-------------|---------|
| `golden-content-classifier` | Classify `content_type` |
| `golden-difficulty-classifier` | Assign difficulty |
| `golden-domain-tagger` | Extract tags |
| `golden-query-generator` | Generate test queries |
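
Managed prompts are fetched at runtime rather than hardcoded. A minimal sketch using the Langfuse Python SDK's `get_prompt`; the `content` template variable is an assumption about how these prompts are parameterized:

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch the current version of the classifier prompt from Langfuse.
prompt = langfuse.get_prompt("golden-content-classifier")

# compile() substitutes template variables into the prompt text.
classifier_input = prompt.compile(content="...document text...")
```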

## References

For detailed implementation patterns, see:

- `references/selection-criteria.md` - Content type classification, difficulty stratification, quality evaluation dimensions, and best practices
- `references/annotation-patterns.md` - Multi-agent pipeline architecture, agent specifications, consensus aggregation logic, and Langfuse integration

Related skills:

- `golden-dataset-management` - Backup/restore operations
- `golden-dataset-validation` - Validation rules and checks
- `langfuse-observability` - Tracing patterns
- `pgvector-search` - Duplicate detection

Version: 1.0.0 (December 2025)
Issue: #599

## Capability Details

### content-classification

Keywords: content type, classification, document type, golden dataset

Solves:

- Classify document content types for the golden dataset
- Categorize entries by domain and purpose
- Identify content requiring special handling

### difficulty-stratification

Keywords: difficulty, stratification, complexity level, challenge rating

Solves:

- Assign difficulty levels to golden dataset entries
- Ensure balanced difficulty distribution
- Identify edge cases and challenging examples

### quality-evaluation

Keywords: quality, evaluation, quality dimensions, quality criteria

Solves:

- Evaluate entry quality against defined criteria
- Score entries on multiple quality dimensions
- Identify entries needing improvement

### multi-agent-analysis

Keywords: multi-agent, parallel analysis, consensus, agent evaluation

Solves:

- Run parallel agent evaluations on entries
- Aggregate consensus from multiple analysts
- Resolve disagreements in classifications
