
dataset-engineering
by ScientiaCapital
MCP server for LLM fine-tuning with Unsloth. 33 tools, 180 tests, RunPod GPU integration. Fine-tune 2x faster with 80% less memory.
SKILL.md
name: dataset-engineering
description: Create, clean, and optimize datasets for LLM fine-tuning. Covers formats (Alpaca, ShareGPT, ChatML), synthetic data generation, quality assessment, and augmentation. Use when preparing data for training.
Dataset Engineering
Complete guide for creating, cleaning, and optimizing datasets for LLM fine-tuning.
Overview
Quality data >> model size. This skill covers:
- Dataset formats - Alpaca, ShareGPT, ChatML, custom
- Data generation - Synthetic data with Claude/GPT-4
- Cleaning & filtering - Remove noise, duplicates, low-quality
- Augmentation - Expand datasets effectively
- Quality assessment - Measure and improve data quality
- Splitting strategies - Train/val/test splits
- HuggingFace integration - Load, transform, upload datasets
Quick Start
Format Existing Data (Alpaca)
# Convert your data to Alpaca format
import json

data = [
    {
        "instruction": "What is the capital of France?",
        "input": "",
        "output": "The capital of France is Paris."
    },
    {
        "instruction": "Translate to Spanish",
        "input": "Hello, how are you?",
        "output": "Hola, ¿cómo estás?"
    }
]

with open("dataset.json", "w") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)  # keep non-ASCII text readable
Load and Use with Unsloth
from datasets import load_dataset

# Load dataset
dataset = load_dataset("json", data_files="dataset.json", split="train")

# Format each example as an Alpaca-style prompt string
def formatting_func(examples):
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"],
        examples["input"],
        examples["output"],
    ):
        text = f"### Instruction:\n{instruction}\n\n"
        if input_text:
            text += f"### Input:\n{input_text}\n\n"
        text += f"### Response:\n{output}"
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(formatting_func, batched=True)
# The formatted dataset can now be passed to Unsloth / TRL's SFTTrainer
Generate Synthetic Data
import anthropic

client = anthropic.Anthropic(api_key="sk-...")

def generate_training_examples(topic: str, num_examples: int = 10):
    """Generate synthetic training data using Claude"""
    prompt = f"""Generate {num_examples} high-quality question-answer pairs about {topic}.

Format each as JSON:
{{
    "instruction": "The question or task",
    "input": "",
    "output": "The detailed answer"
}}

Make answers informative, accurate, and varied in style."""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4000,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse JSON from response
    return parse_json_examples(response.content[0].text)
# Generate medical Q&A data
medical_data = generate_training_examples("medical diagnosis", num_examples=100)
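The snippet above relies on a parse_json_examples helper that the original never defines. A minimal sketch (the helper and its regex strategy are our assumptions, not part of the original toolchain):

import json
import re

def parse_json_examples(text: str) -> list:
    """Extract training examples from a model reply that contains JSON.
    Handles either a single JSON array or a sequence of flat JSON objects."""
    # Try a whole JSON array first
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    # Fall back to individual {...} objects
    return [json.loads(obj) for obj in re.findall(r"\{[^{}]*\}", text)]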
Dataset Formats
1. Alpaca Format
Best for: Instruction following, Q&A
alpaca_format = {
    "instruction": "The task or question",
    "input": "Optional context or input",
    "output": "The desired response"
}

# Example
{
    "instruction": "Explain photosynthesis",
    "input": "",
    "output": "Photosynthesis is the process by which plants..."
}

# With input field
{
    "instruction": "Summarize the following text",
    "input": "Long text here...",
    "output": "Summary here..."
}
2. ShareGPT Format
Best for: Multi-turn conversations, chat models
sharegpt_format = {
    "conversations": [
        {"from": "human", "value": "Hello!"},
        {"from": "gpt", "value": "Hi! How can I help?"},
        {"from": "human", "value": "What's 2+2?"},
        {"from": "gpt", "value": "2+2 equals 4."}
    ]
}

# Use with Unsloth to convert "from"/"value" keys to the standard "role"/"content" schema
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
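Once standardized, the conversations can be rendered into training text with the model tokenizer's chat template. A minimal sketch, assuming a HuggingFace tokenizer with a chat template set and the conversations column from the example above:

def apply_template(example, tokenizer):
    # Render a role/content message list into a single training string
    return {
        "text": tokenizer.apply_chat_template(
            example["conversations"],
            tokenize=False,
        )
    }

dataset = dataset.map(lambda ex: apply_template(ex, tokenizer))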
3. ChatML Format
Best for: OpenAI-style chat, system prompts
chatml_format = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is AI?"},
        {"role": "assistant", "content": "AI stands for Artificial Intelligence..."}
    ]
}

# Convert to training text using ChatML's <|im_start|>/<|im_end|> delimiters
def chatml_to_text(example):
    text = ""
    for msg in example["messages"]:
        text += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    return {"text": text}
4. Custom Domain Format
Best for: Specialized tasks, domain-specific
# Medical diagnosis format
medical_format = {
    "patient_symptoms": "Fever, cough, fatigue",
    "medical_history": "No prior conditions",
    "vital_signs": "Temp: 101°F, BP: 120/80",
    "diagnosis": "Likely viral infection",
    "treatment_plan": "Rest, fluids, monitor for 48 hours"
}

# Legal document format
legal_format = {
    "document_type": "Contract Review",
    "clauses": ["Clause 1...", "Clause 2..."],
    "issues_found": ["Issue 1", "Issue 2"],
    "recommendations": ["Recommendation 1", "Recommendation 2"]
}
Synthetic Data Generation
Generate with Claude
import anthropic
import json

class SyntheticDataGenerator:
    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)

    def generate_qa_pairs(
        self,
        domain: str,
        num_examples: int,
        difficulty: str = "mixed"
    ):
        """Generate Q&A pairs for a specific domain"""
        prompt = f"""Generate {num_examples} diverse question-answer pairs about {domain}.

Requirements:
- Difficulty: {difficulty}
- Varied question types (what, how, why, compare, analyze)
- Detailed, accurate answers (100-300 words)
- Cover different aspects of {domain}

Output as JSON array:
[
    {{
        "instruction": "question here",
        "input": "",
        "output": "detailed answer here"
    }}
]"""
        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=8000,
            messages=[{"role": "user", "content": prompt}]
        )
        return json.loads(response.content[0].text)

    def generate_conversations(
        self,
        scenario: str,
        num_conversations: int,
        turns_per_conversation: int = 4
    ):
        """Generate multi-turn conversations"""
        prompt = f"""Generate {num_conversations} realistic conversations for: {scenario}

Each conversation should have {turns_per_conversation} turns (back-and-forth).

Format as JSON:
[
    {{
        "conversations": [
            {{"from": "human", "value": "..."}},
            {{"from": "gpt", "value": "..."}},
            ...
        ]
    }}
]

Make conversations natural, varied, and realistic."""
        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=8000,
            messages=[{"role": "user", "content": prompt}]
        )
        return json.loads(response.content[0].text)

# Usage
generator = SyntheticDataGenerator("sk-...")

# Generate 100 medical Q&A pairs
medical_data = generator.generate_qa_pairs(
    domain="medical diagnosis",
    num_examples=100,
    difficulty="mixed"
)

# Generate customer support conversations
support_data = generator.generate_conversations(
    scenario="technical support for a SaaS product",
    num_conversations=50,
    turns_per_conversation=6
)
Generate with GPT-4
from openai import OpenAI
import json

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_with_gpt4(domain: str, num_examples: int):
    """Generate training data with GPT-4"""
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{
            "role": "user",
            "content": f"""Generate {num_examples} training examples for {domain}.
Output as JSON array with instruction/input/output format."""
        }],
        temperature=0.8  # Higher for diversity
    )
    return json.loads(response.choices[0].message.content)
Self-Instruct Method
Generate data from existing data:
def self_instruct_augmentation(base_examples: list, multiplier: int = 3):
    """
    Generate variations of existing examples using Claude
    """
    augmented = []
    for example in base_examples:
        prompt = f"""Given this training example:

Instruction: {example['instruction']}
Output: {example['output']}

Generate {multiplier} similar but distinct examples that:
1. Cover the same concept
2. Use different wording
3. Vary in complexity
4. Include different examples

Output as JSON array."""

        # Call Claude and parse the returned JSON array
        variations = json.loads(call_claude(prompt))
        augmented.extend(variations)
    return augmented
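call_claude is used here and in later snippets but never defined in the original. A minimal sketch using the Anthropic SDK (the model and max_tokens choices are our assumptions):

import anthropic

_claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_claude(prompt: str, max_tokens: int = 4000) -> str:
    """Send a single-turn prompt to Claude and return the text of the reply."""
    response = _claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text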
Data Cleaning & Filtering
Remove Low-Quality Examples
def filter_quality(dataset):
    """Filter out low-quality examples"""
    def is_high_quality(example):
        instruction = example["instruction"]
        output = example["output"]

        # Too short
        if len(output) < 20:
            return False

        # Too long (likely copy-paste dumps)
        if len(output) > 2000:
            return False

        # No instruction
        if not instruction or len(instruction) < 5:
            return False

        # Output is just "I don't know" or similar
        low_quality_responses = [
            "i don't know",
            "not sure",
            "no idea",
            "cannot answer"
        ]
        if output.lower().strip() in low_quality_responses:
            return False

        # Instruction and output are identical (copy-paste error)
        if instruction.strip() == output.strip():
            return False

        return True

    return dataset.filter(is_high_quality)
Remove Duplicates
def remove_duplicates(dataset):
    """Remove exact-duplicate examples"""
    seen = set()
    unique_examples = []

    for example in dataset:
        # Key on instruction + output
        content = f"{example['instruction']}|||{example['output']}"
        if content not in seen:
            seen.add(content)
            unique_examples.append(example)

    print(f"Removed {len(dataset) - len(unique_examples)} duplicates")
    return unique_examples

# Near-duplicate detection (fuzzy matching)
from difflib import SequenceMatcher

def remove_near_duplicates(dataset, similarity_threshold=0.9):
    """Remove examples whose instructions are too similar (O(n^2); fine for small sets)"""
    unique = []
    for example in dataset:
        is_duplicate = False
        for unique_example in unique:
            # Compare instruction similarity
            similarity = SequenceMatcher(
                None,
                example["instruction"],
                unique_example["instruction"]
            ).ratio()
            if similarity > similarity_threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            unique.append(example)

    print(f"Removed {len(dataset) - len(unique)} near-duplicates")
    return unique
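The pairwise loop above is quadratic and slows down badly past a few thousand examples. A common scalable alternative, not part of the original toolchain, is MinHash LSH via the datasketch package; a sketch:

from datasketch import MinHash, MinHashLSH

def remove_near_duplicates_lsh(dataset, threshold=0.9, num_perm=128):
    """Approximate near-duplicate removal in roughly linear time with MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    unique = []
    for i, example in enumerate(dataset):
        m = MinHash(num_perm=num_perm)
        for token in example["instruction"].lower().split():
            m.update(token.encode("utf8"))
        if lsh.query(m):  # a similar instruction is already indexed
            continue
        lsh.insert(str(i), m)
        unique.append(example)
    print(f"Removed {len(dataset) - len(unique)} near-duplicates")
    return unique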
Filter by Language
from langdetect import detect

def filter_by_language(dataset, target_language="en"):
    """Keep only examples in target language"""
    def is_target_language(example):
        try:
            # Check both instruction and output
            inst_lang = detect(example["instruction"])
            out_lang = detect(example["output"])
            return inst_lang == target_language and out_lang == target_language
        except Exception:
            return False  # If detection fails, exclude

    return dataset.filter(is_target_language)
Remove PII (Personal Information)
import re

def remove_pii(text: str) -> str:
    """Remove personally identifiable information"""
    # Email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    # Phone numbers (US format)
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    # Social Security Numbers
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    # Credit card numbers
    text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', text)
    return text

def sanitize_dataset(dataset):
    """Remove PII from entire dataset"""
    def sanitize_example(example):
        return {
            "instruction": remove_pii(example["instruction"]),
            "input": remove_pii(example.get("input", "")),
            "output": remove_pii(example["output"])
        }
    return dataset.map(sanitize_example)
Data Augmentation
Paraphrase Augmentation
def paraphrase_augmentation(dataset, augmentation_factor=2):
    """Generate paraphrased versions of examples"""
    augmented = []
    for example in dataset:
        # Keep original
        augmented.append(example)
        # Generate paraphrases
        for _ in range(augmentation_factor - 1):
            paraphrased = generate_paraphrase(example)
            augmented.append(paraphrased)
    return augmented

def generate_paraphrase(example):
    """Use Claude to paraphrase an example"""
    prompt = f"""Paraphrase this training example while keeping the same meaning:

Instruction: {example['instruction']}
Output: {example['output']}

Provide a paraphrased version with:
- Different wording
- Same core meaning
- Natural language

Output as JSON with instruction and output fields."""

    # Call Claude and parse the JSON reply
    response = call_claude(prompt)
    return json.loads(response)
Back-Translation
from googletrans import Translator

def back_translation_augmentation(text: str, intermediate_lang='es'):
    """Augment via back-translation (English -> Spanish -> English)"""
    translator = Translator()

    # Translate to intermediate language
    intermediate = translator.translate(text, dest=intermediate_lang).text

    # Translate back to English
    back_translated = translator.translate(intermediate, dest='en').text

    return back_translated

# Apply to dataset
def augment_via_back_translation(dataset):
    augmented = []
    for example in dataset:
        # Original
        augmented.append(example)
        # Back-translated version
        augmented.append({
            "instruction": back_translation_augmentation(example["instruction"]),
            "input": example.get("input", ""),
            "output": back_translation_augmentation(example["output"])
        })
    return augmented
Difficulty Variation
def vary_difficulty(example, target_difficulty: str):
    """Generate an easier or harder version of an example"""
    prompt = f"""Given this example:

Instruction: {example['instruction']}
Output: {example['output']}

Create a {target_difficulty} version:
- Easier: Simplify concepts, use basic language, shorter
- Harder: Add complexity, technical terms, deeper analysis

Output as JSON."""

    response = call_claude(prompt)
    return json.loads(response)

# Generate difficulty variations
def create_difficulty_variants(dataset):
    augmented = []
    for example in dataset:
        # Original (medium)
        augmented.append(example)
        # Easier version
        augmented.append(vary_difficulty(example, "easier"))
        # Harder version
        augmented.append(vary_difficulty(example, "harder"))
    return augmented
Quality Assessment
Automated Quality Scoring
class QualityScorer:
    def __init__(self):
        self.criteria = {
            "length": (50, 500),  # Ideal output length
            "instruction_length": (10, 200),
            "readability": 60,  # Flesch reading ease (placeholder, not scored below)
            "coherence": 0.7,  # Sentence similarity (placeholder, not scored below)
        }

    def score_example(self, example):
        """Score an example on multiple criteria"""
        scores = {}

        # Length score
        output_len = len(example["output"])
        min_len, max_len = self.criteria["length"]
        if min_len <= output_len <= max_len:
            scores["length"] = 1.0
        else:
            scores["length"] = max(0, 1 - abs(output_len - (min_len + max_len) / 2) / max_len)

        # Instruction quality
        inst_len = len(example["instruction"])
        inst_min, inst_max = self.criteria["instruction_length"]
        scores["instruction"] = 1.0 if inst_min <= inst_len <= inst_max else 0.5

        # Has actual content (not just generic responses)
        generic_responses = ["i don't know", "not sure", "maybe"]
        if any(gr in example["output"].lower() for gr in generic_responses):
            scores["content"] = 0.3
        else:
            scores["content"] = 1.0

        # Overall score
        overall = sum(scores.values()) / len(scores)
        return overall, scores

    def filter_by_quality(self, dataset, min_score=0.7):
        """Keep only high-quality examples"""
        filtered = []
        for example in dataset:
            score, _ = self.score_example(example)
            if score >= min_score:
                filtered.append(example)

        print(f"Kept {len(filtered)}/{len(dataset)} examples (score >= {min_score})")
        return filtered

# Usage
scorer = QualityScorer()
high_quality_data = scorer.filter_by_quality(dataset, min_score=0.75)
Human-in-the-Loop Validation
def create_validation_interface(dataset, sample_size=100):
    """Sample dataset for human review"""
    import random
    sample = random.sample(dataset, min(sample_size, len(dataset)))

    print("Review these examples. Rate 1-5:")
    print("1 = Poor, 2 = Below Average, 3 = Average, 4 = Good, 5 = Excellent\n")

    ratings = []
    for i, example in enumerate(sample):
        print(f"\n--- Example {i+1}/{len(sample)} ---")
        print(f"Instruction: {example['instruction']}")
        print(f"Output: {example['output'][:200]}...")
        rating = int(input("Rating (1-5): "))
        ratings.append(rating)

    avg_rating = sum(ratings) / len(ratings)
    print(f"\nAverage rating: {avg_rating:.2f}")

    # Identify issues
    low_rated = [sample[i] for i, r in enumerate(ratings) if r <= 2]
    print(f"Low-rated examples: {len(low_rated)}")

    return avg_rating, low_rated
Train/Val/Test Splits
Basic Split
from datasets import Dataset
import random

def split_dataset(data, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1):
    """Split dataset into train/val/test"""
    # Use a tolerance: exact float equality fails for ratios like 0.7 + 0.15 + 0.15
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-9

    total = len(data)
    train_size = int(total * train_ratio)
    val_size = int(total * val_ratio)

    # Shuffle a copy so the caller's list is untouched
    data = data[:]
    random.shuffle(data)

    # Split
    train_data = data[:train_size]
    val_data = data[train_size:train_size + val_size]
    test_data = data[train_size + val_size:]

    return {
        "train": Dataset.from_list(train_data),
        "validation": Dataset.from_list(val_data),
        "test": Dataset.from_list(test_data)
    }
Stratified Split
For imbalanced datasets:
from sklearn.model_selection import train_test_split

def stratified_split(data, category_field="category"):
    """Split while preserving category distribution (80/10/10)"""
    # Extract categories
    categories = [example[category_field] for example in data]

    # Stratified split: 80% train, 20% held out
    train_data, temp_data, train_cats, temp_cats = train_test_split(
        data, categories,
        test_size=0.2,
        stratify=categories,
        random_state=42
    )

    # Divide the held-out 20% evenly into validation and test
    val_data, test_data = train_test_split(
        temp_data,
        test_size=0.5,
        stratify=temp_cats,
        random_state=42
    )

    return {
        "train": train_data,
        "validation": val_data,
        "test": test_data
    }
Time-Based Split
For temporal data:
def time_based_split(data, timestamp_field="timestamp"):
    """Split based on timestamp to avoid data leakage"""
    # Sort by timestamp
    sorted_data = sorted(data, key=lambda x: x[timestamp_field])

    # Use oldest 80% for train, next 10% for val, newest 10% for test
    total = len(sorted_data)
    train_end = int(total * 0.8)
    val_end = int(total * 0.9)

    return {
        "train": sorted_data[:train_end],
        "validation": sorted_data[train_end:val_end],
        "test": sorted_data[val_end:]
    }
Domain-Specific Datasets
Medical Dataset
def create_medical_dataset():
    """Create medical diagnosis dataset"""
    generator = SyntheticDataGenerator("sk-...")

    # Generate different types
    datasets = []

    # Symptom analysis
    datasets.extend(generator.generate_qa_pairs(
        domain="symptom analysis and differential diagnosis",
        num_examples=200
    ))

    # Treatment planning
    datasets.extend(generator.generate_qa_pairs(
        domain="medical treatment planning",
        num_examples=150
    ))

    # Drug interactions
    datasets.extend(generator.generate_qa_pairs(
        domain="drug interactions and contraindications",
        num_examples=100
    ))

    return datasets
Legal Dataset
def create_legal_dataset():
    """Create legal document analysis dataset (generate_* helpers are illustrative placeholders)"""
    # Contract review
    contract_data = generate_contract_examples(num=150)
    # Case law analysis
    case_law_data = generate_case_law_examples(num=100)
    # Legal research
    research_data = generate_legal_research_examples(num=100)
    return contract_data + case_law_data + research_data
Code Dataset
def create_code_dataset(languages=("python", "javascript", "java")):
    """Create coding dataset (generate_* helpers are illustrative placeholders)"""
    dataset = []
    for lang in languages:
        # Code explanation
        dataset.extend(generate_code_explanation(lang, num=100))
        # Bug fixing
        dataset.extend(generate_bug_fixing(lang, num=50))
        # Code generation
        dataset.extend(generate_code_tasks(lang, num=100))
    return dataset
HuggingFace Integration
Load from Hub
from datasets import load_dataset
# Load existing dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
# Load from local
dataset = load_dataset("json", data_files="my_data.json")
# Load with streaming (for large datasets)
dataset = load_dataset("large_dataset", streaming=True)
Transform Dataset
# Map function to transform
def transform_to_alpaca(example):
    """Transform any format to Alpaca"""
    return {
        "instruction": example["question"],
        "input": "",
        "output": example["answer"]
    }

dataset = dataset.map(transform_to_alpaca)

# Filter
dataset = dataset.filter(lambda x: len(x["output"]) > 50)

# Shuffle
dataset = dataset.shuffle(seed=42)

# Select subset
dataset = dataset.select(range(1000))
Upload to Hub
from datasets import Dataset

# Create dataset
data = [...]
dataset = Dataset.from_list(data)

# Push to Hub
dataset.push_to_hub(
    "username/my-dataset",
    private=False,
    token="hf_..."
)

# With splits
from datasets import DatasetDict

splits = split_dataset(data)
dataset_dict = DatasetDict(splits)
dataset_dict.push_to_hub("username/my-dataset", token="hf_...")
Best Practices
1. Start with Quality over Quantity
# 1000 high-quality examples > 10,000 low-quality
# Focus on:
# - Clear instructions
# - Accurate outputs
# - Diverse examples
# - Proper formatting
2. Validate Everything
# Before training, validate:
# - Format correctness
# - No duplicates
# - No PII
# - Quality score > threshold
# - Category balance
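A minimal pre-flight check along these lines, assuming Alpaca-format records (the helper name is ours):

def validate_before_training(data: list) -> None:
    """Fail fast on common dataset problems before spending GPU time."""
    for i, ex in enumerate(data):
        missing = {"instruction", "output"} - ex.keys()
        assert not missing, f"Example {i} is missing fields: {missing}"
        assert ex["instruction"].strip(), f"Example {i} has an empty instruction"
        assert ex["output"].strip(), f"Example {i} has an empty output"

    # No exact duplicates
    pairs = [(ex["instruction"], ex["output"]) for ex in data]
    assert len(pairs) == len(set(pairs)), "Dataset contains exact duplicates"
    print(f"OK: {len(data)} examples passed validation")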
3. Version Your Datasets
datasets/
├── medical-v1.0.0.json # Initial
├── medical-v1.1.0.json # Added 500 examples
├── medical-v2.0.0.json # Complete regeneration
└── README.md # Dataset card
4. Document Your Data
Create a dataset card (README.md):
# Medical Q&A Dataset v1.0.0
## Overview
- Size: 1,000 examples
- Format: Alpaca
- Domain: Medical diagnosis
- License: MIT
## Data Sources
- Synthetic generation with Claude 3.5 Sonnet
- Expert validation by 3 MDs
- Quality threshold: 0.85+
## Statistics
- Avg instruction length: 25 words
- Avg output length: 150 words
- Categories: Diagnosis (40%), Treatment (30%), Prevention (30%)
## Example
...
5. Test Before Training
# Always test a sample before full training
sample = dataset.select(range(100))
# Quick train to verify:
# - Format is correct
# - Model can learn
# - No obvious issues
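One lightweight way to run that check before a full training job, assuming a HuggingFace tokenizer and the text column produced earlier (the helper name is ours), is to tokenize a sample and inspect the length distribution:

def smoke_test(dataset, tokenizer, n=100, max_seq_len=2048):
    """Tokenize a sample and report token-length stats before a full run."""
    sample = dataset.select(range(min(n, len(dataset))))
    lengths = [len(tokenizer(ex["text"]).input_ids) for ex in sample]
    too_long = sum(l > max_seq_len for l in lengths)
    print(f"min/avg/max tokens: {min(lengths)}/{sum(lengths) // len(lengths)}/{max(lengths)}")
    print(f"{too_long} examples exceed max_seq_len={max_seq_len}")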
Troubleshooting
Issue: Model Not Learning
Check:
# 1. Is the format correct?
print(dataset[0])

# 2. Is the output length reasonable?
import numpy as np
print(f"Avg length: {np.mean([len(x['output']) for x in dataset]):.0f}")

# 3. Are there duplicates?
unique_count = len(set(x['output'] for x in dataset))
print(f"Unique: {unique_count}/{len(dataset)}")

# 4. Is the data diverse?
# Check the category distribution
Issue: Poor Quality Outputs
Solutions:
# 1. Raise the quality threshold
scorer = QualityScorer()
dataset = scorer.filter_by_quality(dataset, min_score=0.85)

# 2. Add more examples
# More data often helps

# 3. Improve prompt engineering
# Better prompts -> better synthetic data

# 4. Human validation
# Review and fix low-quality examples
Issue: Imbalanced Dataset
Solutions:
# 1. Oversample the minority class (e.g., imblearn's RandomOverSampler,
#    or the plain-Python sketch below)
# 2. Undersample the majority class
# 3. Generate more minority-class examples synthetically
# 4. Use a weighted loss during training
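A minimal sketch of option 1 without the imblearn dependency, assuming each example carries a category field (the helper name is ours):

import random
from collections import Counter

def oversample_minority(data: list, category_field: str = "category") -> list:
    """Duplicate minority-class examples until every category matches the largest one."""
    counts = Counter(ex[category_field] for ex in data)
    target = max(counts.values())
    balanced = list(data)
    for category, count in counts.items():
        pool = [ex for ex in data if ex[category_field] == category]
        balanced.extend(random.choices(pool, k=target - count))
    random.shuffle(balanced)
    return balanced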
Summary
Dataset engineering workflow:
- ✓ Define format (Alpaca/ShareGPT/custom)
- ✓ Generate or collect data
- ✓ Clean and filter (remove low-quality, duplicates, PII)
- ✓ Augment if needed (paraphrase, back-translate)
- ✓ Assess quality (automated + human)
- ✓ Split (train/val/test)
- ✓ Upload to HuggingFace
- ✓ Document (dataset card)
- ✓ Version control
Remember: Quality > Quantity. 1,000 great examples beat 10,000 mediocre ones.