
---
name: using-spacy-nlp
description: Industrial-strength NLP with spaCy 3.x for text processing and custom classifier training. Use when "installing spaCy", "selecting model for nlp" (en_core_web_sm/md/lg/trf), "tokenization", "POS tagging", "named entity recognition" (NER), "dependency parsing", "training TextCategorizer models", "troubleshooting spaCy errors" (E050/E941 model errors, E927 version mismatch, memory issues), "batch processing with nlp.pipe", or "deploying nlp models to production". Includes data preparation scripts, config templates, and FastAPI serving examples.
---
# spaCy NLP

Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.
## Contents

- [Scope](#scope)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Text Processing](#text-processing)
- [Training Classifiers](#training-classifiers)
- [Troubleshooting](#troubleshooting)
- [Production Deployment](#production-deployment)
- [Scripts Reference](#scripts-reference)
- [Assets Reference](#assets-reference)
## Scope

**In scope:**

- spaCy 3.x installation and text processing
- TextCategorizer training for document classification
- Production deployment and optimization patterns

**Out of scope** (use other tools/skills):

- Training custom NER models (different workflow)
- spaCy 2.x (deprecated; incompatible with 3.x)
- Rule-based matching (EntityRuler, Matcher, PhraseMatcher)
- Custom tokenizers or language models
## Quick Start

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)
```
## Installation

### Standard Setup

```bash
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
```
### Model Selection

| Model | Size | Speed | Use Case |
|---|---|---|---|
| `en_core_web_sm` | 12 MB | Fastest | Prototyping, speed-critical |
| `en_core_web_md` | 40 MB | Fast | General use with word vectors |
| `en_core_web_lg` | 560 MB | Fast | Semantic similarity tasks |
| `en_core_web_trf` | 438 MB | Slow | Maximum accuracy (GPU) |
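As an example of what the larger models buy you, the word vectors in `md`/`lg` support similarity comparisons (a minimal sketch, assuming `en_core_web_md` has been downloaded):

```python
import spacy

# Similarity needs real word vectors, so use md or lg rather than sm
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
print(doc1.similarity(doc2))  # cosine similarity, e.g. ~0.7
```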
### Verify Installation

```python
import spacy

print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")
```
For detailed installation options (conda, GPU, transformers), see `references/installation.md`.
## Text Processing

### Basic Pipeline

```python
nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")

# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")
```
### Named Entity Recognition

```python
doc = nlp("Apple Inc. was founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # "Apple Inc." ORG, "Steve Jobs" PERSON
```
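To keep only specific entity types, filter on `ent.label_` (a small sketch):

```python
# Collect only organization mentions
orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(orgs)  # ["Apple Inc."]
```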
For entity types, filtering, and span details, see `references/basic-usage.md`.
### Batch Processing (Critical for Production)

```python
# WRONG - slow
for text in texts:
    doc = nlp(text)  # Don't do this

# CORRECT - fast
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)

# With multiprocessing
docs = list(nlp.pipe(texts, n_process=4))
```
### Disable Unused Components

```python
# Only need NER - disable the rest for 2x speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])
```
For Doc/Token/Span details, noun chunks, and similarity, see `references/basic-usage.md`.
## Training Classifiers

Train custom text classifiers with TextCategorizer.

### Workflow Overview

1. Prepare data → run `scripts/prepare_training_data.py`
2. Generate config → run `scripts/generate_config.py` or use `assets/config_textcat.cfg`
3. Validate → `python -m spacy debug data config.cfg` (catches issues before training)
4. Train → `python -m spacy train config.cfg --output ./output`
5. Evaluate → run `scripts/evaluate_model.py`
6. Use → `nlp = spacy.load("./output/model-best")`
### Data Format

Training data uses spaCy's DocBin format. Example input (JSON):

```json
[
  {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
  {"text": "Fixed null pointer exception in parser", "label": "Programming"},
  {"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]
```
Convert with the bundled script:

```bash
python scripts/prepare_training_data.py \
  --input data.json \
  --output-train train.spacy \
  --output-dev dev.spacy \
  --split 0.8
```
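Under the hood, the conversion to DocBin looks roughly like the following (a sketch, not the script's exact implementation; it assumes single-label examples and leaves the train/dev split aside):

```python
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only; no trained components needed
db = DocBin()

with open("data.json") as f:
    records = json.load(f)

labels = {r["label"] for r in records}
for record in records:
    doc = nlp.make_doc(record["text"])
    # TextCategorizer expects a score for every label on each doc
    doc.cats = {label: float(label == record["label"]) for label in labels}
    db.add(doc)

db.to_disk("train.spacy")
```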
### Training Command

```bash
# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"

# Or use the template
cp assets/config_textcat.cfg config.cfg

# Train
python -m spacy train config.cfg --output ./output

# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0
```
### Using Trained Model

```python
nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")

predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}")  # DevOps: 94.2%
```
For the detailed training guide, see `references/text-classification.md`.
## Troubleshooting

### Model Not Found (E050)

```
OSError: [E050] Can't find model 'en_core_web_sm'
```

Fix:

```bash
python -m spacy download en_core_web_sm
```

Alternative (avoids path issues):

```python
import en_core_web_sm

nlp = en_core_web_sm.load()
```
### Memory Issues

Symptoms: OOM errors, slow processing

Fixes:

```python
# 1. Disable unused components
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# 2. Process in chunks
for chunk in chunk_text(large_text, max_length=100000):
    doc = nlp(chunk)

# 3. Use memory zones (spaCy 3.8+)
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)
```
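Note that `chunk_text` above is not a spaCy API; a minimal sketch of such a helper (naive fixed-width slicing; splitting on paragraph boundaries would give cleaner chunks):

```python
def chunk_text(text: str, max_length: int = 100_000):
    """Yield successive slices of text, each at most max_length characters."""
    for start in range(0, len(text), max_length):
        yield text[start:start + max_length]
```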
### GPU Not Working

```python
import spacy

# Must call BEFORE loading the model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")

nlp = spacy.load("en_core_web_trf")  # Now loads on GPU
```
### Version Compatibility

spaCy 2.x models do not work with spaCy 3.x. Check compatibility:

```bash
python -m spacy validate
```
For more troubleshooting, see `references/troubleshooting.md`.
## Production Deployment

### Package Model

```bash
python -m spacy package ./output/model-best ./packages \
  --name my_classifier \
  --version 1.0.0

pip install ./packages/en_my_classifier-1.0.0/
```
### FastAPI Server

Use the production template:

```bash
python scripts/serve_model.py --model ./output/model-best --port 8000
```

Or customize from the template:

```python
from fastapi import FastAPI
import spacy

app = FastAPI()
nlp = spacy.load("en_my_classifier")

@app.post("/classify")
async def classify(text: str):
    with nlp.memory_zone():
        doc = nlp(text)
        return {
            "category": max(doc.cats, key=doc.cats.get),
            "scores": doc.cats,
        }
```
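A quick way to exercise the endpoint from Python (a sketch using `requests`; with the signature above, FastAPI reads `text` as a query parameter rather than a JSON body):

```python
import requests  # assumes the server above is running locally on port 8000

resp = requests.post(
    "http://localhost:8000/classify",
    params={"text": "Deploy the application to Kubernetes cluster"},
)
print(resp.json())  # e.g. {"category": "DevOps", "scores": {...}}
```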
### Performance Optimization

| Technique | Speedup | When to Use |
|---|---|---|
| Disable components | 2-3x | Don't need all annotations |
| `nlp.pipe()` | 5-10x | Processing multiple texts |
| Multiprocessing | 2-4x | CPU-bound, many cores |
| GPU | 2-5x | Transformer models |
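These techniques compose; a minimal sketch combining component disabling with batched, multi-process streaming (the `batch_size` and `n_process` values are illustrative and worth tuning):

```python
import spacy

# Keep only what's needed; disabling components compounds with batching
nlp = spacy.load("en_core_web_sm", disable=["parser", "lemmatizer"])

texts = ["First document.", "Second document."]  # your corpus here

for doc in nlp.pipe(texts, batch_size=256, n_process=2):
    print([(ent.text, ent.label_) for ent in doc.ents])
```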
For evaluation metrics and hyperparameter tuning, see `references/production.md`.
## Scripts Reference

| Script | Purpose | Usage |
|---|---|---|
| `prepare_training_data.py` | Convert JSON to DocBin | `python scripts/prepare_training_data.py --input data.json` |
| `generate_config.py` | Create training config | `python scripts/generate_config.py --categories "A,B,C"` |
| `evaluate_model.py` | Detailed metrics | `python scripts/evaluate_model.py --model ./output/model-best` |
| `serve_model.py` | FastAPI server | `python scripts/serve_model.py --model ./model --port 8000` |
## Assets Reference

| Asset | Purpose | Usage |
|---|---|---|
| `config_textcat.cfg` | Base training config | Copy and customize for your labels |
| `training_data_template.json` | Data format example | Reference for preparing your data |


