
---
name: using-spacy-nlp
description: Industrial-strength NLP with spaCy 3.x for text processing and custom classifier training. Use when "installing spaCy", "selecting model for nlp" (en_core_web_sm/md/lg/trf), "tokenization", "POS tagging", "named entity recognition" (NER), "dependency parsing", "training TextCategorizer models", "troubleshooting spaCy errors" (E050/E941 model errors, E927 version mismatch, memory issues), "batch processing with nlp.pipe", or "deploying nlp models to production". Includes data preparation scripts, config templates, and FastAPI serving examples.
---
# spaCy NLP

Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.
## Contents

- [Scope](#scope)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Text Processing](#text-processing)
- [Training Classifiers](#training-classifiers)
- [Troubleshooting](#troubleshooting)
- [Production Deployment](#production-deployment)
- [Scripts Reference](#scripts-reference)
- [Assets Reference](#assets-reference)
## Scope

**In scope:**

- spaCy 3.x installation and text processing
- TextCategorizer training for document classification
- Production deployment and optimization patterns

**Out of scope** (use other tools/skills):

- Training custom NER models (different workflow)
- spaCy 2.x (deprecated; incompatible with 3.x)
- Rule-based matching (EntityRuler, Matcher, PhraseMatcher)
- Custom tokenizers or language models
## Quick Start

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)
```
## Installation

### Standard Setup

```bash
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
```
### Model Selection

| Model | Size | Speed | Use Case |
|---|---|---|---|
| `en_core_web_sm` | 12 MB | Fastest | Prototyping, speed-critical |
| `en_core_web_md` | 40 MB | Fast | General use with word vectors |
| `en_core_web_lg` | 560 MB | Fast | Semantic similarity tasks |
| `en_core_web_trf` | 438 MB | Slow | Maximum accuracy (GPU) |
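As an example of what the larger models buy you, the word vectors in `md`/`lg` support similarity comparisons (a minimal sketch, assuming `en_core_web_md` has been downloaded):

```python
import spacy

# Similarity needs real word vectors, so use md or lg rather than sm
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
print(doc1.similarity(doc2))  # cosine similarity, e.g. ~0.7
```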
### Verify Installation

```python
import spacy

print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")
```
For detailed installation options (conda, GPU, transformers), see `references/installation.md`.
## Text Processing

### Basic Pipeline

```python
nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")

# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")
```
### Named Entity Recognition

```python
doc = nlp("Apple Inc. was founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # "Apple Inc." ORG, "Steve Jobs" PERSON
```
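To keep only specific entity types, filter on `ent.label_` (a small sketch):

```python
# Collect only organization mentions
orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(orgs)  # ["Apple Inc."]
```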
For entity types, filtering, and span details, see `references/basic-usage.md`.
### Batch Processing (Critical for Production)

```python
# WRONG - slow
for text in texts:
    doc = nlp(text)  # Don't do this

# CORRECT - fast
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)

# With multiprocessing
docs = list(nlp.pipe(texts, n_process=4))
```
### Disable Unused Components

```python
# Only need NER - disable the rest for 2x speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])
```
For Doc/Token/Span details, noun chunks, and similarity, see `references/basic-usage.md`.
## Training Classifiers

Train custom text classifiers with TextCategorizer.

### Workflow Overview

1. Prepare data → run `scripts/prepare_training_data.py`
2. Generate config → run `scripts/generate_config.py` or use `assets/config_textcat.cfg`
3. Validate → `python -m spacy debug data config.cfg` (catches issues before training)
4. Train → `python -m spacy train config.cfg --output ./output`
5. Evaluate → run `scripts/evaluate_model.py`
6. Use → `nlp = spacy.load("./output/model-best")`
### Data Format

Training data uses spaCy's DocBin format. Example input (JSON):

```json
[
  {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
  {"text": "Fixed null pointer exception in parser", "label": "Programming"},
  {"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]
```
Convert with the bundled script:

```bash
python scripts/prepare_training_data.py \
  --input data.json \
  --output-train train.spacy \
  --output-dev dev.spacy \
  --split 0.8
```
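Under the hood, the conversion to DocBin looks roughly like the following (a sketch, not the script's exact implementation; it assumes single-label examples and leaves the train/dev split aside):

```python
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only; no trained components needed
db = DocBin()

with open("data.json") as f:
    records = json.load(f)

labels = {r["label"] for r in records}
for record in records:
    doc = nlp.make_doc(record["text"])
    # TextCategorizer expects a score for every label on each doc
    doc.cats = {label: float(label == record["label"]) for label in labels}
    db.add(doc)

db.to_disk("train.spacy")
```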
### Training Command

```bash
# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"

# Or use the template
cp assets/config_textcat.cfg config.cfg

# Train
python -m spacy train config.cfg --output ./output

# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0
```
### Using Trained Model

```python
nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")

predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}")  # DevOps: 94.2%
```
For the detailed training guide, see `references/text-classification.md`.
## Troubleshooting

### Model Not Found (E050)

```
OSError: [E050] Can't find model 'en_core_web_sm'
```

Fix:

```bash
python -m spacy download en_core_web_sm
```

Alternative (avoids path issues):

```python
import en_core_web_sm

nlp = en_core_web_sm.load()
```
### Memory Issues

Symptoms: OOM errors, slow processing

Fixes:

```python
# 1. Disable unused components
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# 2. Process in chunks
for chunk in chunk_text(large_text, max_length=100000):
    doc = nlp(chunk)

# 3. Use memory zones (spaCy 3.8+)
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)
```
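Note that `chunk_text` above is not a spaCy API; a minimal sketch of such a helper (naive fixed-width slicing; splitting on paragraph boundaries would give cleaner chunks):

```python
def chunk_text(text: str, max_length: int = 100_000):
    """Yield successive slices of text, each at most max_length characters."""
    for start in range(0, len(text), max_length):
        yield text[start:start + max_length]
```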
### GPU Not Working

```python
import spacy

# Must call BEFORE loading the model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")

nlp = spacy.load("en_core_web_trf")  # Now loads on GPU
```
### Version Compatibility

spaCy 2.x models do not work with spaCy 3.x. Check compatibility:

```bash
python -m spacy validate
```
For more troubleshooting, see `references/troubleshooting.md`.
## Production Deployment

### Package Model

```bash
python -m spacy package ./output/model-best ./packages \
  --name my_classifier \
  --version 1.0.0

pip install ./packages/en_my_classifier-1.0.0/
```
### FastAPI Server

Use the production template:

```bash
python scripts/serve_model.py --model ./output/model-best --port 8000
```

Or customize from the template:

```python
from fastapi import FastAPI
import spacy

app = FastAPI()
nlp = spacy.load("en_my_classifier")

@app.post("/classify")
async def classify(text: str):
    with nlp.memory_zone():
        doc = nlp(text)
        return {
            "category": max(doc.cats, key=doc.cats.get),
            "scores": doc.cats,
        }
```
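A quick way to exercise the endpoint from Python (a sketch using `requests`; with the signature above, FastAPI reads `text` as a query parameter rather than a JSON body):

```python
import requests  # assumes the server above is running locally on port 8000

resp = requests.post(
    "http://localhost:8000/classify",
    params={"text": "Deploy the application to Kubernetes cluster"},
)
print(resp.json())  # e.g. {"category": "DevOps", "scores": {...}}
```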
### Performance Optimization

| Technique | Speedup | When to Use |
|---|---|---|
| Disable components | 2-3x | Don't need all annotations |
| `nlp.pipe()` | 5-10x | Processing multiple texts |
| Multiprocessing | 2-4x | CPU-bound, many cores |
| GPU | 2-5x | Transformer models |
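These techniques compose; a minimal sketch combining component disabling with batched, multi-process streaming (the `batch_size` and `n_process` values are illustrative and worth tuning):

```python
import spacy

# Keep only what's needed; disabling components compounds with batching
nlp = spacy.load("en_core_web_sm", disable=["parser", "lemmatizer"])

texts = ["First document.", "Second document."]  # your corpus here

for doc in nlp.pipe(texts, batch_size=256, n_process=2):
    print([(ent.text, ent.label_) for ent in doc.ents])
```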
For evaluation metrics and hyperparameter tuning, see `references/production.md`.
## Scripts Reference

| Script | Purpose | Usage |
|---|---|---|
| `prepare_training_data.py` | Convert JSON to DocBin | `python scripts/prepare_training_data.py --input data.json` |
| `generate_config.py` | Create training config | `python scripts/generate_config.py --categories "A,B,C"` |
| `evaluate_model.py` | Detailed metrics | `python scripts/evaluate_model.py --model ./output/model-best` |
| `serve_model.py` | FastAPI server | `python scripts/serve_model.py --model ./model --port 8000` |
## Assets Reference

| Asset | Purpose | Usage |
|---|---|---|
| `config_textcat.cfg` | Base training config | Copy and customize for your labels |
| `training_data_template.json` | Data format example | Reference for preparing your data |


