Back to list
SpillwaveSolutions

mastering-postgresql

by SpillwaveSolutions

PostgreSQL development for Python with full-text search, vector similarity, JSONB indexing, and production cloud deployment

0🍴 0📅 Jan 7, 2026

SKILL.md


name: mastering-postgresql description: PostgreSQL development for Python with full-text search (tsvector, tsquery, BM25 via pg_search), vector similarity (pgvector with HNSW/IVFFlat), JSONB and array indexing, and production deployment. Use when creating search features, storing AI embeddings, querying vector similarity, optimizing PostgreSQL indexes, or deploying to AWS RDS/Aurora, GCP Cloud SQL/AlloyDB, or Azure. Covers psycopg2, psycopg3, asyncpg, SQLAlchemy integration, Docker development setup, and index selection strategies. Triggers are Use "PostgreSQL search", "pgvector", "BM25 postgres", "JSONB index", "psycopg", "asyncpg", "PostgreSQL Docker", "AlloyDB vector". Does NOT cover - DBA administration (backup, replication, users), MySQL/MongoDB/Redis, schema design theory, stored procedures. allowed-tools:

  • Read
  • Bash
  • Write

PostgreSQL Python Development

Build search, vector similarity, and data-intensive applications with PostgreSQL and Python.

Quick Reference

TaskGo To
Docker/local setupsetup-and-docker.md
Full-text search & BM25search-fulltext.md
pgvector & JSONB indexingsearch-vectors-json.md
Python drivers & poolspython-drivers.md
Python query patternspython-queries.md
AWS RDS/Auroracloud-aws.md
GCP Cloud SQL/AlloyDBcloud-gcp.md
Azure Flexible Servercloud-azure.md
Neon & Supabasecloud-serverless.md
Cloud common (pooling, config)cloud-common.md

When NOT to Use This Skill

  • DBA tasks: Backup strategies, replication setup, user management, security hardening
  • Other databases: MySQL, MongoDB, Redis, Elasticsearch-specific queries
  • Schema design: Normalization theory, data modeling patterns
  • Stored procedures: PL/pgSQL function development
  • Application frameworks: Django ORM specifics, FastAPI integration details

Quick Start Checklist

Copy this checklist to track progress:

Setup Progress:
- [ ] Docker environment running (docker-compose up -d)
- [ ] Connected to database (psql or Python)
- [ ] Extensions created (pgvector, pg_trgm)
- [ ] Table created with search_vector and embedding columns
- [ ] GIN index on search_vector created
- [ ] HNSW index on embedding created
- [ ] Test full-text query returns results
- [ ] Test vector query returns results

Quick Start: Search + Vectors in 5 Minutes

1. Start PostgreSQL with pgvector

# docker-compose.yml
services:
  postgres:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_PASSWORD: devpass
    ports: ["5432:5432"]
    volumes: [pgdata:/var/lib/postgresql/data]
volumes:
  pgdata:
docker-compose up -d

# Verify container is running:
docker-compose ps
# Expected: postgres service with status "Up"

2. Enable Extensions

CREATE EXTENSION vector;      -- pgvector for embeddings
CREATE EXTENSION pg_trgm;     -- Trigram for fuzzy search

-- Verify extensions installed:
SELECT extname, extversion FROM pg_extension 
WHERE extname IN ('vector', 'pg_trgm');
-- Expected: 2 rows with version numbers

3. Create Searchable Table with Vectors

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT,
    embedding vector(1536),
    search_vector tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(content, '')), 'B')
    ) STORED
);

-- Create indexes
CREATE INDEX idx_docs_search ON documents USING GIN (search_vector);
CREATE INDEX idx_docs_embedding ON documents USING hnsw (embedding vector_cosine_ops);

-- Verify indexes created:
SELECT indexname FROM pg_indexes WHERE tablename = 'documents';
-- Expected: idx_docs_search, idx_docs_embedding, documents_pkey

4. Query from Python

import asyncpg

async def search(pool, query: str, embedding: list[float], limit: int = 10):
    return await pool.fetch("""
        SELECT id, title,
               ts_rank(search_vector, websearch_to_tsquery('english', $1)) AS text_rank,
               embedding <=> $2::vector AS vector_dist
        FROM documents
        WHERE search_vector @@ websearch_to_tsquery('english', $1)
        ORDER BY vector_dist
        LIMIT $3
    """, query, embedding, limit)

# Verify connection works:
# pool = await asyncpg.create_pool('postgresql://postgres:devpass@localhost/postgres')
# rows = await pool.fetch("SELECT 1 AS test")
# assert rows[0]['test'] == 1

Decision Trees

Which Search Approach?

Need search? ─┬─► Exact keyword match ──────► B-tree index + WHERE =
              │
              ├─► Full-text search (FTS) ───► tsvector + GIN + ts_rank
              │
              ├─► Relevance like Google ────► pg_search BM25 (ParadeDB)
              │
              ├─► Typo tolerance ───────────► pg_trgm + similarity()
              │
              ├─► Semantic/AI search ───────► pgvector + embeddings
              │
              └─► Hybrid (keywords + semantic) ► Combine tsvector + pgvector

Which Vector Index?

Vector index? ─┬─► Dataset < 100K rows ────► No index (exact search OK)
               │
               ├─► Need best recall ────────► HNSW (slower build, fast query)
               │
               ├─► Fast index build ────────► IVFFlat (needs data first)
               │
               ├─► On AlloyDB ──────────────► ScaNN (Google optimized)
               │
               ├─► On Azure ────────────────► pg_diskann (32x less memory)
               │
               ├─► Billions of vectors ─────► VectorChord vchordrq (self-host)
               │
               └─► Dimensions > 2000 ───────► halfvec or binary quantization

Which Python Library?

Python lib? ──┬─► Sync, simple, stable ─────► psycopg2
              │
              ├─► Async + modern features ──► psycopg3
              │
              ├─► Max async performance ────► asyncpg
              │
              └─► ORM needed ───────────────► SQLAlchemy + asyncpg/psycopg

Which Index Type for Column?

Column type? ─┬─► Scalar (int, text, timestamp) ─► B-tree (default)
              │
              ├─► JSONB ────────────────────────┬► GIN (general queries)
              │                                 └► GIN jsonb_path_ops (@> only)
              │
              ├─► Array ────────────────────────► GIN
              │
              ├─► tsvector ─────────────────────► GIN (or GiST for updates)
              │
              ├─► vector ───────────────────────► HNSW or IVFFlat
              │
              └─► Range / Geometric ────────────► GiST

Common Patterns

For implementation details, see the reference files:

Index Tuning Quick Reference

HNSW Parameters

ParameterDefaultGuidance
m16Higher = better recall, more memory. 12-48 typical
ef_construction64Higher = better index quality, slower build. 64-200
hnsw.ef_search40Set at query time. Higher = better recall, slower
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=100);
SET hnsw.ef_search = 100;  -- Before querying

-- Verify setting applied:
SHOW hnsw.ef_search;

IVFFlat Parameters

ParameterGuidance
listssqrt(rows) for <1M rows; rows/1000 for >1M
ivfflat.probesStart at sqrt(lists), increase for recall
CREATE INDEX ON docs USING ivfflat (embedding vector_l2_ops) WITH (lists=100);
SET ivfflat.probes = 10;

Troubleshooting Quick Reference

SymptomLikely CauseFix
Seq Scan on indexed columnStats outdatedANALYZE tablename;
Vector search slowNo index or low ef_searchCreate HNSW index, increase ef_search
Poor vector recallIVFFlat probes too lowIncrease ivfflat.probes
FTS not matchingWrong language configCheck to_tsvector('english', ...)
Index not usedQuery doesn't match opsVerify operator class matches query
Connection timeoutPool exhaustedIncrease pool size or fix leaks
Extension not foundNot installedCREATE EXTENSION name;
HNSW build OOMInsufficient memoryIncrease maintenance_work_mem
Filtered queries return few resultsFiltering after index scanEnable hnsw.iterative_scan
Connection drops in productionNo health checkingUse check=ConnectionPool.check_connection
Scaling past 100M vectorspgvector limitsConsider VectorChord vchordrq

For detailed troubleshooting, see search-vectors-json.md.

Script Usage

pip install -r scripts/requirements.txt  # Install dependencies first
ScriptPurposeWhen to Use
setup_extensions.pyInstall pgvector, pg_trgm extensionsInitial database setup
create_search_tables.pyCreate tables with search_vector and embedding columnsAfter extensions installed
health_check.pyCheck index health, bloat, and performanceDiagnosing slow queries
vector_search.py --demoDemonstrate vector similarity queriesLearning pgvector patterns
bulk_insert.pyHigh-performance data loadingImporting large datasets
fts_examples.pyFull-text search query examplesLearning FTS syntax
connection_pool.pyConnection pooling patternsProduction deployments

Example:

python scripts/setup_extensions.py --host localhost --dbname mydb
python scripts/create_search_tables.py --host localhost --dbname mydb
python scripts/health_check.py --host localhost --dbname mydb

Cloud Quick Reference

ProviderpgvectorBM25 SupportConnection Pooling
AWS RDS/Aurora0.8.0pg_textsearch (preview)RDS Proxy
GCP Cloud SQL0.8.0pg_textsearch (preview)Cloud SQL Proxy
GCP AlloyDB0.8.0 + ScaNNpg_textsearch (preview)Built-in
Azure Flexible0.8.0 + pg_diskannpg_textsearch (preview)Built-in PgBouncer
Neonpg_searchBuilt-in
Supabasepg_searchBuilt-in

Serverless options: Neon (scale-to-zero, instant branching) and Supabase (BaaS with auth/real-time) are ideal for dev/test and startups. See cloud-serverless.md.

BM25 Options:

  • pg_search (ParadeDB): Production-ready, self-host or ParadeDB managed service
  • pg_textsearch (TigerData): Preview status, available on managed PostgreSQL services

See provider-specific files for setup commands: AWS | GCP | Azure

Reference Files

Load these for detailed implementation guidance:

ReferenceLoad When
setup-and-docker.mdDocker setup, extension installation, postgresql.conf tuning
search-fulltext.mdFull-text search (FTS), BM25 setup, trigram fuzzy search
search-vectors-json.mdpgvector tuning, JSONB/array indexing, maintenance
python-drivers.mdpsycopg2/psycopg3/asyncpg, connection pools, SQLAlchemy
python-queries.mdBulk inserts, FTS queries, vector queries, JSONB operations
cloud-aws.mdAWS RDS/Aurora setup, RDS Proxy
cloud-gcp.mdGCP Cloud SQL/AlloyDB, ScaNN indexes
cloud-azure.mdAzure Flexible Server, pg_diskann
cloud-serverless.mdNeon, Supabase (scale-to-zero, branching)
cloud-common.mdExtension matrix, pooling, production config, costs

Score

Total Score

65/100

Based on repository quality metrics

SKILL.md

SKILL.mdファイルが含まれている

+20
LICENSE

ライセンスが設定されている

0/10
説明文

100文字以上の説明がある

+10
人気

GitHub Stars 100以上

0/15
最近の活動

3ヶ月以内に更新

+5
フォーク

10回以上フォークされている

0/5
Issue管理

オープンIssueが50未満

+5
言語

プログラミング言語が設定されている

+5
タグ

1つ以上のタグが設定されている

+5

Reviews

💬

Reviews coming soon