
---
name: testing
description: Learn how to test code without mocking. Use when learning testing approach or before writing tests.
allowed-tools: Read, Bash, Glob, Grep, Write, Edit
---
Test Strategy (Foundational Skill)
You are the foundational test strategy authority. All other skills—architect, coder, reviewer—MUST consult you before making decisions.
Fundamentals
In this project, we follow the Behavior-Driven Development methodology.
Tests are not everything—but without tests, everything is nothing.
The One Question That Matters
When a user runs your CLI, visits your website, or opens your app—will it ACTUALLY WORK?
Not "did the tests pass?" Not "is coverage high?" Will. It. Actually. Work.
Everything in this skill flows from that question. Tests exist to give you justified confidence that the answer is YES.
The Confidence Pyramid
```
┌─────────────────────┐
│       LEVEL 3       │  "Does it work in the real world?"
│    System / E2E     │  Real credentials, real services
│                     │  Full user workflows
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│       LEVEL 2       │  "Does our code work with real infrastructure?"
│     Integration     │  Real binaries, real databases
│                     │  Test harnesses required
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│       LEVEL 1       │  "Is the logic of our functions correct?"
│     Unit / Pure     │  No external dependencies
│                     │  Dependency injection, temp dirs, test runner only
└─────────────────────┘
```
Build from the bottom up. Each level answers a question the previous level cannot.
🚨 The Cardinal Rule: No Mocking
Mocking is a confession that your code is poorly designed—or that you're testing at the wrong level.
If you need to mock something:
- Redesign to use dependency injection, OR
- Don't test that interaction at this level—push it to Level 2 or 3
Mocking gives you a test that passes while your production code fails. This is worse than no test at all.
🚨 Progress Tests vs Regression Tests
CRITICAL INVARIANT: The production test suite (`test/` or `tests/`) MUST ALWAYS PASS.
| Location | Name | May Fail? | Purpose |
|---|---|---|---|
| `specs/.../tests/` | Progress tests | YES | TDD red-green during development |
| `test/` or `tests/` | Regression tests | NO | Protect working functionality |
The Rule: Never write failing tests directly in the regression suite. Write progress tests first, graduate them when complete.
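For illustration, a layout might look like this (the spec directory name is hypothetical; only the split between `specs/` and `tests/` matters):

```
specs/
└── 2024-user-auth/
    └── tests/
        └── test_password_reset.py   # Progress test: may fail while the feature is built
tests/
└── test_login.py                    # Regression test: must always pass
```

Once a progress test passes reliably, move it into `tests/` (for example with `git mv`) so it becomes part of the regression suite and is protected from then on.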
Level 1: Unit / Pure Logic
The Question This Level Answers
"Is our logic correct, independent of any external system?"
If a user's request fails, Level 1 tests help you instantly rule out (or identify) bugs in your core logic.
What You Can Use
| Allowed | Examples | Why It's OK |
|---|---|---|
| Test runner | pytest, vitest, jest, go test | Part of dev environment |
| Language primitives | temp files, env vars, in-memory structures | Part of runtime |
| Standard dev tools | git, node, npm, npx, curl, python, cat, grep, sed, awk | Available in CI without setup |
| Dependency injection | Pass interfaces, not implementations | Enables isolation |
| Factories/builders | Generate test data programmatically | Reproducible tests |
Standard dev tools are those available in CI environments without installation (git, node, npm, npx, curl, python, cat, grep, sed, awk). Project-specific tools (make, pip install, hugo, custom binaries) are Level 2.
CRITICAL FILESYSTEM RULE:
All Level 1 tests MUST use OS-provided temporary directories exclusively (os.tmpdir(), tempfile.mkdtemp(), /tmp/, etc.). Never write outside temporary directories. This ensures tests are isolated, reentrant, and don't pollute the user's filesystem. Fast execution is possible thanks to modern SSDs.
What You Cannot Use
| Forbidden | Why |
|---|---|
| Real databases | That's Level 2 |
| Real HTTP calls | That's Level 2 or 3 |
| Project-specific binaries (ffmpeg, hugo) | That's Level 2 |
| Project-specific build tools (make, Gradle) | That's Level 2 |
| Mocks of external systems | Never. Use DI and don't test the interaction here |
The Key Insight
If you can't test something without mocking an external system, you're at the wrong level.
Don't mock the database—design your code so the business logic doesn't know about the database. Then test the business logic here, and test the database integration at Level 2.
Pattern: Dependency Injection
```python
## ❌ BAD: Hardcoded dependency, requires mocking to test
class OrderProcessor:
    def process(self, order):
        db = PostgresDatabase()  # Hardcoded!
        db.save(order)
        EmailService().send(order.customer, "Order confirmed")


## ✅ GOOD: Injected dependencies, testable without mocks
class OrderProcessor:
    def __init__(self, repository, notifier):
        self.repository = repository
        self.notifier = notifier

    def process(self, order):
        self.repository.save(order)
        self.notifier.notify(order.customer, "Order confirmed")


## Level 1 test: Use simple in-memory implementations
def test_order_processing_saves_and_notifies():
    # These are NOT mocks—they're real implementations with test-friendly behavior
    saved_orders = []
    notifications = []

    class InMemoryRepo:
        def save(self, order):
            saved_orders.append(order)

    class InMemoryNotifier:
        def notify(self, to, msg):
            notifications.append((to, msg))

    processor = OrderProcessor(InMemoryRepo(), InMemoryNotifier())
    processor.process(Order(customer="alice@test.com", items=["book"]))

    assert len(saved_orders) == 1
    assert notifications == [("alice@test.com", "Order confirmed")]
```
Pattern: Pure Function Testing
```python
## Pure functions are Level 1's sweet spot
def calculate_shipping(weight_kg: float, distance_km: float, express: bool) -> float:
    base = weight_kg * 0.5 + distance_km * 0.01
    return base * 1.5 if express else base


def test_shipping_calculation():
    assert calculate_shipping(10, 100, express=False) == 6.0
    assert calculate_shipping(10, 100, express=True) == 9.0
```
Pattern: Temporary Directories
```python
import tempfile
from pathlib import Path


def test_config_file_generation():
    with tempfile.TemporaryDirectory() as tmpdir:
        config_path = Path(tmpdir) / "config.yaml"
        generate_config(config_path, {"debug": True})

        content = config_path.read_text()
        assert "debug: true" in content
```
What Level 1 Tests Prove
✅ Your business logic handles edge cases correctly
✅ Your parsing/validation works
✅ Your algorithms produce correct results
✅ Your error handling works as expected
What Level 1 Tests Cannot Prove
❌ That PostgreSQL accepts your queries
❌ That the HTTP API returns what you expect
❌ That the CLI binary does what the docs say
❌ That the real system works end-to-end
This is why Level 2 exists.
Level 2: Integration
The Question This Level Answers
"Does our code correctly interact with real external dependencies?"
Level 1 proved your logic is correct. Level 2 proves your code actually works with the real databases, binaries, and services it depends on.
Integration testing covers two scenarios:
1. Project-specific tools:
   - CLI tools NOT in standard dev environment: Hugo, Caddy, Claude Code, FFmpeg, etc.
   - Project-specific build tools: Make, Gradle, Maven, etc.
   - Local execution only (no network required)
2. Virtualized environments:
   - Docker containers and containerized test services
   - Virtual machines or sandboxed environments
   - Creates infrastructure dependencies beyond standard developer setup
The Critical Requirement: Test Harnesses
Before writing ANY Level 2 test, you must identify or build the test harness for each external dependency.
What Is a Test Harness?
A test harness is the infrastructure that lets you run tests against a real dependency in a controlled, repeatable way.
| Dependency Type | Harness Examples |
|---|---|
| Database | Docker container with test schema, or test database with cleanup |
| HTTP API | Local mock server, or sandbox/staging environment |
| CLI binary | Installed binary with known version, fixture files |
| File system | Temp directories with fixture data |
| Message queue | Docker container or embedded instance |
🚨 THE RULE: If You Don't Know the Harness, STOP
If you cannot describe the test harness for a dependency, you MUST ask the user before proceeding.
Do not guess. Do not assume. Ask:
I need to write integration tests for [dependency].
To proceed, I need to know:
1. What test harness exists or should I build?
2. How do I start/stop/reset it?
3. Where are fixture files or seed data?
4. What environment variables configure it?
Please provide this information or point me to existing test infrastructure.
Pattern: Document Your Harnesses
Every project should have a test/harnesses/ directory or documentation:
```
test/
├── harnesses/
│   ├── README.md          # Overview of all harnesses
│   ├── postgres.py        # Start/stop/reset Postgres container
│   ├── redis.py           # Start/stop/reset Redis container
│   └── hugo.py            # Verify Hugo binary, create fixture sites
├── fixtures/
│   ├── sample-site/       # Fixture data for Hugo tests
│   └── seed-data.sql      # Seed data for Postgres tests
└── integration/
    ├── test_database.py
    └── test_hugo_build.py
```
Pattern: Real Database Testing
```python
import pytest
from harnesses.postgres import PostgresHarness


@pytest.fixture(scope="module")
def database():
    """Harness: Docker Postgres with schema applied"""
    harness = PostgresHarness()
    harness.start()
    harness.apply_schema("schema.sql")
    yield harness.connection_string
    harness.stop()


def test_user_repository_saves_and_retrieves(database):
    repo = UserRepository(database)
    user = User(email="test@example.com", name="Test User")
    repo.save(user)

    retrieved = repo.find_by_email("test@example.com")
    assert retrieved is not None
    assert retrieved.name == "Test User"
```
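The `PostgresHarness` above is a harness you own, not a library. Here is a minimal sketch of what `test/harnesses/postgres.py` might look like, assuming the Docker CLI is available; the image, port, and password are illustrative defaults, not part of this skill:

```python
import subprocess
import time
from pathlib import Path


class PostgresHarness:
    """Start/stop/reset a throwaway Postgres container for Level 2 tests."""

    def __init__(self, port: int = 5433, password: str = "test"):
        self.port = port
        self.password = password
        self.container_id = None

    @property
    def connection_string(self) -> str:
        return f"postgresql://postgres:{self.password}@localhost:{self.port}/postgres"

    def start(self) -> None:
        # Launch a disposable container and wait until Postgres accepts connections.
        self.container_id = subprocess.run(
            ["docker", "run", "-d", "--rm",
             "-e", f"POSTGRES_PASSWORD={self.password}",
             "-p", f"{self.port}:5432", "postgres:16"],
            check=True, capture_output=True, text=True,
        ).stdout.strip()
        for _ in range(60):
            ready = subprocess.run(
                ["docker", "exec", self.container_id, "pg_isready", "-U", "postgres"],
                capture_output=True,
            )
            if ready.returncode == 0:
                return
            time.sleep(0.5)
        raise RuntimeError("Postgres container did not become ready")

    def apply_schema(self, schema_file: str) -> None:
        # Feed the schema file to psql inside the container via stdin.
        subprocess.run(
            ["docker", "exec", "-i", self.container_id, "psql", "-U", "postgres"],
            input=Path(schema_file).read_text(), text=True, check=True,
        )

    def stop(self) -> None:
        if self.container_id:
            subprocess.run(["docker", "rm", "-f", self.container_id],
                           check=True, capture_output=True)
```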
Pattern: Real Binary Testing
```python
import subprocess
import tempfile
from pathlib import Path

import pytest


@pytest.fixture
def hugo_site():
    """Harness: Temp directory with minimal Hugo site structure"""
    with tempfile.TemporaryDirectory() as tmpdir:
        site_dir = Path(tmpdir)
        # Create minimal Hugo site
        (site_dir / "config.toml").write_text('title = "Test Site"')
        (site_dir / "content").mkdir()
        (site_dir / "content" / "_index.md").write_text("# Home")
        yield site_dir


def test_hugo_builds_site(hugo_site):
    """Level 2: Verify Hugo binary actually builds our site structure"""
    result = subprocess.run(
        ["hugo", "--source", str(hugo_site)], capture_output=True, text=True
    )
    assert result.returncode == 0
    assert (hugo_site / "public" / "index.html").exists()
```
Pattern: Real HTTP API Testing
```python
import pytest
import httpx
from harnesses.api_server import LocalAPIServer


@pytest.fixture(scope="module")
def api_server():
    """Harness: Local instance of the API under test"""
    server = LocalAPIServer(port=8089)
    server.start()
    server.wait_until_ready()
    yield server.base_url
    server.stop()


def test_api_creates_resource(api_server):
    response = httpx.post(f"{api_server}/resources", json={"name": "test-resource"})
    assert response.status_code == 201
    assert response.json()["name"] == "test-resource"
```
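`LocalAPIServer` is likewise project-owned. A sketch of `test/harnesses/api_server.py`, assuming the API can be launched as a subprocess; the `uvicorn myapp:app` command and the `/health` endpoint are assumptions to adapt to your project:

```python
import subprocess
import time

import httpx


class LocalAPIServer:
    def __init__(self, port: int = 8089):
        self.port = port
        self.base_url = f"http://localhost:{port}"
        self._process = None

    def start(self) -> None:
        # Launch the API under test as a child process (command is project-specific).
        self._process = subprocess.Popen(
            ["uvicorn", "myapp:app", "--port", str(self.port)]
        )

    def wait_until_ready(self, timeout: float = 10.0) -> None:
        # Poll until the server answers or the timeout expires.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                httpx.get(f"{self.base_url}/health", timeout=1.0)
                return
            except httpx.TransportError:
                time.sleep(0.2)
        raise RuntimeError("API server did not become ready")

    def stop(self) -> None:
        if self._process:
            self._process.terminate()
            self._process.wait(timeout=10)
```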
What Level 2 Tests Prove
✅ PostgreSQL accepts your queries and returns expected results
✅ Hugo builds your site structure correctly
✅ The HTTP client handles real responses correctly
✅ File operations work on real file systems
What Level 2 Tests Cannot Prove
❌ That production credentials work
❌ That third-party APIs behave the same in prod
❌ That the full user workflow succeeds
❌ That performance is acceptable under load
This is why Level 3 exists.
Level 3: System / End-to-End
The Question This Level Answers
"Does the complete system work the way users will actually use it?"
Level 2 proved your integrations work locally. Level 3 proves the entire system works with real credentials against real (test) environments.
The Critical Requirement: Credentials & Test Accounts
Before writing ANY Level 3 test, you must know where the credentials are and what test accounts exist.
🚨 THE RULE: No Credentials, No Level 3 Tests
If you do not have explicit information about test credentials and accounts, you MUST ask the user before proceeding.
Do not guess. Do not use production credentials. Ask:
I need to write end-to-end tests that use [external service].
To proceed, I need to know:
1. Where are the test credentials stored? (env vars, secrets manager, etc.)
2. What test accounts/environments exist?
3. Are there rate limits or quotas on the test account?
4. How do I reset test data between runs?
5. Is there a staging/sandbox environment, or do tests run against production?
Please provide this information before I proceed with Level 3 tests.
Pattern: Credential Management
```python
import os
import pytest

## Document where credentials come from
CREDENTIALS_SOURCE = """
Level 3 tests require these environment variables:
- STRIPE_TEST_API_KEY: From 1Password vault "Engineering/Test Credentials"
- SENDGRID_TEST_API_KEY: From .env.test (not committed)
- TEST_USER_EMAIL: test-automation@example.com
- TEST_USER_PASSWORD: In 1Password vault "Engineering/Test Credentials"
"""


@pytest.fixture(scope="session")
def stripe_client():
    api_key = os.environ.get("STRIPE_TEST_API_KEY")
    if not api_key:
        pytest.skip(f"STRIPE_TEST_API_KEY not set.\n{CREDENTIALS_SOURCE}")
    return StripeClient(api_key=api_key)


@pytest.fixture(scope="session")
def authenticated_user(browser):
    email = os.environ.get("TEST_USER_EMAIL")
    password = os.environ.get("TEST_USER_PASSWORD")
    if not email or not password:
        pytest.skip(f"Test user credentials not set.\n{CREDENTIALS_SOURCE}")

    # Log in once per session
    browser.goto("/login")
    browser.fill("[name=email]", email)
    browser.fill("[name=password]", password)
    browser.click("[type=submit]")
    browser.wait_for_url("/dashboard")
    yield browser
```
Pattern: Full User Workflow
```python
def test_complete_purchase_workflow(authenticated_user, stripe_client):
    """
    Level 3: Complete user workflow with real services

    Prerequisites:
    - TEST_USER has a saved payment method in Stripe test mode
    - Product "test-product" exists in test catalog
    """
    browser = authenticated_user

    # User browses to product
    browser.goto("/products/test-product")

    # User adds to cart
    browser.click("[data-testid=add-to-cart]")
    browser.wait_for_selector("[data-testid=cart-count]:has-text('1')")

    # User checks out
    browser.goto("/checkout")
    browser.click("[data-testid=pay-now]")

    # Verify order completed
    browser.wait_for_url("/order-confirmation")
    order_id = browser.locator("[data-testid=order-id]").text_content()

    # Verify in Stripe
    charges = stripe_client.charges.list(limit=1)
    assert charges.data[0].metadata["order_id"] == order_id
```
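The `browser` fixture used throughout these workflows is assumed to be a browser-automation page object. A minimal sketch using Playwright's sync API, which provides the `goto`/`fill`/`click`/`wait_for_*` calls used above; Chromium and the staging base URL are assumptions, not requirements of this skill:

```python
import pytest
from playwright.sync_api import sync_playwright


@pytest.fixture(scope="session")
def browser():
    # One browser per session; base_url lets tests navigate with relative paths.
    with sync_playwright() as p:
        launched = p.chromium.launch()
        context = launched.new_context(base_url="https://staging.example.com")
        page = context.new_page()
        yield page
        launched.close()
```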
Pattern: CLI End-to-End
```python
import subprocess
import os

import pytest


def test_cli_full_workflow():
    """
    Level 3: CLI works with real credentials against real services

    Prerequisites:
    - LHCI_TOKEN set in environment
    - Test site deployed at https://staging.example.com
    """
    # Verify prerequisites
    if not os.environ.get("LHCI_TOKEN"):
        pytest.skip("LHCI_TOKEN not set")

    result = subprocess.run(
        [
            "hugolit",
            "run",
            "--url",
            "https://staging.example.com",
            "--upload",  # Uploads to LHCI server
        ],
        capture_output=True,
        text=True,
        env={**os.environ, "CI": "true"},
    )

    assert result.returncode == 0
    assert "Report uploaded" in result.stdout
```
What Level 3 Tests Prove
✅ Real credentials work
✅ Real third-party APIs behave as expected
✅ The full user workflow succeeds
✅ All integrations work together in a real environment
When Level 3 Tests Fail
Level 3 failures are the most serious because they mean users will experience failures. When a Level 3 test fails:
- Check credentials: Did they expire? Get rotated?
- Check third-party status: Is the external service down?
- Check test data: Did seed data get corrupted or deleted?
- Then look at your code
The Testing Decision Protocol
When you need to test a feature, execute these phases IN ORDER.
Phase 1: List the Guarantees You Need
Before writing any test, list what you need to guarantee:
### Guarantees Needed for "User Registration"
1. Email validation logic rejects invalid formats
2. Password hashing produces correct hashes
3. Database correctly stores and retrieves user records
4. Email service actually sends the welcome email
5. Complete signup flow works from the user's perspective
Phase 2: Assign Each Guarantee to a Level
| Guarantee | Level | Why This Level? |
|---|---|---|
| Email validation logic | Level 1 | Pure function, no dependencies |
| Password hashing | Level 1 | Pure function, deterministic |
| Database stores/retrieves users | Level 2 | Needs real database |
| Email service sends email | Level 2 | Needs email harness (Mailhog/sandbox) |
| Complete signup flow | Level 3 | Needs real credentials, real services |
Phase 3: Identify Harnesses (Level 2) and Credentials (Level 3)
Before writing Level 2 or Level 3 tests:
For Level 2, document your harnesses:
### Test Harnesses Required
- **PostgreSQL**: Docker container via `docker-compose.test.yml`
  - Start: `docker-compose -f docker-compose.test.yml up -d postgres`
  - Reset: `docker-compose exec postgres psql -c "DROP SCHEMA public CASCADE; CREATE SCHEMA public;"`
- **Email (Mailhog)**: Docker container captures all outgoing email
  - Start: `docker-compose -f docker-compose.test.yml up -d mailhog`
  - API: `http://localhost:8025/api/v2/messages`
For Level 3, document your credentials:
### Test Credentials Required
- **TEST_USER_EMAIL**: `test-automation@example.com`
- **TEST_USER_PASSWORD**: In 1Password vault "Engineering/Test Credentials"
- **SENDGRID_API_KEY**: In `.env.test` (get from team lead)
### Test Environment
- Staging URL: `https://staging.example.com`
- Test Stripe account: Uses `sk_test_*` keys (auto-set in staging)
Phase 4: Write Tests Bottom-Up
Start with Level 1. Only move to Level 2 when Level 1 is complete. Only move to Level 3 when Level 2 is complete.
```python
## test/unit/test_registration.py (Level 1)
def test_email_validation_rejects_invalid():
    assert validate_email("notanemail") is False
    assert validate_email("also@invalid") is False
    assert validate_email("valid@example.com") is True


def test_password_hashing_is_deterministic():
    hash1 = hash_password("secret123")
    hash2 = hash_password("secret123")
    assert verify_password("secret123", hash1)
    assert verify_password("secret123", hash2)


## test/integration/test_registration.py (Level 2)
def test_user_repository_creates_user(database):
    repo = UserRepository(database)
    user = repo.create(email="new@example.com", password_hash="hash123")

    retrieved = repo.find_by_email("new@example.com")
    assert retrieved.id == user.id


def test_email_service_sends_welcome_email(mailhog):
    service = EmailService(smtp_url=mailhog.smtp_url)
    service.send_welcome("recipient@example.com")

    messages = mailhog.get_messages()
    assert len(messages) == 1
    assert messages[0]["To"] == "recipient@example.com"


## test/e2e/test_registration.py (Level 3)
def test_complete_signup_workflow(browser, test_credentials):
    browser.goto("https://staging.example.com/signup")
    browser.fill("[name=email]", "e2e-test@example.com")
    browser.fill("[name=password]", "SecurePass123!")
    browser.click("[type=submit]")

    browser.wait_for_url("/welcome")
    assert "Welcome" in browser.title()
```
Anti-Patterns
Anti-Pattern: Mocking External Systems
```python
## ❌ NEVER DO THIS
from unittest.mock import patch


@patch("myapp.database.PostgresClient")
def test_saves_user(mock_db):
    mock_db.save.return_value = {"id": 1}
    result = save_user({"email": "test@example.com"})
    mock_db.save.assert_called_once()  # What did we prove? NOTHING.
```
Instead: Use dependency injection at Level 1, real database at Level 2.
Anti-Pattern: Skipping Levels
```python
## ❌ Going straight to E2E without unit/integration coverage
def test_full_checkout():
    # If this fails, you have no idea if it's:
    # - Your pricing logic (Level 1)
    # - Your database queries (Level 2)
    # - Your payment integration (Level 2)
    # - The third-party service (Level 3)
    # - Your test credentials (Level 3)
    ...
```
Instead: Build confidence from the bottom up.
Anti-Pattern: Guessing at Harnesses or Credentials
```python
## ❌ Assuming a database exists
def test_integration():
    db = connect("postgresql://localhost:5432/test")  # Does this exist? Who knows!


## ❌ Hardcoding credentials
def test_e2e():
    stripe = Stripe(api_key="sk_test_abc123")  # Will this work? For how long?
```
Instead: Document harnesses and credential sources explicitly. Ask if you don't know.
Anti-Pattern: Testing Implementation
```python
## ❌ Testing HOW, not WHAT
def test_uses_correct_query():
    repo = UserRepository(mock_db)
    repo.find_active_users()
    mock_db.query.assert_called_with("SELECT * FROM users WHERE active = true")


## ✅ Test the BEHAVIOR
def test_returns_only_active_users(database):
    seed_data(
        database,
        [
            {"email": "active@test.com", "active": True},
            {"email": "inactive@test.com", "active": False},
        ],
    )
    repo = UserRepository(database)
    users = repo.find_active_users()

    assert len(users) == 1
    assert users[0].email == "active@test.com"
```
Quick Reference: When to Use Each Level
| If you need to verify... | Use Level |
|---|---|
| Business logic correctness | 1 |
| Parsing/validation | 1 |
| Algorithm output | 1 |
| Error handling | 1 |
| Database queries work | 2 |
| HTTP calls work | 2 |
| CLI binary works | 2 |
| File I/O works | 2 |
| Full user workflow | 3 |
| Real credentials work | 3 |
| Production-like environment | 3 |
| Third-party service integration | 3 |
Checklist Before Declaring Tests Complete
- All critical guarantees have tests at the appropriate level
- Level 1 tests use DI, not mocking
- Level 2 harnesses are documented and reproducible
- Level 3 credentials are documented (not hardcoded)
- Tests verify behavior, not implementation
- Fast-failing: environment checks run first (see the sketch after this list)
- Progress tests in `specs/`, regression tests in `test/`
- All regression tests pass
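As an illustration of the fast-failing item, a Level 2 test module can guard on its external dependency before anything else runs (a sketch; `hugo` is just the binary from the earlier example):

```python
import shutil

import pytest

# Fail fast: skip the whole module if the Level 2 dependency is missing.
pytestmark = pytest.mark.skipif(
    shutil.which("hugo") is None,
    reason="hugo binary not found; install it or run only Level 1 tests",
)
```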
When You're Stuck: The Questions to Ask
For Level 1:
"Can I verify this behavior using only the test runner, language primitives, and dependency injection?"
If no → move to Level 2
For Level 2:
"What test harness do I need? How do I start/stop/reset it?"
If you don't know → STOP AND ASK THE USER
For Level 3:
"Where are the credentials? What test accounts exist?"
If you don't know → STOP AND ASK THE USER
Remember: The goal is not "passing tests." The goal is justified confidence that your code works in the real world. Every test should move you closer to that confidence. If a test doesn't, delete it.
