
llm-testing

by yonatangross

The Complete AI Development Toolkit for Claude Code — 159 skills, 34 agents, 20 commands, 144 hooks. Production-ready patterns for FastAPI, React 19, LangGraph, security, and testing.

⭐ 29 · 🍴 4 · 📅 Jan 23, 2026

SKILL.md


---
name: llm-testing
description: Testing patterns for LLM-based applications. Use when testing AI/ML integrations, mocking LLM responses, testing async timeouts, or validating structured outputs from LLMs.
context: fork
agent: test-generator
version: 2.0.0
tags: [testing, llm, ai, deepeval, ragas, 2026]
author: OrchestKit
user-invocable: false
---

LLM Testing Patterns

Test AI applications with deterministic patterns using DeepEval and RAGAS.

Quick Reference

Mock LLM Responses

from unittest.mock import AsyncMock, patch

import pytest

@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
    assert result["summary"] is not None

DeepEval Quality Testing

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
]

assert_test(test_case, metrics)

Timeout Testing

import asyncio
import pytest

@pytest.mark.asyncio
async def test_respects_timeout():
    with pytest.raises(asyncio.TimeoutError):
        async with asyncio.timeout(0.1):
            await slow_llm_call()
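
The test above cancels a hung call from the outside. It is also worth asserting that the code under test enforces its own deadline and degrades gracefully. A minimal sketch, assuming a hypothetical summarize_with_timeout wrapper that catches the timeout internally and returns a fallback payload:

import asyncio
import pytest

@pytest.mark.asyncio
async def test_internal_timeout_returns_fallback(mock_llm):
    # Simulate the model call timing out inside the wrapper
    mock_llm.side_effect = asyncio.TimeoutError
    # summarize_with_timeout is hypothetical: it should catch the timeout itself
    result = await summarize_with_timeout(sample_findings, llm=mock_llm)
    assert result["summary"] == "Summary unavailable"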

Quality Metrics (2026)

| Metric | Threshold | Purpose |
|---|---|---|
| Answer Relevancy | ≥ 0.7 | Response addresses the question |
| Faithfulness | ≥ 0.8 | Output matches the retrieved context |
| Hallucination | ≤ 0.3 | No fabricated facts |
| Context Precision | ≥ 0.7 | Retrieved contexts are relevant |
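
The same dimensions can be scored with RAGAS over a small evaluation set. A minimal sketch, assuming the classic evaluate() API and a Hugging Face Dataset; metric imports, column names, and dict-style result access vary across RAGAS versions:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness, context_precision

eval_set = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris is the capital of France."],
})

scores = evaluate(eval_set, metrics=[answer_relevancy, faithfulness, context_precision])

# Enforce the thresholds from the table above
assert scores["answer_relevancy"] >= 0.7
assert scores["faithfulness"] >= 0.8
assert scores["context_precision"] >= 0.7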

Anti-Patterns (FORBIDDEN)

# ❌ NEVER test against live LLM APIs in CI
response = await openai.chat.completions.create(...)

# ❌ NEVER use random seeds (non-deterministic)
model.generate(seed=random.randint(0, 100))

# ❌ NEVER skip timeout handling
await llm_call()  # No timeout!

# ✅ ALWAYS mock LLM in unit tests
with patch("app.llm", mock_llm):
    result = await function_under_test()

# ✅ ALWAYS use VCR.py for integration tests
@pytest.mark.vcr()
async def test_llm_integration():
    ...
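
When cassettes are recorded for those VCR-backed integration tests, keep credentials out of them. A minimal sketch using the vcr_config fixture supported by pytest-recording / pytest-vcr; the header names are assumptions, adjust them for your provider:

import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        # Strip secrets so cassettes are safe to commit
        "filter_headers": ["authorization", "x-api-key"],
        # Record once, then replay deterministically in CI
        "record_mode": "once",
    }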

Key Decisions

| Decision | Recommendation |
|---|---|
| Mock vs VCR | VCR for integration, mock for unit |
| Timeout | Always test with < 1s timeout |
| Schema validation | Test both valid and invalid |
| Edge cases | Test all null/empty paths |
| Quality metrics | Use multiple dimensions (3-5) |

Detailed Documentation

| Resource | Description |
|---|---|
| references/deepeval-ragas-api.md | DeepEval & RAGAS API reference |
| examples/test-patterns.md | Complete test examples |
| checklists/llm-test-checklist.md | Setup and review checklists |
| scripts/llm-test-template.py | Starter test template |

Related skills:
  • vcr-http-recording - Record LLM responses
  • llm-evaluation - Quality assessment
  • unit-testing - Test fundamentals

Capability Details

llm-response-mocking

Keywords: mock LLM, fake response, stub LLM, mock AI
Solves:

  • Mock LLM responses in tests
  • Create deterministic AI test fixtures
  • Avoid live API calls in CI
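
The fixture in Quick Reference covers single request/response calls. For streaming endpoints a plain return value is not enough, because the caller iterates over chunks. A minimal sketch, assuming the code under test consumes the model with async for; the .stream() attribute is illustrative, not a specific SDK API:

from unittest.mock import AsyncMock

import pytest

async def _fake_stream():
    # Yield deterministic chunks the way a streaming LLM client would
    for chunk in ["Mocked ", "streaming ", "response"]:
        yield chunk

@pytest.fixture
def mock_streaming_llm():
    mock = AsyncMock()
    # Hypothetical .stream() API; replace with whatever your client exposes
    mock.stream = lambda *args, **kwargs: _fake_stream()
    return mock

@pytest.mark.asyncio
async def test_streams_all_chunks(mock_streaming_llm):
    chunks = [c async for c in mock_streaming_llm.stream()]
    assert "".join(chunks) == "Mocked streaming response"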

async-timeout-testing

Keywords: timeout, async test, wait for, polling
Solves:

  • Test async LLM operations
  • Handle timeout scenarios
  • Implement polling assertions
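
For the polling-assertion case, a minimal sketch that retries a condition until it holds or a deadline passes; the helper name, interval, and job fixture are illustrative:

import asyncio

import pytest

async def wait_for_condition(predicate, timeout=1.0, interval=0.05):
    # Poll until the predicate is true; raises TimeoutError at the deadline
    async with asyncio.timeout(timeout):
        while not predicate():
            await asyncio.sleep(interval)

@pytest.mark.asyncio
async def test_background_synthesis_completes(job):
    await job.start()  # `job` is a hypothetical fixture wrapping an async task
    await wait_for_condition(lambda: job.status == "done", timeout=0.5)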

structured-output-validation

Keywords: structured output, JSON validation, schema validation, output format
Solves:

  • Validate structured LLM output
  • Test JSON schema compliance
  • Assert output structure
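
A minimal sketch of schema validation with Pydantic v2, covering both the valid and the invalid path; the Finding model and its fields are assumptions for illustration:

import pytest
from pydantic import BaseModel, ValidationError

class Finding(BaseModel):
    summary: str
    confidence: float

def test_valid_output_parses():
    parsed = Finding.model_validate({"summary": "Paris is the capital.", "confidence": 0.9})
    assert 0.0 <= parsed.confidence <= 1.0

def test_invalid_output_is_rejected():
    with pytest.raises(ValidationError):
        # Missing/None required fields must fail validation
        Finding.model_validate({"summary": None})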

deepeval-assertions

Keywords: DeepEval, assert_test, LLMTestCase, metric assertion
Solves:

  • Use DeepEval for LLM assertions
  • Implement metric-based tests
  • Configure quality thresholds

golden-dataset-testing

Keywords: golden dataset, golden test, reference output, expected output
Solves:

  • Test against golden datasets
  • Compare with reference outputs
  • Implement regression testing
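
A minimal sketch of golden-dataset regression testing, assuming cases live in a JSON file with id, input, and expected_summary fields (all illustrative), and that the model is mocked or replayed from a cassette so runs stay deterministic:

import json
import pathlib

import pytest

GOLDEN_CASES = json.loads(pathlib.Path("tests/golden/synthesis.json").read_text())

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["id"])
@pytest.mark.asyncio
async def test_matches_golden_output(case):
    # Run against a mocked or VCR-recorded model so the comparison is deterministic
    result = await synthesize_findings(case["input"])
    assert result["summary"] == case["expected_summary"]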

vcr-recording

Keywords: VCR, cassette, record, replay, HTTP recording
Solves:

  • Record LLM API responses
  • Replay recordings in tests
  • Create deterministic test suites

Score

Total Score: 75/100, based on repository quality metrics.

| Criterion | Requirement | Points |
|---|---|---|
| SKILL.md | A SKILL.md file is included | +20 |
| LICENSE | A license is configured | +10 |
| Description | Description of 100+ characters | +10 |
| Popularity | 100+ GitHub stars | 0/15 |
| Recent activity | Updated within the last month | +10 |
| Forks | Forked 10+ times | 0/5 |
| Issue management | Fewer than 50 open issues | +5 |
| Language | Programming language is set | +5 |
| Tags | At least one tag is set | +5 |