resilience-patterns

by yonatangross

The Complete AI Development Toolkit for Claude Code — 159 skills, 34 agents, 20 commands, 144 hooks. Production-ready patterns for FastAPI, React 19, LangGraph, security, and testing.

29🍴 4📅 Jan 23, 2026

SKILL.md


---
name: resilience-patterns
description: Production-grade fault tolerance for distributed systems. Use when implementing circuit breakers, retry with exponential backoff, bulkhead isolation patterns, or building resilience into LLM API integrations.
context: fork
agent: backend-system-architect
version: 1.0.0
author: OrchestKit AI Agent Hub
tags: [resilience, circuit-breaker, bulkhead, retry, fault-tolerance]
user-invocable: false
---

Resilience Patterns Skill

Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.

Overview

Use this skill when:

  • Building fault-tolerant multi-agent systems
  • Implementing LLM API integrations with proper error handling
  • Designing distributed workflows that need graceful degradation
  • Adding observability to failure scenarios
  • Protecting systems from cascade failures

Core Patterns

1. Circuit Breaker Pattern (reference: circuit-breaker.md)

Prevents cascade failures by "tripping" when a service exceeds failure thresholds.

+-------------------------------------------------------------------+
|                    Circuit Breaker States                         |
+-------------------------------------------------------------------+
|                                                                   |
|    +----------+     failures >= threshold    +----------+         |
|    |  CLOSED  | ---------------------------> |   OPEN   |         |
|    | (normal) |                              | (reject) |         |
|    +----+-----+                              +----+-----+         |
|         |                                         |               |
|         | success                    timeout      |               |
|         |                            expires      |               |
|         |         +------------+                  |               |
|         |         | HALF_OPEN  |<-----------------+               |
|         +---------+  (probe)   |                                  |
|                   +------------+                                  |
|                                                                   |
|   CLOSED:    Allow requests, count failures                       |
|   OPEN:      Reject immediately, return fallback                  |
|   HALF_OPEN: Allow probe request to test recovery                 |
|                                                                   |
+-------------------------------------------------------------------+

Key Configuration:

  • failure_threshold: Failures before opening (default: 5)
  • recovery_timeout: Seconds before attempting recovery (default: 30)
  • half_open_requests: Probes to allow in half-open (default: 1)
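The state machine and the three settings above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration, not the contents of scripts/circuit-breaker.py: the class and exception names, the injectable `clock` parameter, and the choice to count consecutive failures are all assumptions. It also reads `half_open_requests` as "number of successful probes required before closing", which is one plausible interpretation of that setting.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_requests=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self._clock = clock  # injectable so tests do not need to sleep
        self.state = State.CLOSED
        self._failures = 0          # consecutive failures while CLOSED
        self._opened_at = 0.0
        self._probe_successes = 0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open, call rejected")
            # Recovery timeout expired: let probe requests through.
            self.state = State.HALF_OPEN
            self._probe_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self):
        if self.state is State.HALF_OPEN:
            self._probe_successes += 1
            if self._probe_successes < self.half_open_requests:
                return  # keep probing before closing
        self.state = State.CLOSED
        self._failures = 0

    def _record_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise trip at the threshold.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A caller would typically catch `CircuitOpenError` and return a fallback value instead of waiting on the failing service. A production version would also need locking for concurrent callers.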

2. Bulkhead Pattern (reference: bulkhead-pattern.md)

Isolates failures by partitioning resources into independent pools.

+-------------------------------------------------------------------+
|                      Bulkhead Isolation                           |
+-------------------------------------------------------------------+
|                                                                   |
|   +------------------+  +------------------+                      |
|   | TIER 1: Critical |  | TIER 2: Standard |                      |
|   |  (5 workers)     |  |  (3 workers)     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  |#| |#| | |     |  |  |#| | | | |     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  | | | |         |  |  Queue: 2        |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  Queue: 0        |  +------------------+                      |
|   +------------------+                                            |
|                                                                   |
|   +------------------+                                            |
|   | TIER 3: Optional |   # = Active request                       |
|   |  (2 workers)     |     = Available slot                       |
|   |  +-+ +-+         |                                            |
|   |  |#| |#| FULL!   |   Tier 1: synthesis, quality_gate          |
|   |  +-+ +-+         |   Tier 2: analysis agents                  |
|   |  Queue: 3        |   Tier 3: enrichment, optional features    |
|   +------------------+                                            |
|                                                                   |
+-------------------------------------------------------------------+

Tier Configuration (OrchestKit):

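| Tier | Workers | Queue | Timeout | Use Case |
|------|---------|-------|---------|----------|
| 1 (Critical) | 5 | 10 | 300s | Synthesis, quality gate |
| 2 (Standard) | 3 | 5 | 120s | Content analysis agents |
| 3 (Optional) | 2 | 3 | 60s | Enrichment, caching |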
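As a rough illustration of how the tier settings could map onto code, the sketch below builds one semaphore-backed bulkhead per tier with `asyncio`. It is not the contents of scripts/bulkhead.py: the `Bulkhead` class, `BulkheadFullError`, `TIER_CONFIG`, and `make_tiers` are hypothetical names, and applying the timeout to execution time only (not to time spent queued) is an assumption.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when a tier's worker slots and queue are both full."""


class Bulkhead:
    def __init__(self, workers, queue_size, timeout):
        self._slots = asyncio.Semaphore(workers)   # concurrent workers
        self._max_in_flight = workers + queue_size
        self._in_flight = 0                        # running + queued requests
        self._timeout = timeout

    async def run(self, coro_fn, *args):
        # Reject immediately instead of queueing without bound.
        if self._in_flight >= self._max_in_flight:
            raise BulkheadFullError("tier saturated, request rejected")
        self._in_flight += 1
        try:
            async with self._slots:  # waits here while all workers are busy
                return await asyncio.wait_for(coro_fn(*args), self._timeout)
        finally:
            self._in_flight -= 1


# Values taken from the tier table above; the keys are illustrative.
TIER_CONFIG = {
    "critical": dict(workers=5, queue_size=10, timeout=300.0),
    "standard": dict(workers=3, queue_size=5, timeout=120.0),
    "optional": dict(workers=2, queue_size=3, timeout=60.0),
}


def make_tiers():
    """Build one Bulkhead per tier; call this from inside a running event loop."""
    return {name: Bulkhead(**cfg) for name, cfg in TIER_CONFIG.items()}
```

Because each tier owns its own semaphore, a saturated optional tier rejects new work without taking slots away from the critical tier.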

3. Retry Strategies (reference: retry-strategies.md)

Intelligent retry logic with exponential backoff and jitter.

+-------------------------------------------------------------------+
|                   Exponential Backoff + Jitter                    |
+-------------------------------------------------------------------+
|                                                                   |
|   Attempt 1:  --> X (fail)                                        |
|               wait: 1s +/- 0.5s                                   |
|                                                                   |
|   Attempt 2:  --> X (fail)                                        |
|               wait: 2s +/- 1s                                     |
|                                                                   |
|   Attempt 3:  --> X (fail)                                        |
|               wait: 4s +/- 2s                                     |
|                                                                   |
|   Attempt 4:  --> OK (success)                                    |
|                                                                   |
|   Formula: delay = min(base * 2^(attempt-1), max_delay) * jitter  |
|   Jitter:  random(0.5, 1.5) to prevent thundering herd            |
|                                                                   |
+-------------------------------------------------------------------+
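The delay schedule in the diagram can be computed with a small helper. This is a sketch under assumptions: attempts are numbered from 1 so that the nominal waits come out as 1s, 2s, 4s as drawn, `max_delay=60.0` is an illustrative cap that the skill does not specify, and the `rng` parameter exists only to make the function testable.

```python
import random


def backoff_delay(attempt, base=1.0, max_delay=60.0, rng=random.random):
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    capped = min(base * 2 ** (attempt - 1), max_delay)
    jitter = 0.5 + rng()  # uniform multiplier in [0.5, 1.5)
    return capped * jitter
```

With the default `base`, the first three waits fall in the ranges 0.5 to 1.5s, 1 to 3s, and 2 to 6s, which matches the "+/-" bands in the diagram. Multiplying by a random factor spreads out clients that failed at the same moment so they do not all retry together.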

Error Classification for Retries:

RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,  # HTTP status codes
    ConnectionError, TimeoutError,  # Network errors

    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",  # Retry only after truncating the input
}

NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,  # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
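One way to apply this classification is a predicate that a retry loop consults before waiting. The sketch below is illustrative only: `ApiError` is a hypothetical exception type standing in for whatever a client library raises, and `call_with_retry` is not the decorator from scripts/retry-handler.py. It leaves `context_length_exceeded` out of the automatic path on the assumption that the caller must shorten the input before trying again.

```python
import time

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}
RETRYABLE_CODES = {"rate_limit_exceeded", "model_overloaded"}


class ApiError(Exception):
    """Hypothetical error carrying an HTTP status and a provider error code."""

    def __init__(self, status=None, code=None):
        super().__init__(f"status={status} code={code}")
        self.status = status
        self.code = code


def is_retryable(exc):
    """True for transient failures that are worth another attempt."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    if isinstance(exc, ApiError):
        return exc.status in RETRYABLE_STATUS or exc.code in RETRYABLE_CODES
    return False


def call_with_retry(fn, max_attempts=4, base=1.0, sleep=time.sleep):
    """Call fn(), retrying transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on a permanent error.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            sleep(base * 2 ** (attempt - 1))  # jitter omitted for brevity
```

Non-retryable errors such as 401 or `invalid_api_key` surface on the first attempt, so a bad credential fails fast instead of consuming the retry budget.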

4. LLM-Specific Resilience (reference: llm-resilience.md)

Patterns specific to LLM API integrations.

+-------------------------------------------------------------------+
|                    LLM Fallback Chain                             |
+-------------------------------------------------------------------+
|                                                                   |
|   Request --> [Primary Model] --success--> Response               |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Fallback Model] --success--> Response              |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Cached Response] --hit--> Response                 |
|                     |                                             |
|                   miss                                            |
|                     v                                             |
|               [Default Response] --> Graceful Degradation         |
|                                                                   |
|   Example Chain:                                                  |
|   1. claude-sonnet-4-20250514 (primary)                           |
|   2. gpt-4o-mini (fallback)                                       |
|   3. Semantic cache lookup                                        |
|   4. "Analysis unavailable" + partial results                     |
|                                                                   |
+-------------------------------------------------------------------+
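The chain above reduces to "try each stage in order and return the first answer". The sketch below shows that control flow with plain callables rather than any real SDK: `run_fallback_chain`, the `(name, callable)` pairs, and `cache_lookup` are hypothetical stand-ins, and the stub functions in the usage example simulate a failing primary model.

```python
from typing import Callable, Optional, Sequence, Tuple


def run_fallback_chain(
    prompt: str,
    models: Sequence[Tuple[str, Callable[[str], str]]],
    cache_lookup: Callable[[str], Optional[str]],
    default: str = "Analysis unavailable",
) -> Tuple[str, str]:
    """Return (source, text) from the first stage that yields a response."""
    for name, call_model in models:
        try:
            return name, call_model(prompt)
        except Exception:
            continue  # real code would log and emit a metric before moving on
    cached = cache_lookup(prompt)
    if cached is not None:
        return "cache", cached
    return "default", default


# Usage with stubs: the primary fails, the fallback answers.
def primary(prompt):
    raise TimeoutError("primary model timed out")


def fallback(prompt):
    return f"summary of: {prompt}"


source, text = run_fallback_chain(
    "quarterly report",
    models=[("primary", primary), ("fallback", fallback)],
    cache_lookup=lambda p: None,
)
```

Returning the source alongside the text lets callers mark degraded responses, for example by attaching partial results when the source is "default".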

Token Budget Management:

+-------------------------------------------------------------------+
|                     Token Budget Guard                            |
+-------------------------------------------------------------------+
|                                                                   |
|   Input: 8,000 tokens                                             |
|   +---------------------------------------------+                 |
|   |#################################            |                 |
|   +---------------------------------------------+                 |
|                                          ^                        |
|                                          |                        |
|                                    Context Limit (16K)            |
|                                                                   |
|   Strategy when approaching limit:                                |
|   1. Summarize earlier context (compress 4:1)                     |
|   2. Drop low-priority content (optional fields)                  |
|   3. Split into multiple requests                                 |
|   4. Fail fast with "content too large" error                     |
|                                                                   |
+-------------------------------------------------------------------+
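Two of the strategies listed in the diagram, dropping low-priority content and failing fast, are simple enough to sketch directly. The code below is an illustration, not scripts/token-budget.py: `estimate_tokens` uses a rough "about 4 characters per token" heuristic that is an assumption rather than a real tokenizer, and `fit_to_budget` and `ContentTooLargeError` are hypothetical names. Summarization and request splitting are left out because they depend on the model and workload.

```python
class ContentTooLargeError(Exception):
    """Raised when even the required content exceeds the token limit."""


def estimate_tokens(text):
    # Rough heuristic, roughly 4 characters per token; swap in a real tokenizer.
    return max(1, len(text) // 4)


def fit_to_budget(required, optional, limit, count=estimate_tokens):
    """Return the texts to send, keeping the total within `limit` tokens.

    All `required` texts must fit. Each `optional` text is kept only if it
    still fits in the remaining budget, checked in the order given.
    """
    used = sum(count(text) for text in required)
    if used > limit:
        # Strategy 4: fail fast rather than send a request that will be rejected.
        raise ContentTooLargeError(f"content too large: {used} > {limit} tokens")
    kept = list(required)
    for text in optional:
        cost = count(text)
        if used + cost <= limit:  # Strategy 2: drop what does not fit
            kept.append(text)
            used += cost
    return kept
```

Checking the budget before the call avoids paying for a request that the provider would reject with a context-length error.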

Quick Reference

| Pattern | When to Use | Key Benefit |
|---------|-------------|-------------|
| Circuit Breaker | External service calls | Prevent cascade failures |
| Bulkhead | Multi-tenant/multi-agent | Isolate failures |
| Retry + Backoff | Transient failures | Automatic recovery |
| Fallback Chain | Critical operations | Graceful degradation |
| Token Budget | LLM calls | Cost control, prevent failures |

OrchestKit Integration Points

  1. Workflow Agents: Each agent wrapped with circuit breaker + bulkhead tier
  2. LLM Calls: All model invocations use fallback chain + retry logic
  3. External APIs: Circuit breaker on YouTube, arXiv, GitHub APIs
  4. Database Ops: Bulkhead isolation for read vs write operations

Files in This Skill

References (Conceptual Guides)

  • references/circuit-breaker.md - Deep dive on circuit breaker pattern
  • references/bulkhead-pattern.md - Bulkhead isolation strategies
  • references/retry-strategies.md - Retry algorithms and error classification
  • references/llm-resilience.md - LLM-specific patterns
  • references/error-classification.md - How to categorize errors

Templates (Code Patterns)

  • scripts/circuit-breaker.py - Ready-to-use circuit breaker class
  • scripts/bulkhead.py - Semaphore-based bulkhead implementation
  • scripts/retry-handler.py - Configurable retry decorator
  • scripts/llm-fallback-chain.py - Multi-model fallback pattern
  • scripts/token-budget.py - Token budget guard implementation

Examples

  • examples/orchestkit-workflow-resilience.md - Full OrchestKit integration example

Checklists

  • checklists/pre-deployment-resilience.md - Production readiness checklist
  • checklists/circuit-breaker-setup.md - Circuit breaker configuration guide

2026 Best Practices

  1. Adaptive Thresholds: Use sliding windows, not fixed counters
  2. Observability First: Every circuit trip = alert + metric + trace
  3. Graceful Degradation: Always have a fallback, even if partial
  4. Health Endpoints: Separate health check from circuit state
  5. Chaos Testing: Regularly test failure scenarios in staging
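To make the first practice concrete, here is one way a sliding-window failure rate could replace a fixed failure counter. It is a sketch under assumptions: the class name, the 60-second window, the `min_calls` guard, and the injectable `clock` are all illustrative choices, not settings taken from this skill.

```python
import time
from collections import deque


class SlidingWindowFailureRate:
    def __init__(self, window_seconds=60.0, min_calls=10, clock=time.monotonic):
        self._window = window_seconds
        self._min_calls = min_calls
        self._clock = clock
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        self._events.append((self._clock(), succeeded))
        self._evict()

    def _evict(self):
        # Drop outcomes that have aged out of the window.
        cutoff = self._clock() - self._window
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def failure_rate(self):
        self._evict()
        if len(self._events) < self._min_calls:
            return 0.0  # too few samples to judge
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)
```

A breaker built on this would trip when `failure_rate()` crosses a ratio such as 0.5, so old failures stop counting against a service once they leave the window.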

Related Skills

  • observability-monitoring - Metrics and alerting for circuit breaker state changes
  • caching-strategies - Cache as fallback layer in degradation scenarios
  • error-handling-rfc9457 - Structured error responses for resilience failures
  • background-jobs - Async processing with retry and failure handling

Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Circuit breaker recovery | Half-open probe | Gradual recovery, prevents immediate re-failure |
| Retry algorithm | Exponential backoff + jitter | Prevents thundering herd, respects rate limits |
| Bulkhead isolation | Semaphore-based tiers | Simple, efficient, prioritizes critical operations |
| LLM fallback | Model chain with cache | Graceful degradation, cost optimization, availability |

Capability Details

circuit-breaker

Keywords: circuit breaker, failure threshold, cascade failure, trip, half-open

Solves:

  • Prevent cascade failures when external services fail
  • Automatically recover when services come back online
  • Fail fast instead of waiting for timeouts

bulkhead

Keywords: bulkhead, isolation, semaphore, thread pool, resource pool, tier

Solves:

  • Isolate failures to prevent entire system crashes
  • Prioritize critical operations over optional ones
  • Limit concurrent requests to protect resources

retry-strategies

Keywords: retry, backoff, exponential, jitter, thundering herd

Solves:

  • Handle transient failures automatically
  • Avoid overwhelming recovering services
  • Classify errors as retryable vs non-retryable

llm-resilience

Keywords: LLM, fallback, model, token budget, rate limit, context length

Solves:

  • Handle LLM API rate limits gracefully
  • Fall back to alternative models when primary fails
  • Manage token budgets to prevent context overflow

error-classification

Keywords: error, retryable, transient, permanent, classification

Solves:

  • Determine which errors should be retried
  • Categorize errors by severity and recoverability
  • Map HTTP status codes to resilience actions
