chaos-experiment

by rjmurillo

Multi-agent system for software development


SKILL.md


---
name: chaos-experiment
description: Design and document chaos engineering experiments. Guide steady state baseline, hypothesis formation, failure injection plans, and results analysis. Use for resilience testing, game days, failure injection experiments, and building confidence in system stability.
license: MIT
metadata:
  version: 1.0.0
  model: claude-sonnet-4-5
---

Chaos Experiment Designer

Design rigorous chaos engineering experiments that build confidence in system resilience.

Triggers

  • "chaos experiment"
  • "test resilience"
  • "failure injection"
  • "resilience testing"
  • "game day"
  • "chaos engineering"

Quick Reference

| Phase | Purpose | Output |
|---|---|---|
| 1. Scope | Define system boundaries and objectives | System under test, success criteria |
| 2. Baseline | Establish steady state metrics | Quantified normal behavior |
| 3. Hypothesis | Form falsifiable hypothesis | Clear prediction statement |
| 4. Injection | Design failure scenarios | Injection plan with blast radius |
| 5. Execute | Run controlled experiment | Observation log |
| 6. Analyze | Compare actual vs expected | Findings and action items |

Core Principles

Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.

The Five Principles

  1. Steady State Focus: Measure observable outputs (throughput, error rates, latency percentiles), not internal metrics
  2. Real-World Variables: Introduce disruptions that simulate actual failure modes
  3. Production Testing: Experiment on live systems with real traffic patterns
  4. Continuous Automation: Build experiments into CI/CD pipelines
  5. Blast Radius Containment: Minimize customer impact through careful scoping

Process

Phase 1: Scope Definition

Define the experiment boundaries.

Inputs: System architecture, historical incidents, monitoring data

Questions to Answer:

  1. What system or subsystem will we test?
  2. What is our business justification for this experiment?
  3. Who are the stakeholders and who must approve?
  4. What is the maximum acceptable customer impact?
  5. What time window is safest for execution?

Output: Scoped experiment definition with stakeholder sign-off

Phase 2: Establish Baseline

Quantify normal system behavior.

Collect Steady State Metrics:

| Metric Category | Examples | Collection Period |
|---|---|---|
| Throughput | Requests/second, transactions/minute | 7-30 days |
| Error Rates | 4xx rate, 5xx rate, exception count | 7-30 days |
| Latency | P50, P95, P99 response times | 7-30 days |
| Resource | CPU%, Memory%, Disk I/O, Network I/O | 7-30 days |
| Business | Orders/hour, active sessions, conversion rate | 7-30 days |

Define Tolerance Thresholds:

  • Green: Within normal variance (baseline +/- 1 standard deviation)
  • Yellow: Elevated but acceptable (baseline +/- 2 standard deviations)
  • Red: Unacceptable degradation (exceeds 2 standard deviations)
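A minimal sketch of deriving these bands from collected baseline samples; the metric choice and sample values below are illustrative and not part of the skill's bundled scripts:

```python
from statistics import mean, stdev

def tolerance_bands(samples):
    """Derive green/yellow/red thresholds from baseline samples (mean +/- 1 and 2 std devs)."""
    mu, sigma = mean(samples), stdev(samples)
    return {
        "green": (mu - sigma, mu + sigma),           # within normal variance
        "yellow": (mu - 2 * sigma, mu + 2 * sigma),  # elevated but acceptable
        "red_above": mu + 2 * sigma,                 # unacceptable degradation
    }

# Illustrative P99 latency samples (ms) collected over the baseline window
baseline_p99_ms = [430, 445, 452, 460, 438, 447, 455]
print(tolerance_bands(baseline_p99_ms))
```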

Output: Baseline document with metric values and thresholds

Phase 3: Form Hypothesis

Create a falsifiable hypothesis.

Hypothesis Template:

Given [system in steady state],
When [specific failure is injected],
Then [system behavior remains within tolerance]
Because [specific resilience mechanism exists].

Example Hypotheses:

  • "Given our API gateway in steady state, when we terminate 50% of backend instances, then P99 latency remains under 500ms because auto-scaling will provision replacements within 60 seconds."
  • "Given our payment service in steady state, when we introduce 500ms network latency to the database, then order completion rate remains above 99% because connection pooling and retry logic handle transient delays."

Hypothesis Quality Checklist:

  • Specific failure mode identified
  • Quantifiable success criteria defined
  • Underlying resilience mechanism named
  • Timeframe for expected recovery stated
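One way to keep a hypothesis falsifiable against this checklist is to record it as structured data rather than free text. The field names below are an illustrative sketch, not a schema defined by this skill:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    steady_state: str      # Given: system in steady state
    failure: str           # When: specific failure injected
    expectation: str       # Then: behavior stays within tolerance
    mechanism: str         # Because: named resilience mechanism
    success_metric: str    # quantifiable criterion, e.g. "P99 < 500ms"
    recovery_seconds: int  # expected recovery timeframe

h = Hypothesis(
    steady_state="API gateway serving baseline traffic",
    failure="terminate 50% of backend instances",
    expectation="P99 latency remains under 500ms",
    mechanism="auto-scaling provisions replacements",
    success_metric="P99 < 500ms",
    recovery_seconds=60,
)
```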

Output: Documented hypothesis with measurable predictions

Phase 4: Design Injection Plan

Plan the controlled failure injection.

Common Failure Categories:

| Category | Examples | Tools |
|---|---|---|
| Instance Failure | Kill process, terminate VM, evict pod | chaos-monkey, kill, kubectl delete |
| Network | Partition, latency, packet loss, DNS failure | tc, iptables, toxiproxy, chaos-mesh |
| Resource Exhaustion | CPU spike, memory pressure, disk fill | stress-ng, dd, memory hogs |
| Dependency | External service unavailable, slow response | fault injection proxy, mock services |
| Time | Clock skew, NTP failure | faketime, chrony manipulation |
| State | Data corruption, cache invalidation | Custom scripts |

Injection Plan Elements:

  1. Failure Type: Precise description of what will be broken
  2. Injection Method: Tool and exact commands to use
  3. Scope: Which instances/services/regions affected
  4. Duration: How long the failure persists
  5. Ramp-up: Gradual vs immediate injection
  6. Rollback: How to instantly restore normal operation
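As an illustration of elements 1-6, the sketch below scripts a 500ms latency injection and its rollback with tc netem, assuming a Linux host, an eth0 interface, and permission to run tc; adapt the interface, delay, and duration to your own plan:

```python
import subprocess

INTERFACE = "eth0"  # illustrative; use the interface carrying the targeted traffic

def inject_latency(delay_ms: int) -> None:
    """Add fixed latency to all egress traffic on the interface via tc netem."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem", "delay", f"{delay_ms}ms"],
        check=True,
    )

def rollback() -> None:
    """Remove the netem qdisc, instantly restoring normal operation."""
    subprocess.run(["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"], check=True)

try:
    inject_latency(500)
    # ... observe the system for the planned duration ...
finally:
    rollback()  # rollback runs even if observation fails or abort criteria fire
```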

Blast Radius Containment:

  • Start with smallest possible scope (single instance)
  • Use canary deployment pattern for experiments
  • Define automatic abort criteria
  • Have rollback ready before starting
  • Notify on-call before and after

Output: Detailed injection plan with rollback procedures

Phase 5: Execute Experiment

Run the controlled experiment.

Pre-Execution Checklist:

  • Stakeholders notified
  • On-call team aware
  • Monitoring dashboards ready
  • Rollback procedure tested
  • Customer support briefed (for production)
  • Automatic abort criteria configured

During Execution:

  1. Record experiment start timestamp
  2. Monitor all baseline metrics in real-time
  3. Log observations with timestamps
  4. If abort criteria met, execute rollback immediately
  5. Record experiment end timestamp
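A minimal sketch of that monitoring loop; the metric source, the 800ms abort threshold, and the duration are placeholders to replace with your own metrics queries and abort criteria:

```python
import time
from datetime import datetime, timezone

ABORT_P99_MS = 800           # illustrative abort criterion
EXPERIMENT_DURATION_S = 180  # illustrative experiment duration

def log(event: str) -> None:
    """Append a timestamped line in the observation log format below."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M:%S")
    print(f"[{stamp}] - {event}")

def read_p99_latency_ms() -> float:
    """Placeholder: replace with a query to your monitoring system."""
    return 450.0  # dummy value so the sketch runs end to end

log("Experiment started")
start = time.monotonic()
while time.monotonic() - start < EXPERIMENT_DURATION_S:
    p99 = read_p99_latency_ms()
    log(f"P99 latency: {p99:.0f}ms")
    if p99 > ABORT_P99_MS:
        log("Abort criteria met, executing rollback")
        break
    time.sleep(15)
log("Experiment ended")
```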

Observation Log Format:

[HH:MM:SS] - [Metric/Event]: [Value/Description]
[00:00:00] - Experiment started: Injected 500ms latency to database connection
[00:00:15] - P99 latency: 450ms -> 650ms
[00:00:30] - Circuit breaker: OPEN on database connection pool
[00:01:00] - Retry queue depth: 0 -> 247
[00:01:30] - Auto-recovery initiated
[00:02:00] - P99 latency: 650ms -> 480ms
[00:02:30] - Circuit breaker: CLOSED
[00:03:00] - Experiment ended: Removed latency injection

Output: Timestamped observation log

Phase 6: Analyze Results

Compare actual behavior against hypothesis.

Analysis Questions:

  1. Did system behavior stay within tolerance thresholds?
  2. Did resilience mechanisms activate as expected?
  3. What was the actual recovery time?
  4. Were there any unexpected cascading effects?
  5. Did monitoring and alerting work correctly?

Verdict Options:

| Verdict | Meaning | Action |
|---|---|---|
| VALIDATED | Hypothesis confirmed | Document and expand scope |
| INVALIDATED | Hypothesis falsified | File bugs, prioritize fixes |
| INCONCLUSIVE | Unable to determine | Refine experiment design |
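A hedged sketch of turning observed metric peaks into one of these verdicts, reusing the red thresholds from Phase 2; treating missing observations as INCONCLUSIVE is one possible interpretation, and the values are illustrative:

```python
def verdict(observed: dict, red_thresholds: dict) -> str:
    """Compare observed metric peaks against the red thresholds from the baseline."""
    missing = [m for m in red_thresholds if m not in observed]
    if missing:
        return "INCONCLUSIVE"  # not enough data to judge; refine the experiment
    breached = [m for m, limit in red_thresholds.items() if observed[m] > limit]
    return "INVALIDATED" if breached else "VALIDATED"

# Illustrative values
red = {"p99_latency_ms": 560, "error_rate_pct": 1.0}
observed_peaks = {"p99_latency_ms": 650, "error_rate_pct": 0.4}
print(verdict(observed_peaks, red))  # INVALIDATED: latency breached tolerance
```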

Finding Categories:

  • Resilience Strengths: Mechanisms that worked as designed
  • Weaknesses Discovered: Gaps in resilience that need fixing
  • Monitoring Gaps: Missing visibility during incident
  • Documentation Gaps: Runbooks or procedures that need updating
  • Unexpected Behaviors: System responses not predicted

Output: Analysis document with prioritized action items

Scripts

| Script | Purpose | Usage |
|---|---|---|
| generate_experiment.py | Create experiment document from inputs | python scripts/generate_experiment.py --name "API Gateway Resilience" |
| validate_experiment.py | Validate experiment document completeness | python scripts/validate_experiment.py path/to/experiment.md |

Exit Codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General failure |
| 2 | Invalid arguments |
| 10 | Validation failure (missing required sections) |
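For example, a CI step might gate on these documented exit codes; this wrapper is illustrative and not part of the skill's scripts:

```python
import subprocess
import sys

# Run the bundled validator and branch on its documented exit codes
result = subprocess.run(
    [sys.executable, "scripts/validate_experiment.py", "path/to/experiment.md"]
)
if result.returncode == 10:
    print("Experiment document is missing required sections")
elif result.returncode != 0:
    print(f"Validation failed with exit code {result.returncode}")
```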

Output Directory

Experiments are saved to: .agents/chaos/

.agents/chaos/
  YYYY-MM-DD-experiment-name.md
  YYYY-MM-DD-experiment-name-results.md
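A small sketch of producing file paths in this naming scheme; the slug helper is an assumption for illustration and is not one of the bundled scripts:

```python
from datetime import date
from pathlib import Path

def experiment_paths(name: str, base: str = ".agents/chaos") -> tuple[Path, Path]:
    """Build the experiment and results paths in the documented naming scheme."""
    slug = name.lower().replace(" ", "-")
    prefix = f"{date.today():%Y-%m-%d}-{slug}"
    root = Path(base)
    return root / f"{prefix}.md", root / f"{prefix}-results.md"

print(experiment_paths("Database Failover Resilience"))
# e.g. .agents/chaos/<today>-database-failover-resilience.md and ...-results.md
```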

Anti-Patterns

| Avoid | Why | Instead |
|---|---|---|
| Testing in staging only | Production has different traffic patterns | Start small in production |
| No rollback plan | Cannot recover if things go wrong | Define rollback before starting |
| Vague hypothesis | Cannot determine success | Use quantifiable predictions |
| Measuring internal metrics only | Does not reflect customer experience | Focus on observable outputs |
| Big bang experiments | Blast radius too large | Start with smallest scope |
| No baseline | Cannot compare results | Collect 7+ days of metrics first |
| Skipping stakeholder buy-in | Creates political problems | Get approval before execution |

Templates

Experiment Document Template

Use templates/experiment-template.md or generate with:

python scripts/generate_experiment.py \
  --name "Database Failover Resilience" \
  --system "Payment Service" \
  --owner "Jane Smith" \
  --output .agents/chaos/

Verification Checklist

Before executing any chaos experiment:

  • Scope clearly defined with business justification
  • Baseline metrics collected (minimum 7 days)
  • Hypothesis is falsifiable with quantifiable criteria
  • Injection plan includes specific tools and commands
  • Blast radius is contained to acceptable scope
  • Rollback procedure is documented and tested
  • Stakeholders have approved the experiment
  • On-call team is aware of timing
  • Monitoring dashboards are ready
  • Results template is prepared

Extension Points

  1. Failure Categories: Add new failure types to Phase 4 table
  2. Tools Integration: Extend scripts to integrate with chaos-mesh, Gremlin, LitmusChaos
  3. Automation: Integrate with CI/CD for continuous chaos testing
  4. Metrics Sources: Add integrations for Prometheus, Datadog, New Relic
  5. Scheduling: Add calendar integration for recurring game days

Related Skills

| Skill | Relationship |
|---|---|
| security | Security review for production experiments |
| devops | CI/CD integration for automated chaos |
| qa | Test strategy alignment |
| analyst | Root cause analysis of findings |
