
---
name: chaos-experiment
description: Design and document chaos engineering experiments. Guide steady state baseline, hypothesis formation, failure injection plans, and results analysis. Use for resilience testing, game days, failure injection experiments, and building confidence in system stability.
license: MIT
metadata:
  version: 1.0.0
  model: claude-sonnet-4-5
---
# Chaos Experiment Designer
Design rigorous chaos engineering experiments that build confidence in system resilience.
## Triggers
- "chaos experiment"
- "test resilience"
- "failure injection"
- "resilience testing"
- "game day"
- "chaos engineering"
## Quick Reference
| Phase | Purpose | Output |
|---|---|---|
| 1. Scope | Define system boundaries and objectives | System under test, success criteria |
| 2. Baseline | Establish steady state metrics | Quantified normal behavior |
| 3. Hypothesis | Form falsifiable hypothesis | Clear prediction statement |
| 4. Injection | Design failure scenarios | Injection plan with blast radius |
| 5. Execute | Run controlled experiment | Observation log |
| 6. Analyze | Compare actual vs expected | Findings and action items |
## Core Principles
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.
### The Five Principles
- Steady State Focus: Measure observable outputs (throughput, error rates, latency percentiles), not internal metrics
- Real-World Variables: Introduce disruptions that simulate actual failure modes
- Production Testing: Experiment on live systems with real traffic patterns
- Continuous Automation: Build experiments into CI/CD pipelines
- Blast Radius Containment: Minimize customer impact through careful scoping
## Process
### Phase 1: Scope Definition
Define the experiment boundaries.
Inputs: System architecture, historical incidents, monitoring data
Questions to Answer:
- What system or subsystem will we test?
- What is our business justification for this experiment?
- Who are the stakeholders and who must approve?
- What is the maximum acceptable customer impact?
- What time window is safest for execution?
Output: Scoped experiment definition with stakeholder sign-off
### Phase 2: Establish Baseline
Quantify normal system behavior.
Collect Steady State Metrics:
| Metric Category | Examples | Collection Period |
|---|---|---|
| Throughput | Requests/second, transactions/minute | 7-30 days |
| Error Rates | 4xx rate, 5xx rate, exception count | 7-30 days |
| Latency | P50, P95, P99 response times | 7-30 days |
| Resource | CPU%, Memory%, Disk I/O, Network I/O | 7-30 days |
| Business | Orders/hour, active sessions, conversion rate | 7-30 days |
Define Tolerance Thresholds (a classification sketch follows this list):
- Green: Within normal variance (baseline +/- 1 standard deviation)
- Yellow: Elevated but acceptable (baseline +/- 2 standard deviations)
- Red: Unacceptable degradation (exceeds 2 standard deviations)
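To make the bands concrete, here is a minimal sketch that computes a baseline mean and standard deviation from historical samples and classifies a live reading as green, yellow, or red. The metric, sample values, and data source are illustrative assumptions, not part of this skill's scripts.

```python
"""Minimal sketch: classify a metric reading against its baseline.

Assumes historical samples (e.g. hourly P99 latency in ms) have already
been exported from your monitoring system; the numbers are illustrative.
"""
from statistics import mean, stdev

def classify(reading: float, samples: list[float]) -> str:
    """Return 'green', 'yellow', or 'red' per the tolerance bands above."""
    baseline = mean(samples)
    sigma = stdev(samples)
    deviation = abs(reading - baseline)
    if deviation <= 1 * sigma:
        return "green"   # within normal variance
    if deviation <= 2 * sigma:
        return "yellow"  # elevated but acceptable
    return "red"         # unacceptable degradation

# Example: a week of hourly P99 latency samples (ms), then a live reading.
history = [412.0, 430.5, 445.2, 401.8, 420.0, 455.3, 438.7]
print(classify(462.0, history))
```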
Output: Baseline document with metric values and thresholds
### Phase 3: Form Hypothesis
Create a falsifiable hypothesis.
Hypothesis Template:
```text
Given [system in steady state],
When [specific failure is injected],
Then [system behavior remains within tolerance]
Because [specific resilience mechanism exists].
```
Example Hypotheses:
- "Given our API gateway in steady state, when we terminate 50% of backend instances, then P99 latency remains under 500ms because auto-scaling will provision replacements within 60 seconds."
- "Given our payment service in steady state, when we introduce 500ms network latency to the database, then order completion rate remains above 99% because connection pooling and retry logic handle transient delays."
Hypothesis Quality Checklist:
- Specific failure mode identified
- Quantifiable success criteria defined
- Underlying resilience mechanism named
- Timeframe for expected recovery stated
Output: Documented hypothesis with measurable predictions
### Phase 4: Design Injection Plan
Plan the controlled failure injection.
Common Failure Categories:
| Category | Examples | Tools |
|---|---|---|
| Instance Failure | Kill process, terminate VM, evict pod | chaos-monkey, kill, kubectl delete |
| Network | Partition, latency, packet loss, DNS failure | tc, iptables, toxiproxy, chaos-mesh |
| Resource Exhaustion | CPU spike, memory pressure, disk fill | stress-ng, dd, memory hogs |
| Dependency | External service unavailable, slow response | fault injection proxy, mock services |
| Time | Clock skew, NTP failure | faketime, chrony manipulation |
| State | Data corruption, cache invalidation | Custom scripts |
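As an illustration of the Network category above, the sketch below wraps `tc netem` to add and later remove 500ms of latency on a host interface. It assumes a Linux host, root privileges, and an interface named `eth0` with no existing qdisc; in practice, prefer your chaos tooling's managed equivalents over ad-hoc shell calls.

```python
"""Minimal sketch: inject and roll back network latency with tc netem.

Assumptions: Linux host, root privileges, interface name "eth0", and no
other qdisc already configured on that interface.
"""
import subprocess

INTERFACE = "eth0"  # assumption; substitute the interface under test

def inject_latency(delay_ms: int) -> None:
    # Add a netem qdisc that delays all egress traffic on the interface.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem",
         "delay", f"{delay_ms}ms"],
        check=True,
    )

def rollback() -> None:
    # Remove the netem qdisc, restoring normal operation instantly.
    subprocess.run(
        ["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"],
        check=True,
    )

if __name__ == "__main__":
    inject_latency(500)   # Phase 4 injection
    # ... observe the system (Phase 5) ...
    rollback()            # have this tested and ready before starting
```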
Injection Plan Elements:
- Failure Type: Precise description of what will be broken
- Injection Method: Tool and exact commands to use
- Scope: Which instances/services/regions affected
- Duration: How long the failure persists
- Ramp-up: Gradual vs immediate injection
- Rollback: How to instantly restore normal operation
Blast Radius Containment:
- Start with smallest possible scope (single instance)
- Use canary deployment pattern for experiments
- Define automatic abort criteria
- Have rollback ready before starting
- Notify on-call before and after
Output: Detailed injection plan with rollback procedures
### Phase 5: Execute Experiment
Run the controlled experiment.
Pre-Execution Checklist:
- Stakeholders notified
- On-call team aware
- Monitoring dashboards ready
- Rollback procedure tested
- Customer support briefed (for production)
- Automatic abort criteria configured
During Execution:
- Record experiment start timestamp
- Monitor all baseline metrics in real-time
- Log observations with timestamps
- If abort criteria are met, execute rollback immediately (see the abort-check sketch after this list)
- Record experiment end timestamp
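A minimal sketch of that abort check follows. The `get_p99_latency_ms()` and `rollback()` hooks are hypothetical stand-ins for your monitoring integration and injection tooling, and the thresholds are illustrative.

```python
"""Minimal sketch: poll a steady-state metric and abort on a red threshold.

get_p99_latency_ms() and rollback() are hypothetical hooks wired to your
monitoring system and injection tooling; the thresholds are illustrative.
"""
import time

ABORT_THRESHOLD_MS = 800.0    # assumed red threshold from the baseline
POLL_INTERVAL_S = 15
EXPERIMENT_DURATION_S = 180

def run_experiment(get_p99_latency_ms, rollback) -> None:
    start = time.monotonic()

    def log(event: str) -> None:
        # Elapsed-time entry in the observation log format shown below.
        elapsed = int(time.monotonic() - start)
        print(f"[{elapsed // 3600:02d}:{(elapsed // 60) % 60:02d}:"
              f"{elapsed % 60:02d}] - {event}")

    log("Experiment started")
    while time.monotonic() - start < EXPERIMENT_DURATION_S:
        p99 = get_p99_latency_ms()
        log(f"P99 latency: {p99:.0f}ms")
        if p99 > ABORT_THRESHOLD_MS:
            log("Abort criteria met, executing rollback")
            rollback()
            return
        time.sleep(POLL_INTERVAL_S)
    rollback()
    log("Experiment ended")
```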
Observation Log Format:
```text
[HH:MM:SS] - [Metric/Event]: [Value/Description]

[00:00:00] - Experiment started: Injected 500ms latency to database connection
[00:00:15] - P99 latency: 450ms -> 650ms
[00:00:30] - Circuit breaker: OPEN on database connection pool
[00:01:00] - Retry queue depth: 0 -> 247
[00:01:30] - Auto-recovery initiated
[00:02:00] - P99 latency: 650ms -> 480ms
[00:02:30] - Circuit breaker: CLOSED
[00:03:00] - Experiment ended: Removed latency injection
```
Output: Timestamped observation log
### Phase 6: Analyze Results
Compare actual behavior against hypothesis.
Analysis Questions:
- Did system behavior stay within tolerance thresholds?
- Did resilience mechanisms activate as expected?
- What was the actual recovery time?
- Were there any unexpected cascading effects?
- Did monitoring and alerting work correctly?
Verdict Options:
| Verdict | Meaning | Action |
|---|---|---|
| VALIDATED | Hypothesis confirmed | Document and expand scope |
| INVALIDATED | Hypothesis falsified | File bugs, prioritize fixes |
| INCONCLUSIVE | Unable to determine | Refine experiment design |
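When the hypothesis was stated as a numeric threshold, the verdict can be derived mechanically from the observation log. A minimal sketch, with illustrative values and field names:

```python
"""Minimal sketch: derive a verdict from observed metrics vs the hypothesis.

Thresholds and sample values are illustrative; adapt them to the hypothesis
documented in Phase 3.
"""

def verdict(observations: list[float] | None, threshold: float) -> str:
    if not observations:
        return "INCONCLUSIVE"   # no usable data; refine the experiment design
    worst = max(observations)
    if worst <= threshold:
        return "VALIDATED"      # behavior stayed within tolerance
    return "INVALIDATED"        # hypothesis falsified; file bugs

# Example: P99 latency samples recorded during injection vs a 500ms prediction.
print(verdict([450.0, 650.0, 480.0], 500.0))  # -> INVALIDATED
```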
Finding Categories:
- Resilience Strengths: Mechanisms that worked as designed
- Weaknesses Discovered: Gaps in resilience that need fixing
- Monitoring Gaps: Missing visibility during incident
- Documentation Gaps: Runbooks or procedures that need updating
- Unexpected Behaviors: System responses not predicted
Output: Analysis document with prioritized action items
## Scripts
| Script | Purpose | Usage |
|---|---|---|
| `generate_experiment.py` | Create experiment document from inputs | `python scripts/generate_experiment.py --name "API Gateway Resilience"` |
| `validate_experiment.py` | Validate experiment document completeness | `python scripts/validate_experiment.py path/to/experiment.md` |
### Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General failure |
| 2 | Invalid arguments |
| 10 | Validation failure (missing required sections) |
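These exit codes make the validator easy to gate on in CI. A minimal sketch that runs `validate_experiment.py` and fails the build on an incomplete document; the experiment path is illustrative:

```python
"""Minimal sketch: run validate_experiment.py in CI and act on its exit code."""
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "scripts/validate_experiment.py",
     ".agents/chaos/2024-01-15-api-gateway-resilience.md"],  # illustrative path
)

if result.returncode == 0:
    print("Experiment document is complete")
elif result.returncode == 10:
    print("Validation failure: required sections are missing")
    sys.exit(1)
elif result.returncode == 2:
    print("Invalid arguments passed to the validator")
    sys.exit(1)
else:
    print("Validator failed for another reason")
    sys.exit(1)
```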
## Output Directory
Experiments are saved to `.agents/chaos/`:

```text
.agents/chaos/
  YYYY-MM-DD-experiment-name.md
  YYYY-MM-DD-experiment-name-results.md
```
## Anti-Patterns
| Avoid | Why | Instead |
|---|---|---|
| Testing in staging only | Production has different traffic patterns | Start small in production |
| No rollback plan | Cannot recover if things go wrong | Define rollback before starting |
| Vague hypothesis | Cannot determine success | Use quantifiable predictions |
| Measuring internal metrics only | Do not reflect customer experience | Focus on observable outputs |
| Big bang experiments | Blast radius too large | Start with smallest scope |
| No baseline | Cannot compare results | Collect 7+ days of metrics first |
| Skipping stakeholder buy-in | Creates political problems | Get approval before execution |
## Templates
### Experiment Document Template
Use templates/experiment-template.md or generate with:
```bash
python scripts/generate_experiment.py \
  --name "Database Failover Resilience" \
  --system "Payment Service" \
  --owner "Jane Smith" \
  --output .agents/chaos/
```
## Verification Checklist
Before executing any chaos experiment:
- Scope clearly defined with business justification
- Baseline metrics collected (minimum 7 days)
- Hypothesis is falsifiable with quantifiable criteria
- Injection plan includes specific tools and commands
- Blast radius is contained to acceptable scope
- Rollback procedure is documented and tested
- Stakeholders have approved the experiment
- On-call team is aware of timing
- Monitoring dashboards are ready
- Results template is prepared
## Extension Points
- Failure Categories: Add new failure types to Phase 4 table
- Tools Integration: Extend scripts to integrate with chaos-mesh, Gremlin, LitmusChaos
- Automation: Integrate with CI/CD for continuous chaos testing
- Metrics Sources: Add integrations for Prometheus, Datadog, New Relic (a Prometheus sketch follows this list)
- Scheduling: Add calendar integration for recurring game days
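As one example of a metrics-source integration, the sketch below pulls a P99 latency baseline from Prometheus's HTTP query API. The server URL, histogram metric name, and label set are assumptions about your environment, not defaults provided by this skill.

```python
"""Minimal sketch: fetch a P99 latency baseline from Prometheus.

The server URL and histogram metric name are assumptions; adjust both to
your environment.
"""
import requests  # third-party: pip install requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumption
QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
)

def fetch_p99_seconds() -> float:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        raise RuntimeError("query returned no series")
    return float(result[0]["value"][1])  # value is [timestamp, value]

if __name__ == "__main__":
    print(f"P99 latency baseline: {fetch_p99_seconds() * 1000:.0f}ms")
```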
## Related Resources
- Principles of Chaos Engineering
- Chaos Monkey (Netflix)
- Chaos Mesh (CNCF)
- LitmusChaos (CNCF)
- Gremlin (Commercial)
## Related Skills
| Skill | Relationship |
|---|---|
| security | Security review for production experiments |
| devops | CI/CD integration for automated chaos |
| qa | Test strategy alignment |
| analyst | Root cause analysis of findings |

