
incident

by htlin222

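incident is a practical skill in the "other" category. It strengthens your ability to handle complex problems and improves both work efficiency and the quality of results.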

⭐ 66 · 🍴 4 · 📅 January 23, 2026

SKILL.md


---
name: incident
description: Handle production incidents with urgency. Use when a production issue needs debugging, a fix, or a post-mortem.
---

Incident Response

Handle production incidents systematically.

When to Use

  • Production is down or degraded
  • Critical errors affecting users
  • Security incidents
  • Data issues
  • Performance emergencies

Incident Workflow

DETECT → TRIAGE → MITIGATE → RESOLVE → REVIEW

1. Detect & Triage

```bash
# Quick health checks
curl -s https://api.example.com/health | jq .
kubectl get pods -n production | grep -v Running

# Check recent deployments
git log --oneline -5
kubectl rollout history deployment/app -n production

# Error count in the application log
grep -c "ERROR" /var/log/app.log
```
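A raw count from `grep -c` is hard to judge without the total log volume. A minimal sketch that reports errors as a percentage of log lines instead, assuming one log entry per line and a literal `ERROR` token (both assumptions about your log format); `error_rate` is a hypothetical helper name:

```bash
# error_rate FILE: print the share of lines in FILE that contain "ERROR",
# as a percentage with two decimals. Assumes one log entry per line.
error_rate() {
  awk '/ERROR/ { errors++ }
       END {
         if (NR == 0) { print "0.00"; exit }   # empty log: avoid dividing by zero
         printf "%.2f\n", 100 * errors / NR
       }' "$1"
}
```

Usage: `error_rate /var/log/app.log`. Comparing the result against the rate from before the incident started tells you more than the absolute count.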

2. Mitigate First

Priority: stop the bleeding before looking for the root cause.

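```bash
# Roll back the deployment
kubectl rollout undo deployment/app -n production
# Wait until the rollback has finished
kubectl rollout status deployment/app -n production

# Scale up if overloaded
kubectl scale deployment/app -n production --replicas=10

# Disable the feature flag (endpoint and payload depend on your flag service)
curl -X POST https://api.example.com/admin/flags \
  -H "Content-Type: application/json" \
  -d '{"feature": false}'

# Circuit breaker
# Block the problematic endpoint or dependency
```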
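The circuit-breaker step can also be applied at the call site. A minimal POSIX shell sketch of the pattern (closed, then open after repeated failures, then half-open once a cooldown has passed) wrapping any command; `cb_call`, the state-file path, and both thresholds are hypothetical, and production services usually get this from a client library or service mesh instead:

```bash
# Hypothetical defaults; override through the environment.
CB_STATE_FILE=${CB_STATE_FILE:-/tmp/circuit_breaker.state}
FAILURE_THRESHOLD=${FAILURE_THRESHOLD:-3}
COOLDOWN_SECONDS=${COOLDOWN_SECONDS:-60}

# cb_call CMD [ARGS...]: run CMD through the breaker.
# Returns 0 on success, 1 when CMD fails, 2 when the breaker is open.
cb_call() {
  local failures=0 opened_at=0 now
  now=$(date +%s)
  # State file holds "<consecutive failures> <epoch when the breaker opened>".
  if [ -s "$CB_STATE_FILE" ]; then
    read -r failures opened_at < "$CB_STATE_FILE"
  fi
  # Open: reject calls until the cooldown has elapsed.
  if [ "$failures" -ge "$FAILURE_THRESHOLD" ] &&
     [ $((now - opened_at)) -lt "$COOLDOWN_SECONDS" ]; then
    echo "circuit open: skipping $1" >&2
    return 2
  fi
  # Closed, or half-open after the cooldown: let the call through.
  if "$@"; then
    echo "0 0" > "$CB_STATE_FILE"   # success closes the breaker again
    return 0
  fi
  failures=$((failures + 1))
  # Reaching the threshold, or failing again while half-open, (re)opens it.
  [ "$failures" -ge "$FAILURE_THRESHOLD" ] && opened_at=$now
  echo "$failures $opened_at" > "$CB_STATE_FILE"
  return 1
}
```

Usage: `cb_call curl -sf https://api.example.com/health`. Keeping state in a file lets separate cron or script invocations share one breaker, at the cost of no locking between concurrent callers.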

3. Investigate

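```bash
# Recent logs (with a label selector, kubectl shows only 10 lines per pod unless --tail is set)
kubectl logs -n production -l app=myapp --since=30m --tail=-1 | grep -i error

# Resource usage
kubectl top pods -n production

# Database connections (PostgreSQL)
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# Network issues: timing breakdown of a single request
curl -s -o /dev/null \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  https://api.example.com
```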
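A single `curl` timing can be noise. A sketch that samples the endpoint several times and reports a nearest-rank percentile; `sample_latency` and `percentile` are hypothetical helpers, and the default of 10 samples is an arbitrary choice:

```bash
# percentile P: read one number per line on stdin and print the
# nearest-rank P-th percentile (0 <= P <= 100).
percentile() {
  LC_ALL=C sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1          # no samples
      x = p * NR / 100
      rank = int(x)
      if (rank < x) rank++         # ceil(p/100 * N)
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# sample_latency URL [N]: print curl total time in seconds for N sequential requests.
sample_latency() {
  local i=0
  while [ "$i" -lt "${2:-10}" ]; do
    curl -s -o /dev/null -w '%{time_total}\n' "$1"
    i=$((i + 1))
  done
}
```

Usage: `sample_latency https://api.example.com 20 | percentile 95`. A high p95 with a normal median points at intermittent slowness (a bad pod or dependency) rather than a uniformly overloaded service.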

Severity Levels

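| Level | Impact               | Response Time | Example          |
| ----- | -------------------- | ------------- | ---------------- |
| P1    | Complete outage      | Immediate     | Site down        |
| P2    | Major feature broken | 15 min        | Payments failing |
| P3    | Minor feature broken | 1 hour        | Search slow      |
| P4    | Low impact           | Next day      | UI glitch        |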
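The response-time column can be checked mechanically, for example to decide when to escalate. A sketch under two stated assumptions: "Immediate" is treated as 0 minutes and "Next day" as 24 hours; `response_minutes` and `response_overdue` are hypothetical names:

```bash
# response_minutes LEVEL: response-time target from the severity table, in minutes.
response_minutes() {
  case $1 in
    P1) echo 0 ;;      # "Immediate" (assumed 0 minutes)
    P2) echo 15 ;;
    P3) echo 60 ;;
    P4) echo 1440 ;;   # "Next day" (assumed 24 hours)
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# response_overdue LEVEL ELAPSED_MINUTES: exit 0 when the target has been missed,
# 1 when still within target, 2 for an unknown level.
response_overdue() {
  local target
  target=$(response_minutes "$1") || return 2
  [ "$2" -gt "$target" ]
}
```

Usage: `response_overdue P2 20 && echo "escalate"`.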

Communication Template

```markdown
## Incident Update

**Status:** Investigating | Identified | Mitigated | Resolved
**Severity:** P1/P2/P3
**Started:** YYYY-MM-DD HH:MM UTC
**Duration:** X hours

### Summary

[1-2 sentences on what's happening]

### Impact

[Who is affected and how]

### Current Actions

- [Action 1]
- [Action 2]

### Next Update

[Time of next update]
```
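The **Started** and **Duration** fields are easy to get wrong under pressure. A sketch that fills the header lines of the update from a start time given as a Unix epoch; `incident_header` and its helpers are hypothetical, and the `date` fallback is meant to cover both GNU and BSD/macOS `date`:

```bash
# format_duration SECONDS: render elapsed time as "Hh MMm".
format_duration() {
  printf '%dh %02dm\n' $(($1 / 3600)) $(($1 % 3600 / 60))
}

# utc_time EPOCH: "YYYY-MM-DD HH:MM" in UTC (GNU date first, then BSD date).
utc_time() {
  date -u -d "@$1" '+%Y-%m-%d %H:%M' 2>/dev/null ||
    date -u -r "$1" '+%Y-%m-%d %H:%M'
}

# incident_header STATUS SEVERITY START_EPOCH [NOW_EPOCH]: print the header lines.
incident_header() {
  local now
  now=${4:-$(date +%s)}
  printf '**Status:** %s\n' "$1"
  printf '**Severity:** %s\n' "$2"
  printf '**Started:** %s UTC\n' "$(utc_time "$3")"
  printf '**Duration:** %s\n' "$(format_duration $((now - $3)))"
}
```

Usage: `incident_header Investigating P1 "$START_EPOCH"`, where `START_EPOCH` is a hypothetical variable holding the detection time.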

Post-Mortem Template

```markdown
## Incident Post-Mortem

**Date:** YYYY-MM-DD
**Duration:** X hours
**Severity:** P1

### Summary

[What happened in 2-3 sentences]

### Timeline

- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause

[Technical explanation]

### Impact

- Users affected: X
- Revenue impact: $Y
- Data loss: None/Describe

### Action Items

| Action                  | Owner | Due Date   |
| ----------------------- | ----- | ---------- |
| Add monitoring for X    | @name | YYYY-MM-DD |
| Improve circuit breaker | @name | YYYY-MM-DD |

### Lessons Learned

- [What we learned]
```
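Relative offsets in the **Timeline** make time-to-mitigate and time-to-resolve easy to read off. A sketch that reads lines in the template's `- HH:MM - [Event]` format and prints minutes since the first entry; it assumes all timestamps share one timezone and that an earlier clock time means the incident crossed midnight (so it only handles incidents shorter than 24 hours). `timeline_offsets` is a hypothetical name:

```bash
# timeline_offsets: read "- HH:MM - Event" lines on stdin (other lines are ignored)
# and print each event as "+<minutes>min Event", relative to the first entry.
timeline_offsets() {
  awk '
    /^- [0-9][0-9]:[0-9][0-9] - / {
      split($2, hm, ":")
      t = hm[1] * 60 + hm[2]            # minutes since midnight
      if (!seen) { first = t; seen = 1 }
      off = t - first
      if (off < 0) off += 24 * 60       # assume the clock crossed midnight
      printf "+%dmin %s\n", off, substr($0, 11)
    }'
}
```

Because non-matching lines are skipped, a whole post-mortem file can be piped in: `timeline_offsets < postmortem.md`.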

Examples

Input: "API is returning 500 errors"
Action: Check the logs, identify the failing component, roll back if there was a recent deploy, then fix.

Input: "Database is overloaded"
Action: Kill long-running queries, scale read replicas, and optimize or cache hot queries.
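For the "kill long-running queries" action, assuming PostgreSQL (as in the `pg_stat_activity` check under Investigate), a sketch that only generates the SQL so it can be reviewed before running; `long_queries_sql` is hypothetical. `pg_cancel_backend` cancels the running query but keeps the session, while `pg_terminate_backend` closes the whole connection, so listing first and cancelling before terminating is the gentler order:

```bash
# long_queries_sql MINUTES ACTION: print SQL for active queries running longer
# than MINUTES. ACTION is list, cancel, or terminate.
long_queries_sql() {
  local minutes="$1" action_col
  # Only digits are accepted, which also keeps input out of the SQL literal.
  case $minutes in
    ''|*[!0-9]*) echo "MINUTES must be a non-negative integer" >&2; return 2 ;;
  esac
  case $2 in
    list)      action_col="" ;;
    cancel)    action_col=", pg_cancel_backend(pid) AS cancelled" ;;
    terminate) action_col=", pg_terminate_backend(pid) AS terminated" ;;
    *) echo "ACTION must be list, cancel, or terminate" >&2; return 2 ;;
  esac
  cat <<EOF
SELECT pid, now() - query_start AS runtime, left(query, 80) AS query$action_col
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND now() - query_start > interval '$minutes minutes';
EOF
}
```

Usage: `long_queries_sql 5 list | psql -X -v ON_ERROR_STOP=1`, then rerun with `cancel` (or `terminate` as a last resort) once the list looks right.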

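Score

Overall score: 55/100

Based on repository quality indicators.

| Criterion        | Requirement                          | Score |
| ---------------- | ------------------------------------ | ----- |
| SKILL.md         | Includes a SKILL.md file             | +20   |
| LICENSE          | A license is set                     | 0/10  |
| Description      | Description of 100+ characters       | 0/10  |
| Popularity       | 100+ GitHub stars                    | 0/15  |
| Recent activity  | Updated within the last month        | +10   |
| Forks            | Forked 10+ times                     | 0/5   |
| Issue management | Fewer than 50 open issues            | +5    |
| Language         | A programming language is set        | +5    |
| Tags             | At least one tag is set              | +5    |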
