incident-response

by 5dlabs

Cognitive Task Orchestrator - GitOps on Bare Metal or Cloud for AI Agents

2🍴 1📅 January 24, 2026

SKILL.md


name: incident-response
description: Incident response and remediation patterns including observability, diagnosis, and targeted fixes.
agents: [rex, grizz, nova, blaze, tap, spark, bolt, morgan]
triggers: [healer, incident, alert, production issue, remediation, diagnosis]

Incident Response and Remediation

Patterns for diagnosing and fixing production issues.

Healer Mode Workflow

  1. Investigate - Gather metrics, logs, and system state
  2. Diagnose - Identify root cause before fixing
  3. Fix - Implement minimal targeted fix
  4. Validate - Confirm metrics improve after deployment
  5. Document - Store learnings for future incidents

Tool Usage Priority

  1. Observability Tools - Query Prometheus, Loki, Grafana for metrics and logs
  2. Kubernetes Tools - Check pod status, events, deployments
  3. ArgoCD Tools - Verify GitOps sync status
  4. Memory Search - Look for similar past incidents
  5. Code Fix - Implement minimal targeted fix

Observability Queries

Prometheus Metrics

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ sum(rate(http_requests_total[5m]))

# Latency P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

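# CPU usage (cores) per pod; container!="" drops the pod-level cgroup series so usage is not counted twice
sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*", container!=""}[5m])) by (pod)

# Memory usage (working set bytes) per pod
sum(container_memory_working_set_bytes{pod=~"app-.*", container!=""}) by (pod)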
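
To run these queries outside Grafana, one option is the Prometheus HTTP API. The sketch below assumes a server reachable at `PROM_URL` (falling back to `http://localhost:9090`). `prom_query` and `ratio_exceeds` are hypothetical helper names, not part of any tool.

```shell
# Hypothetical helper: run an instant PromQL query via the Prometheus HTTP API.
# PROM_URL is an assumption; point it at your own Prometheus server.
prom_query() {
  curl -sS -G "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Hypothetical helper: exit 0 when a ratio ($1) is above a threshold ($2).
# awk handles the floating-point comparison that POSIX sh lacks.
ratio_exceeds() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

# Example (prints the raw JSON response):
# prom_query 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
```

The response is JSON, so extracting the numeric value is left to whatever JSON tooling is already available in the environment.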

Loki Log Queries

# Errors (set the time range, e.g. the last hour, in Grafana or the query client)
{namespace="production", pod=~"app-.*"} |= "error" | json | level="error"

# Stack traces (a regex line filter matches either string)
{namespace="production"} |~ "panic|stack trace"

# Slow requests
{namespace="production"} | json | latency_ms > 1000
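
A LogQL log query carries no time window of its own; the client supplies it. As a minimal sketch, assuming `logcli` is installed and `LOKI_ADDR` points at the Loki instance, the error query can be run over the last hour like this. `loki_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: fetch error-level lines for a namespace and pod regex.
# Assumes logcli is installed and LOKI_ADDR points at your Loki instance.
loki_errors() {
  # $1 = namespace, $2 = pod name regex, $3 = lookback window (default 1h)
  logcli query --since="${3:-1h}" --limit=200 \
    "{namespace=\"$1\", pod=~\"$2\"} |= \"error\" | json | level=\"error\""
}

# Example: loki_errors production 'app-.*' 1h
```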

Kubernetes Diagnostics

# Pod status and events
kubectl get pods -n production -l app=myapp
kubectl describe pod <pod-name> -n production
kubectl get events -n production --sort-by='.lastTimestamp'

# Logs
kubectl logs -n production -l app=myapp --tail=100
kubectl logs -n production <pod-name> --previous  # Previous container

# Resource usage
kubectl top pods -n production
kubectl top nodes

# Deployment status
kubectl rollout status deployment/myapp -n production
kubectl rollout history deployment/myapp -n production
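
During an incident it can help to capture these outputs once, before pods restart and events age out. The following is a sketch only: `collect_diagnostics` is a hypothetical helper, and it assumes the Deployment name matches the `app` label value, as in the commands above.

```shell
# Hypothetical helper: save a point-in-time diagnostic snapshot to a directory.
# Assumes the Deployment is named after the `app` label value.
collect_diagnostics() {
  # $1 = namespace, $2 = app label value, $3 = output directory
  diag_ns=$1 diag_app=$2 diag_out=$3
  mkdir -p "$diag_out" || return 1
  kubectl get pods -n "$diag_ns" -l "app=$diag_app" -o wide > "$diag_out/pods.txt"
  kubectl get events -n "$diag_ns" --sort-by='.lastTimestamp' > "$diag_out/events.txt"
  kubectl logs -n "$diag_ns" -l "app=$diag_app" --tail=100 > "$diag_out/logs.txt"
  kubectl rollout history "deployment/$diag_app" -n "$diag_ns" > "$diag_out/rollout-history.txt"
}

# Example: collect_diagnostics production myapp ./incident-snapshot
```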

ArgoCD Status

# Application status
argocd app get myapp
argocd app diff myapp

# Sync status
argocd app sync myapp --dry-run

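# Rollback (the argument is a history ID from `argocd app history`, not a Git revision)
argocd app history myapp
argocd app rollback myapp <history-id>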

Common Issues and Solutions

High Error Rate

  1. Check recent deployments
  2. Review error logs for patterns
  3. Check dependency health
  4. Verify configuration changes

High Latency

  1. Check database query performance
  2. Review external service latency
  3. Check resource constraints (CPU/memory)
  4. Look for lock contention

OOMKilled Pods

  1. Increase memory limits
  2. Check for memory leaks
  3. Review recent code changes
  4. Consider horizontal scaling
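
Before raising limits, it is worth confirming that the container really was terminated by the OOM killer. Kubernetes records this in each container's `lastState.terminated.reason`. The sketch below reads that field; both function names are hypothetical.

```shell
# Hypothetical helper: print "<container><TAB><last termination reason>" per container.
last_termination_reasons() {
  # $1 = namespace, $2 = pod name
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Succeeds when any container in the pod was last terminated as OOMKilled.
was_oomkilled() {
  last_termination_reasons "$1" "$2" | grep -q 'OOMKilled'
}

# Example: was_oomkilled production <pod-name> && echo "confirmed OOMKilled"
```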

CrashLoopBackOff

  1. Check logs for startup errors
  2. Verify secrets and configs exist
  3. Check health check endpoints
  4. Review recent deployments
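
For step 2, a quick existence check can rule out a missing Secret or ConfigMap. This is a minimal sketch: `missing_objects` is a hypothetical helper, and the object names in the example are placeholders.

```shell
# Hypothetical helper: report which named objects of one kind are absent.
missing_objects() {
  # $1 = namespace, $2 = kind (secret or configmap), remaining args = object names
  mo_ns=$1 mo_kind=$2
  shift 2
  for mo_name in "$@"; do
    kubectl get "$mo_kind" "$mo_name" -n "$mo_ns" >/dev/null 2>&1 \
      || echo "missing $mo_kind/$mo_name"
  done
}

# Example (object names are placeholders):
# missing_objects production secret myapp-db-credentials myapp-tls
```

No output means every listed object was found. A failed lookup can also mean missing permissions, so treat a "missing" line as a prompt to look closer.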

ImagePullBackOff

  1. Verify image exists in registry
  2. Check image pull secrets
  3. Verify image tag is correct
  4. Check registry connectivity
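
Steps 1 and 3 can be checked together by asking the registry for each image's manifest. The sketch below assumes the `docker` CLI is available and already logged in to any private registry. `check_pod_images` is a hypothetical helper, and it only looks at regular containers, not init containers.

```shell
# Hypothetical helper: check that each image a pod references resolves in its registry.
# Assumes the docker CLI is installed and authenticated where needed.
check_pod_images() {
  # $1 = namespace, $2 = pod name
  for cpi_image in $(kubectl get pod "$2" -n "$1" -o jsonpath='{.spec.containers[*].image}'); do
    if docker manifest inspect "$cpi_image" >/dev/null 2>&1; then
      echo "ok $cpi_image"
    else
      echo "unavailable $cpi_image"
    fi
  done
}

# Example: check_pod_images production <pod-name>
```

An "unavailable" result does not distinguish a wrong tag from an authentication or network failure, so follow up with the pull-secret and connectivity checks.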

Healing Guidelines

  • Diagnose first - Understand the root cause before fixing
  • Minimal changes - Fix only what's broken
  • Document findings - Store learnings in memory for future incidents
  • Validate fix - Confirm metrics improve after deployment
  • Roll back if needed - Don't hesitate to roll back if the fix doesn't work
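
The last two guidelines can be combined into one guarded step. The sketch below waits for an Argo CD application to report synced and healthy, and rolls back otherwise. `validate_or_rollback` is a hypothetical helper, and the 300-second default timeout is an arbitrary assumption.

```shell
# Hypothetical helper: wait for an app to become synced and healthy, else roll back.
# $1 = app name, $2 = history ID to return to (from `argocd app history`),
# $3 = timeout in seconds (default 300, an arbitrary choice)
validate_or_rollback() {
  if argocd app wait "$1" --sync --health --timeout "${3:-300}"; then
    echo "$1 is synced and healthy"
  else
    echo "$1 did not recover; rolling back to history ID $2" >&2
    argocd app rollback "$1" "$2"
  fi
}

# Example: validate_or_rollback myapp 3
```

Argo CD health is only a proxy, so the error-rate and latency metrics still need checking after the wait succeeds. Note also that Argo CD refuses `rollback` for applications with automated sync enabled.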

Post-Incident

  1. Update metrics/alerts if needed
  2. Document root cause and fix
  3. Store learnings in memory for similar incidents
  4. Consider preventive measures
  5. Update runbooks if applicable
