
incident-response
by 5dlabs
Cognitive Task Orchestrator - GitOps on Bare Metal or Cloud for AI Agents
⭐ 2🍴 1📅 Jan 24, 2026
SKILL.md
name: incident-response
description: Incident response and remediation patterns including observability, diagnosis, and targeted fixes.
agents: [rex, grizz, nova, blaze, tap, spark, bolt, morgan]
triggers: [healer, incident, alert, production issue, remediation, diagnosis]
Incident Response and Remediation
Patterns for diagnosing and fixing production issues.
Healer Mode Workflow
- Investigate - Gather metrics, logs, and system state
- Diagnose - Identify root cause before fixing
- Fix - Implement minimal targeted fix
- Validate - Confirm metrics improve after deployment
- Document - Store learnings for future incidents
Tool Usage Priority
- Observability Tools - Query Prometheus, Loki, Grafana for metrics and logs
- Kubernetes Tools - Check pod status, events, deployments
- ArgoCD Tools - Verify GitOps sync status
- Memory Search - Look for similar past incidents
- Code Fix - Implement minimal targeted fix
Observability Queries
Prometheus Metrics
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# Latency P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# CPU usage
sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*"}[5m])) by (pod)
# Memory usage
container_memory_working_set_bytes{pod=~"app-.*"}
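The queries above can also be run outside Grafana through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, not a definitive client: the `PROMETHEUS_URL` address and the helper names `build_query_url`, `first_scalar`, and `query_prometheus` are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable at this in-cluster address; adjust as needed.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"

# Same error-rate expression as in the section above.
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def first_scalar(response):
    """Return the first sample value of an instant-query response, or None if empty."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    result = response["data"]["result"]
    if not result:
        return None
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def query_prometheus(promql, base_url=PROMETHEUS_URL, timeout=10):
    """Run an instant query and return the decoded JSON body (needs network access)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)
```

With network access to Prometheus, `first_scalar(query_prometheus(ERROR_RATE_QUERY))` would return the current 5xx ratio as a float, or `None` when no series match.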
Loki Log Queries
# Error-level log lines (the window, e.g. the last hour, comes from the query's time range, not the expression)
{namespace="production", pod=~"app-.*"} |= "error" | json | level="error"
# Stack traces (regex line filter matching either phrase)
{namespace="production"} |~ "panic|stack trace"
# Slow requests
{namespace="production"} | json | latency_ms > 1000
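The LogQL expressions above carry no time window of their own; the window is supplied with the request. The sketch below shows one way to build a one-hour `query_range` request for the Loki HTTP API and flatten the returned streams. The `LOKI_URL` address and the helper names are assumptions for illustration only.

```python
import time
import urllib.parse

# Assumption: Loki is reachable at this in-cluster address; adjust as needed.
LOKI_URL = "http://loki.monitoring.svc:3100"


def build_range_url(logql, base_url=LOKI_URL, lookback_seconds=3600, limit=100, now=None):
    """Return a /loki/api/v1/query_range URL covering the last `lookback_seconds`."""
    current = time.time() if now is None else now
    end_ns = int(current * 1_000_000_000)  # Loki accepts nanosecond Unix timestamps
    start_ns = end_ns - lookback_seconds * 1_000_000_000
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?" + urllib.parse.urlencode(params)


def flatten_streams(response):
    """Merge all streams of a log query response into one list sorted by timestamp.

    Each entry is (timestamp_ns, stream_labels, log_line).
    """
    if response["data"]["resultType"] != "streams":
        raise ValueError("expected a log query result (resultType 'streams')")
    entries = []
    for stream in response["data"]["result"]:
        for ts_ns, line in stream["values"]:
            entries.append((int(ts_ns), stream["stream"], line))
    return sorted(entries, key=lambda entry: entry[0])
```

Sorting across streams gives a single timeline, which tends to make it easier to see which pod logged the first error.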
Kubernetes Diagnostics
# Pod status and events
kubectl get pods -n production -l app=myapp
kubectl describe pod <pod-name> -n production
kubectl get events -n production --sort-by='.lastTimestamp'
# Logs
kubectl logs -n production -l app=myapp --tail=100
kubectl logs -n production <pod-name> --previous # Previous container
# Resource usage
kubectl top pods -n production
kubectl top nodes
# Deployment status
kubectl rollout status deployment/myapp -n production
kubectl rollout history deployment/myapp -n production
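When many pods are involved, the `kubectl get pods` output can be summarised programmatically. This is a minimal sketch that assumes `kubectl` is on the `PATH` with access to the cluster; `get_pods` and `summarise_pods` are illustrative names, not part of any tool.

```python
import json
import subprocess


def get_pods(namespace, selector):
    """Run `kubectl get pods -o json` and return the parsed pod list (needs cluster access)."""
    completed = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)


def summarise_pods(pod_list):
    """Return one dict per pod with phase, readiness, and total container restarts."""
    summary = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        summary.append({
            "name": pod["metadata"]["name"],
            "phase": status.get("phase", "Unknown"),
            # A pod with no reported container statuses is treated as not ready.
            "ready": bool(containers) and all(c.get("ready", False) for c in containers),
            "restarts": sum(c.get("restartCount", 0) for c in containers),
        })
    # Most-restarted pods first, as likely candidates for `kubectl describe`.
    return sorted(summary, key=lambda p: p["restarts"], reverse=True)
```

For example, `summarise_pods(get_pods("production", "app=myapp"))` would list the pods selected in the commands above, with the most frequently restarted pod first.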
ArgoCD Status
# Application status
argocd app get myapp
argocd app diff myapp
# Sync status
argocd app sync myapp --dry-run
# Rollback (the ID is a history ID from `argocd app history`, not a Git revision)
argocd app history myapp
argocd app rollback myapp <history-id>
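For automated checks, the sync and health state can be read from the JSON form of `argocd app get`. The sketch below assumes the `argocd` CLI is installed and logged in; `get_app`, `app_state`, and `needs_attention` are hypothetical helper names.

```python
import json
import subprocess


def get_app(name):
    """Run `argocd app get <name> -o json` and return the parsed Application (needs CLI login)."""
    completed = subprocess.run(
        ["argocd", "app", "get", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)


def app_state(app):
    """Extract sync status, health status, and synced revision from an Application object."""
    status = app.get("status", {})
    sync = status.get("sync", {})
    return {
        "sync": sync.get("status", "Unknown"),
        "health": status.get("health", {}).get("status", "Unknown"),
        "revision": sync.get("revision"),
    }


def needs_attention(state):
    """True when the app is not both Synced and Healthy."""
    return state["sync"] != "Synced" or state["health"] != "Healthy"
```

An app that is `OutOfSync` but `Healthy` often points to drift or a pending change rather than an outage, so the two fields are kept separate instead of being collapsed into one flag.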
Common Issues and Solutions
High Error Rate
- Check recent deployments
- Review error logs for patterns
- Check dependency health
- Verify configuration changes
High Latency
- Check database query performance
- Review external service latency
- Check resource constraints (CPU/memory)
- Look for lock contention
OOMKilled Pods
- Increase memory limits
- Check for memory leaks
- Review recent code changes
- Consider horizontal scaling
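Before raising limits, it helps to confirm which containers were actually OOM-killed and what limit they ran with. The following sketch inspects the pod list JSON from `kubectl get pods -o json`; the function name and output shape are assumptions for illustration.

```python
def oom_killed_containers(pod_list):
    """Return containers whose current or previous termination reason is OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        # Map container name -> configured memory limit (None if no limit is set).
        limits = {
            c["name"]: c.get("resources", {}).get("limits", {}).get("memory")
            for c in pod.get("spec", {}).get("containers", [])
        }
        for status in pod.get("status", {}).get("containerStatuses", []):
            for key in ("state", "lastState"):
                terminated = status.get(key, {}).get("terminated") or {}
                if terminated.get("reason") == "OOMKilled":
                    hits.append({
                        "pod": pod["metadata"]["name"],
                        "container": status["name"],
                        "restarts": status.get("restartCount", 0),
                        "memory_limit": limits.get(status["name"]),
                    })
                    break  # report each container once
    return hits
```

A container that keeps hitting the same limit after an increase is a hint to look for a leak rather than raise the limit again.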
CrashLoopBackOff
- Check logs for startup errors
- Verify secrets and configs exist
- Check health check endpoints
- Review recent deployments
ImagePullBackOff
- Verify image exists in registry
- Check image pull secrets
- Verify image tag is correct
- Check registry connectivity
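The CrashLoopBackOff and ImagePullBackOff cases above both appear as a container `waiting` reason in the pod status, so they can be detected together. This is a sketch under that assumption; the `WAITING_HINTS` table and its wording are illustrative and simply restate the checklists above.

```python
# Waiting reasons worth flagging, each mapped to the first check from the lists above.
WAITING_HINTS = {
    "CrashLoopBackOff": "check `kubectl logs --previous` for startup errors",
    "ImagePullBackOff": "verify image name, tag, and pull secrets",
    "ErrImagePull": "verify image name, tag, and pull secrets",
    "CreateContainerConfigError": "verify referenced secrets and configmaps exist",
}


def waiting_containers(pod_list):
    """Return init and app containers stuck in one of the known waiting reasons."""
    found = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
        for container in statuses:
            waiting = container.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason")
            if reason in WAITING_HINTS:
                found.append({
                    "pod": pod["metadata"]["name"],
                    "container": container["name"],
                    "reason": reason,
                    "message": waiting.get("message", ""),
                    "hint": WAITING_HINTS[reason],
                })
    return found
```

Init containers are included because an image pull failure there keeps the main containers in `PodInitializing`, which can hide the real cause.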
Healing Guidelines
- Diagnose first - Understand the root cause before fixing
- Minimal changes - Fix only what's broken
- Document findings - Store learnings in memory for future incidents
- Validate fix - Confirm metrics improve after deployment
- Roll back if needed - Don't hesitate to roll back if the fix doesn't work
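The "validate" and "roll back" guidelines can be expressed as a small decision rule over a metric sampled before and after the fix. The sketch below is one possible rule, not a prescribed policy: the function name, the verdict labels, and the idea of a single `target` threshold are all assumptions.

```python
import math


def validate_fix(before, after, target):
    """Classify a fix from a lower-is-better metric (e.g. error rate or P99 latency).

    before: value observed during the incident
    after:  value observed after the fix was deployed
    target: value at or below which the incident counts as resolved
    """
    if math.isnan(after):
        return "no-data"    # metric missing; do not assume success
    if after <= target:
        return "resolved"
    if after < before:
        return "improving"  # better, but keep watching
    return "roll-back"      # no improvement; revert the change
```

For example, with a 1% error-rate target, `validate_fix(0.08, 0.004, 0.01)` returns `"resolved"`, while `validate_fix(0.08, 0.09, 0.01)` returns `"roll-back"`.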
Post-Incident
- Update metrics/alerts if needed
- Document root cause and fix
- Store learnings in memory for similar incidents
- Consider preventive measures
- Update runbooks if applicable
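To make stored learnings searchable for similar incidents, it can help to record them in a fixed shape. The sketch below shows one hypothetical record format; the `IncidentRecord` fields are assumptions, and how the JSON is written to the memory store is left out because that interface is not described here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class IncidentRecord:
    """One post-incident entry: what was seen, why it happened, and what fixed it."""
    title: str
    symptoms: list
    root_cause: str
    fix: str
    preventive_measures: list = field(default_factory=list)
    tags: list = field(default_factory=list)

    def to_json(self):
        """Serialise with stable key order so records diff cleanly."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

A record might then be created as `IncidentRecord(title="5xx spike after deploy", symptoms=["error rate above target"], root_cause="missing config key", fix="restored key", tags=["high-error-rate"])` and its `to_json()` output stored alongside the runbook update.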


