incident-response

Name: incident-response
Rating: 65
Author: 5dlabs

Cognitive Task Orchestrator - GitOps on Bare Metal or Cloud for AI Agents

⭐ 2 🍴 1 📅 Jan 24, 2026

ai-agents ai-powered-development autonomous-development code-generation devops-automation github-automation kubernetes-operator mcp-protocol

SKILL.md

name: incident-response
description: Incident response and remediation patterns including observability, diagnosis, and targeted fixes.
agents: [rex, grizz, nova, blaze, tap, spark, bolt, morgan]
triggers: [healer, incident, alert, production issue, remediation, diagnosis]

Incident Response and Remediation

Patterns for diagnosing and fixing production issues.

Healer Mode Workflow

Investigate - Gather metrics, logs, and system state
Diagnose - Identify root cause before fixing
Fix - Implement minimal targeted fix
Validate - Confirm metrics improve after deployment
Document - Store learnings for future incidents

Tool Usage Priority

Observability Tools - Query Prometheus, Loki, Grafana for metrics and logs
Kubernetes Tools - Check pod status, events, deployments
ArgoCD Tools - Verify GitOps sync status
Memory Search - Look for similar past incidents
Code Fix - Implement minimal targeted fix

Observability Queries

Prometheus Metrics

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ sum(rate(http_requests_total[5m]))

# Latency P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# CPU usage
sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*"}[5m])) by (pod)

# Memory usage
container_memory_working_set_bytes{pod=~"app-.*"}
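
The queries above can also be run programmatically through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, not part of the skill itself: the base URL `http://prometheus:9090` and the helper names are assumptions for illustration.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: replace with the address of your Prometheus server.
PROMETHEUS_URL = "http://prometheus:9090"

ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m])) '
    "/ sum(rate(http_requests_total[5m]))"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def first_scalar(response: dict) -> Optional[float]:
    """Return the first sample value of an instant-query response, if any."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    results = response["data"]["result"]
    if not results:
        return None  # no series matched, e.g. no traffic in the window
    # Each vector sample is [timestamp, "value-as-string"].
    return float(results[0]["value"][1])


def query_prometheus(promql: str, base_url: str = PROMETHEUS_URL) -> Optional[float]:
    """Run an instant query against a live server (needs network access)."""
    with urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return first_scalar(json.load(resp))
```

Calling `query_prometheus(ERROR_RATE_QUERY)` would return the current 5xx ratio as a float, or `None` when no series match.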

Loki Log Queries

# Errors (the time window, e.g. the last hour, is set on the query range, not in LogQL)
{namespace="production", pod=~"app-.*"} |= "error" | json | level="error"

# Panics and stack traces (regex line filter matching either phrase)
{namespace="production"} |~ "panic|stack trace"

# Slow requests
{namespace="production"} | json | latency_ms > 1000
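
A LogQL query carries no time window of its own; the window is supplied by the caller. The sketch below shows one way to request the last hour through Loki's `query_range` HTTP endpoint. The base URL `http://loki:3100` and the helper names are assumptions for illustration.

```python
import time
from typing import List, Optional
from urllib.parse import urlencode

# Assumption: replace with the address of your Loki instance.
LOKI_URL = "http://loki:3100"

NANOS_PER_HOUR = 3600 * 1_000_000_000


def build_range_url(
    base_url: str,
    logql: str,
    hours: float = 1.0,
    limit: int = 100,
    now_ns: Optional[int] = None,
) -> str:
    """Build a query_range URL covering the last `hours` hours."""
    end_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = end_ns - int(hours * NANOS_PER_HOUR)
    # start and end are given as Unix epoch nanoseconds.
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


def extract_lines(response: dict) -> List[str]:
    """Flatten a streams-type response into a list of raw log lines."""
    lines = []
    for stream in response["data"]["result"]:
        # Each entry in "values" is [timestamp_ns_as_string, log_line].
        for _timestamp, line in stream["values"]:
            lines.append(line)
    return lines
```

Fetching the URL with any HTTP client and passing the decoded JSON to `extract_lines` yields the matching log lines.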

Kubernetes Diagnostics

# Pod status and events
kubectl get pods -n production -l app=myapp
kubectl describe pod <pod-name> -n production
kubectl get events -n production --sort-by='.lastTimestamp'

# Logs
kubectl logs -n production -l app=myapp --tail=100
kubectl logs -n production <pod-name> --previous  # Previous container

# Resource usage
kubectl top pods -n production
kubectl top nodes

# Deployment status
kubectl rollout status deployment/myapp -n production
kubectl rollout history deployment/myapp -n production
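
When many pods are involved, the same information can be filtered from `kubectl get pods -o json`. This is a rough sketch under the assumption that `kubectl` is on the PATH and configured for the cluster; the function names are hypothetical. It reports each container's waiting reason (such as `CrashLoopBackOff` or `ImagePullBackOff`) and flags containers whose previous run ended with `OOMKilled`.

```python
import json
import subprocess
from typing import Dict, List


def find_problem_pods(pod_list: dict) -> Dict[str, List[str]]:
    """Map pod name -> waiting reasons and OOM kills found in its containers."""
    problems: Dict[str, List[str]] = {}
    for pod in pod_list.get("items", []):
        reasons: List[str] = []
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # A waiting container reports why it is not running.
            # Note: this also includes benign reasons such as ContainerCreating.
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason"):
                reasons.append(waiting["reason"])
            # lastState describes how the previous container run ended.
            terminated = cs.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                reasons.append("OOMKilled")
        if reasons:
            problems[pod["metadata"]["name"]] = reasons
    return problems


def get_pods(namespace: str = "production", selector: str = "app=myapp") -> dict:
    """Fetch the pod list as JSON (requires kubectl and cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

On a live cluster, `find_problem_pods(get_pods())` gives a quick list of pods worth a closer `kubectl describe`.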

ArgoCD Status

# Application status
argocd app get myapp
argocd app diff myapp

# Preview a sync without applying changes
argocd app sync myapp --dry-run

# Rollback (find the history ID with: argocd app history myapp)
argocd app rollback myapp <history-id>
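
For automated checks, the sync and health state can be read from `argocd app get myapp -o json`. The sketch below assumes that JSON has already been loaded into a dict; `summarize_app` is a hypothetical helper, and treating anything other than `Synced` plus `Healthy` as needing attention is an assumption, not a rule from the skill.

```python
def summarize_app(app: dict) -> dict:
    """Summarize sync and health status from an Argo CD Application object."""
    status = app.get("status", {})
    sync = status.get("sync", {}).get("status", "Unknown")
    health = status.get("health", {}).get("status", "Unknown")
    return {
        "sync": sync,
        "health": health,
        # Assumption: only Synced + Healthy counts as "nothing to do".
        "needs_attention": sync != "Synced" or health != "Healthy",
    }
```

An `OutOfSync` result during an incident is a hint to run `argocd app diff myapp` before changing anything else.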

Common Issues and Solutions

High Error Rate

Check recent deployments
Review error logs for patterns
Check dependency health
Verify configuration changes

High Latency

Check database query performance
Review external service latency
Check resource constraints (CPU/memory)
Look for lock contention

OOMKilled Pods

Increase memory limits
Check for memory leaks
Review recent code changes
Consider horizontal scaling

CrashLoopBackOff

Check logs for startup errors
Verify secrets and configs exist
Check health check endpoints
Review recent deployments

ImagePullBackOff

Verify image exists in registry
Check image pull secrets
Verify image tag is correct
Check registry connectivity
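
The pod-level checklists above lend themselves to a simple lookup keyed by the reason Kubernetes reports. This is an illustrative sketch only; `CHECKLISTS`, `ALIASES`, and `checklist_for` are hypothetical names, and the `ErrImagePull` alias is an added assumption.

```python
from typing import Dict, List

# Checklists copied from the sections above, keyed by pod status reason.
CHECKLISTS: Dict[str, List[str]] = {
    "OOMKilled": [
        "Increase memory limits",
        "Check for memory leaks",
        "Review recent code changes",
        "Consider horizontal scaling",
    ],
    "CrashLoopBackOff": [
        "Check logs for startup errors",
        "Verify secrets and configs exist",
        "Check health check endpoints",
        "Review recent deployments",
    ],
    "ImagePullBackOff": [
        "Verify image exists in registry",
        "Check image pull secrets",
        "Verify image tag is correct",
        "Check registry connectivity",
    ],
}

# Assumption: ErrImagePull usually precedes ImagePullBackOff, so reuse its list.
ALIASES: Dict[str, str] = {"ErrImagePull": "ImagePullBackOff"}


def checklist_for(reason: str) -> List[str]:
    """Return the checklist for a pod status reason, or an empty list."""
    return CHECKLISTS.get(ALIASES.get(reason, reason), [])
```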

Healing Guidelines

Diagnose first - Understand the root cause before fixing
Minimal changes - Fix only what's broken
Document findings - Store learnings in memory for future incidents
Validate fix - Confirm metrics improve after deployment
Roll back if needed - Don't hesitate to roll back if the fix doesn't work
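
The validate and roll back guidelines can be expressed as a small decision rule. This is a sketch under stated assumptions: the 50% relative-improvement threshold and the name `fix_verdict` are hypothetical, and a real check would likely look at several metrics over a longer window.

```python
def fix_verdict(
    error_rate_before: float,
    error_rate_after: float,
    min_relative_improvement: float = 0.5,
) -> str:
    """Return "keep" if the error rate improved enough, else "roll back"."""
    if error_rate_before <= 0:
        # Nothing to improve on; any new errors count as a regression.
        return "keep" if error_rate_after <= 0 else "roll back"
    improvement = (error_rate_before - error_rate_after) / error_rate_before
    return "keep" if improvement >= min_relative_improvement else "roll back"
```

For example, a drop from a 10% to a 1% error rate is a 90% relative improvement and keeps the fix, while a drop from 10% to 9% does not clear the assumed threshold.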

Post-Incident

Update metrics/alerts if needed
Document root cause and fix
Store learnings in memory for similar incidents
Consider preventive measures
Update runbooks if applicable
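
One possible shape for a stored learning is a small structured record. The schema below is purely hypothetical: the field names are assumptions for illustration, not the format used by the skill's memory tool.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class IncidentRecord:
    """Hypothetical post-incident record; adapt the fields to your memory store."""

    title: str
    root_cause: str
    fix: str
    preventive_measures: List[str] = field(default_factory=list)
    runbooks_updated: List[str] = field(default_factory=list)


def to_json(record: IncidentRecord) -> str:
    """Serialize a record to stable, human-readable JSON."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

Keeping the root cause and the fix as separate fields makes later searches for similar incidents easier to match against symptoms.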

Score

Total Score

65/100

Based on repository quality metrics

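✓SKILL.md

Includes a SKILL.md file

+20

✓LICENSE

A license is set

+10

○Description

Description of 100 characters or more

0/10

○Popularity

100 or more GitHub stars

0/15

✓Recent activity

Updated within the last month

+10

○Forks

Forked 10 or more times

0/5

✓Issue management

Fewer than 50 open issues

✓Language

A programming language is set

✓Tags

At least one tag is set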

Related Skills

greeter

pr-creator

code-reviewer

skill-creator

browser-use

git-workflow