operating-production-services

by mjunaidca

A curated collection of Agent Skills — reusable units of intelligence that teach AI General Agents how to perform specific tasks autonomously.

18🍴 9📅 Jan 23, 2026

SKILL.md


name: operating-production-services
description: |
  SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response.
  Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices.
  NOT for initial service development (use scaffolding skills instead).

Operating Production Services

Production reliability patterns: measure what matters, learn from failures, improve systematically.

Quick Reference

| Need | Go To |
|------|-------|
| Define reliability targets | SLOs & Error Budgets |
| Write incident report | Postmortem Templates |
| Set up SLO alerting | references/slo-alerting.md |

SLOs & Error Budgets

The Hierarchy

SLA (Contract) → SLO (Target) → SLI (Measurement)

Common SLIs

# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
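The two ratios above can also be checked by hand from raw counts. The following is a minimal Python sketch of the same arithmetic; the function names and the choice to treat zero traffic as fully compliant are assumptions made for illustration, not part of the skill.

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not end in a 5xx response."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return (total_requests - server_errors) / total_requests


def latency_sli(requests_under_threshold: int, total_requests: int) -> float:
    """Fraction of requests that completed below the latency threshold."""
    if total_requests == 0:
        # Same assumption as above: no traffic means no violations.
        return 1.0
    return requests_under_threshold / total_requests


if __name__ == "__main__":
    # 500 server errors out of 1,000,000 requests in the window.
    print(f"availability: {availability_sli(1_000_000, 500):.4%}")
    # 990,000 of 1,000,000 requests finished under the 0.5 s bucket.
    print(f"latency:      {latency_sli(990_000, 1_000_000):.4%}")
```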

SLO Targets Reality Check

| SLO % | Downtime/Month | Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |

Don't aim for 100%. Each additional nine cuts the allowed downtime tenfold, and the cost of achieving it rises steeply.
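The downtime figures in the table follow from multiplying the period by `1 - SLO`. Here is a small sketch of that calculation, assuming a 30-day month and a 365-day year (the table's rounded values are consistent with those periods); `allowed_downtime` is a hypothetical helper name.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Downtime an SLO permits over a period: period * (1 - SLO)."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    budget_fraction = 1 - slo_percent / 100
    return period * budget_fraction


if __name__ == "__main__":
    month = timedelta(days=30)   # assumption: 30-day month
    year = timedelta(days=365)   # assumption: 365-day year
    for slo in (99.0, 99.9, 99.95, 99.99):
        per_month = allowed_downtime(slo, month).total_seconds() / 60
        per_year = allowed_downtime(slo, year).total_seconds() / 3600
        print(f"{slo}%: {per_month:.1f} min/month, {per_year:.2f} h/year")
```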

Error Budget

Error Budget = 1 - SLO Target

Example: 99.9% SLO = 0.1% error budget = 43 minutes/month

Policy:

| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |
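One way to apply the policy mechanically is to compute the remaining budget from request counts and map it to the action in the table. This is a sketch under stated assumptions: the budget is measured in failed requests rather than minutes, and both function names are hypothetical.

```python
def budget_remaining(slo_percent: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent, clamped to [0, 1]."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures <= 0:
        return 0.0  # a 100% SLO (or no traffic) leaves no budget to spend
    return max(0.0, 1 - failed_requests / allowed_failures)


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the action in the policy table."""
    if remaining <= 0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:
        return "Postpone risky changes"
    return "Normal velocity"


if __name__ == "__main__":
    # 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
    remaining = budget_remaining(99.9, 1_000_000, 400)
    print(f"{remaining:.0%} remaining -> {policy_action(remaining)}")
```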

See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
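The referenced file is not reproduced here, so the following is only a sketch of the general idea behind multi-window burn rate alerting, not the rules that file contains. Burn rate is the observed error ratio divided by the error budget; an alert fires only when both a long and a short window exceed the threshold, which filters out spikes that have already ended. The default threshold of 14.4 is an assumption: it is the rate that would spend 2% of a 30-day budget in one hour (0.02 × 720 hours).

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    budget = 1 - slo_percent / 100
    if budget <= 0:
        raise ValueError("a 100% SLO has no error budget to burn")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_percent: float,
                threshold: float = 14.4) -> bool:
    """Fire only if both windows burn faster than the threshold.

    The threshold default is an illustrative assumption, not a value
    taken from references/slo-alerting.md.
    """
    return (burn_rate(long_window_error_ratio, slo_percent) > threshold
            and burn_rate(short_window_error_ratio, slo_percent) > threshold)


if __name__ == "__main__":
    # 2% errors in both windows against a 99.9% SLO is a 20x burn.
    print(should_page(0.02, 0.02, 99.9))
    # The short window has recovered, so the alert stays quiet.
    print(should_page(0.02, 0.001, 99.9))
```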


Postmortem Templates

The Blameless Principle

| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |

When to Write Postmortems

  • SEV1/SEV2 incidents
  • Customer-facing outages > 15 minutes
  • Data loss or security incidents
  • Near-misses that could have been severe
  • Novel failure modes
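The criteria above can be encoded as a simple check, for example in incident tooling. The sketch below assumes a hypothetical `Incident` record; the field names are illustrative and not defined by the skill.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    severity: int                       # 1 for SEV1, 2 for SEV2, ...
    customer_facing_outage_minutes: float = 0
    data_loss: bool = False
    security_incident: bool = False
    near_miss: bool = False
    novel_failure_mode: bool = False


def postmortem_reasons(incident: Incident) -> list[str]:
    """Return every criterion from the list that this incident meets."""
    reasons = []
    if incident.severity in (1, 2):
        reasons.append("SEV1/SEV2 incident")
    if incident.customer_facing_outage_minutes > 15:
        reasons.append("Customer-facing outage > 15 minutes")
    if incident.data_loss or incident.security_incident:
        reasons.append("Data loss or security incident")
    if incident.near_miss:
        reasons.append("Near-miss that could have been severe")
    if incident.novel_failure_mode:
        reasons.append("Novel failure mode")
    return reasons


if __name__ == "__main__":
    incident = Incident(severity=3, customer_facing_outage_minutes=20)
    print(postmortem_reasons(incident) or "No postmortem required")
```

An empty list means none of the criteria apply; any non-empty result means a postmortem should be written.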

Standard Template

# Postmortem: [Incident Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX

## Executive Summary
One paragraph: what happened, impact, root cause, resolution.

## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |

## Root Cause Analysis

### 5 Whys
1. Why did service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]

## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X

## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |

Quick Template (Minor Incidents)

# Quick Postmortem: [Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3

## What Happened
One sentence description.

## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution

## Root Cause
One sentence.

## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
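The three timeline entries in the quick template (trigger, detection, resolution) are enough to derive detection and resolution times. A minimal sketch, assuming all timestamps are `HH:MM` on the same day; the function names are hypothetical, and incidents that cross midnight would need full dates.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on one day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes = int(delta.total_seconds() // 60)
    if minutes < 0:
        # Assumption: same-day timestamps only.
        raise ValueError("end is before start; use full dates across midnight")
    return minutes


def timeline_metrics(trigger: str, detection: str, resolution: str) -> dict:
    """Derive detection, resolution, and total duration from the timeline."""
    return {
        "time_to_detect_min": minutes_between(trigger, detection),
        "time_to_resolve_min": minutes_between(detection, resolution),
        "total_duration_min": minutes_between(trigger, resolution),
    }


if __name__ == "__main__":
    print(timeline_metrics("14:02", "14:10", "14:45"))
```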

Postmortem Meeting Guide

Structure (60 min)

  1. Opening (5 min) - Remind: "We're here to learn, not blame"
  2. Timeline (15 min) - Walk through events chronologically
  3. Analysis (20 min) - What failed? Why? What allowed it?
  4. Action Items (15 min) - Prioritize, assign owners, set dates
  5. Closing (5 min) - Summarize learnings, confirm owners

Facilitation Tips

  • Redirect blame to systems: "What made this mistake possible?"
  • Time-box tangents
  • Document dissenting views
  • Encourage quiet participants

Anti-Patterns

| Don't | Do Instead |
|-------|------------|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
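The "owner + date + ticket" rule lends itself to an automated check before a postmortem is closed. The sketch below uses a hypothetical `ActionItem` record mirroring the columns of the action items table in the standard template; it is an illustration, not what `scripts/verify.py` does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical record mirroring the action items table columns."""
    priority: str
    action: str
    owner: Optional[str] = None
    due: Optional[str] = None
    ticket: Optional[str] = None


def missing_fields(item: ActionItem) -> list[str]:
    """Names of the required fields that are empty or unset."""
    return [name for name in ("owner", "due", "ticket")
            if not getattr(item, name)]


def orphaned_items(items: list[ActionItem]) -> dict[str, list[str]]:
    """Map each incomplete action to the fields it still lacks."""
    orphans = {}
    for item in items:
        missing = missing_fields(item)
        if missing:
            orphans[item.action] = missing
    return orphans


if __name__ == "__main__":
    items = [
        ActionItem("P0", "Roll back config", "@alice", "2026-02-01", "XXX-123"),
        ActionItem("P2", "Improve detection", owner="@bob"),
    ]
    print(orphaned_items(items))
```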

Verification

Run: python scripts/verify.py

References
