operating-production-services

Name: operating-production-services
Rating: 65
Author: mjunaidca

A curated collection of Agent Skills — reusable units of intelligence that teach AI General Agents how to perform specific tasks autonomously.

⭐ 18 · 🍴 9 · 📅 Jan 23, 2026

agent agent-skills agentskills ai-agents-for-business claudecode-config claudecode-hooks claudecode-monitoring

SKILL.md

name: operating-production-services
description: |
  SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).

Operating Production Services

Production reliability patterns: measure what matters, learn from failures, improve systematically.

Quick Reference

Need	Go To
Define reliability targets	SLOs & Error Budgets
Write incident report	Postmortem Templates
Set up SLO alerting	references/slo-alerting.md

SLOs & Error Budgets

The Hierarchy

SLA (Contract) → SLO (Target) → SLI (Measurement)

Common SLIs

# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests completing within the threshold (le="0.5" means 0.5 s) / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
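
Outside Prometheus, the two ratios above reduce to simple arithmetic on raw counts. The following is a minimal sketch, not part of the skill itself; the function names and the choice to treat zero traffic as fully compliant are assumptions made for illustration.

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not end in a 5xx response."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return (total_requests - server_errors) / total_requests


def latency_sli(fast_requests: int, total_requests: int) -> float:
    """Fraction of requests served within the latency threshold (e.g. 0.5 s)."""
    if total_requests == 0:
        # Same assumption as above for an empty window.
        return 1.0
    return fast_requests / total_requests


if __name__ == "__main__":
    # Illustrative counts only, not measurements from a real service.
    print(f"availability: {availability_sli(1_000_000, 500):.4%}")
    print(f"latency:      {latency_sli(985_000, 1_000_000):.4%}")
```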

SLO Targets Reality Check

SLO %	Downtime/Month	Downtime/Year
99%	7.2 hours	3.65 days
99.9%	43 minutes	8.76 hours
99.95%	22 minutes	4.38 hours
99.99%	4.3 minutes	52 minutes

Don't aim for 100%. Each additional nine cuts the allowed downtime tenfold and costs disproportionately more to achieve.
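
The downtime figures in the table follow from multiplying the window length by the error budget. A small sketch of that arithmetic, assuming a 30-day month and a 365-day year (the function name is illustrative):

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Downtime permitted by an SLO over a window: window * (1 - SLO)."""
    error_budget = 1 - slo_percent / 100
    return window * error_budget


if __name__ == "__main__":
    month = timedelta(days=30)   # assumption: 30-day month
    year = timedelta(days=365)   # assumption: 365-day year
    for slo in (99.0, 99.9, 99.95, 99.99):
        per_month = allowed_downtime(slo, month).total_seconds() / 60
        per_year = allowed_downtime(slo, year).total_seconds() / 3600
        print(f"{slo}%: {per_month:.1f} min/month, {per_year:.2f} h/year")
```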

Error Budget

Error Budget = 1 - SLO Target

Example: a 99.9% SLO leaves a 0.1% error budget, which is about 43 minutes of downtime in a 30-day month.

Policy:

Budget Remaining	Action
> 50%	Normal velocity
10-50%	Postpone risky changes
< 10%	Freeze non-critical changes
0%	Feature freeze, fix reliability
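
The policy table can be expressed as a lookup on the fraction of budget remaining. This is a sketch under stated assumptions: the function names are hypothetical, the budget is measured in request counts rather than minutes, and the boundary values (exactly 10% and exactly 50%) are assigned to the "10-50%" row.

```python
def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent, clamped to [0, 1]."""
    allowed_bad = (1 - slo) * total_events
    if allowed_bad <= 0:
        # A 100% SLO or an empty window leaves no budget to spend.
        return 0.0
    actual_bad = total_events - good_events
    return min(1.0, max(0.0, 1 - actual_bad / allowed_bad))


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the action in the policy table."""
    if remaining <= 0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:
        return "Postpone risky changes"
    return "Normal velocity"


if __name__ == "__main__":
    # Illustrative numbers: 600 failures against a budget of 1000.
    left = budget_remaining(slo=0.999, good_events=999_400, total_events=1_000_000)
    print(f"{left:.0%} of budget left -> {policy_action(left)}")
```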

See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
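
For orientation, burn rate is the observed error ratio divided by the error budget, and a multi-window alert fires only when both a long and a short window exceed the same threshold. The sketch below illustrates the idea only; it does not reproduce the rules in references/slo-alerting.md. The function names are hypothetical, and the threshold of 14.4 is an example value (it corresponds to spending 2% of a 30-day budget in one hour).

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Speed of budget consumption; 1.0 spends the budget exactly over the SLO window."""
    return error_ratio / (1 - slo)


def should_page(long_ratio: float, short_ratio: float, slo: float, threshold: float) -> bool:
    """Fire only if both windows burn faster than the threshold.

    The short window confirms the problem is still happening, which
    lets the alert reset quickly once the service recovers.
    """
    return (
        burn_rate(long_ratio, slo) >= threshold
        and burn_rate(short_ratio, slo) >= threshold
    )


if __name__ == "__main__":
    # Example values only: 2% errors in both windows against a 99.9% SLO.
    print(should_page(long_ratio=0.02, short_ratio=0.02, slo=0.999, threshold=14.4))
```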

Postmortem Templates

The Blameless Principle

Blame-Focused	Blameless
"Who caused this?"	"What conditions allowed this?"
Punish individuals	Improve systems
Hide information	Share learnings

When to Write Postmortems

SEV1/SEV2 incidents
Customer-facing outages > 15 minutes
Data loss or security incidents
Near-misses that could have been severe
Novel failure modes

Standard Template

# Postmortem: [Incident Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX

## Executive Summary
One paragraph: what happened, impact, root cause, resolution.

## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |

## Root Cause Analysis

### 5 Whys
1. Why did service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]

## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X

## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
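
The Duration field in the header and figures such as time to acknowledge can be derived from the timeline rows. A minimal sketch, assuming HH:MM timestamps in UTC and an incident that crosses midnight at most once; the helper name and the sample times are illustrative, not taken from a real incident.

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as "HH:MM" in UTC.

    Assumption: if end is earlier than start, the incident crossed
    midnight exactly once.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    # Made-up sample timeline matching the rows of the template.
    timeline = {
        "first_alert": "14:02",
        "acknowledged": "14:07",
        "root_cause": "14:31",
        "fix_deployed": "14:48",
        "recovered": "14:55",
    }
    print("Time to acknowledge:",
          minutes_between(timeline["first_alert"], timeline["acknowledged"]), "min")
    print("Duration:",
          minutes_between(timeline["first_alert"], timeline["recovered"]), "min")
```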

Quick Template (Minor Incidents)

# Quick Postmortem: [Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3

## What Happened
One sentence description.

## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution

## Root Cause
One sentence.

## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]

Postmortem Meeting Guide

Structure (60 min)

Opening (5 min) - Remind: "We're here to learn, not blame"
Timeline (15 min) - Walk through events chronologically
Analysis (20 min) - What failed? Why? What allowed it?
Action Items (15 min) - Prioritize, assign owners, set dates
Closing (5 min) - Summarize learnings, confirm owners

Facilitation Tips

Redirect blame to systems: "What made this mistake possible?"
Time-box tangents
Document dissenting views
Encourage quiet participants

Anti-Patterns

Don't	Do Instead
Aim for 100% SLO	Accept error budget exists
Skip small incidents	Small incidents reveal patterns
Orphan action items	Every item needs owner + date + ticket
Blame individuals	Ask "what conditions allowed this?"
Create busywork actions	Actions should prevent recurrence
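
The rule that every action item needs an owner, a date, and a ticket is easy to check mechanically before a postmortem is published. The following is one possible sketch; the `ActionItem` class and `missing_fields` helper are hypothetical names and are not the contents of scripts/verify.py.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    priority: str
    action: str
    owner: str = ""
    due: str = ""
    ticket: str = ""


def missing_fields(item: ActionItem) -> list[str]:
    """Return which of owner, due, and ticket are empty or whitespace."""
    return [
        name
        for name in ("owner", "due", "ticket")
        if not getattr(item, name).strip()
    ]


if __name__ == "__main__":
    # Illustrative items only.
    items = [
        ActionItem("P0", "Roll back config", "@alice", "2026-02-01", "XXX-123"),
        ActionItem("P1", "Add canary stage", owner="@bob"),
    ]
    for item in items:
        gaps = missing_fields(item)
        status = "ok" if not gaps else "orphaned, missing: " + ", ".join(gaps)
        print(f"{item.priority} {item.action}: {status}")
```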

Verification

Run: python scripts/verify.py

References

references/slo-alerting.md - Prometheus rules, burn rate alerts, Grafana dashboards

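Score

Total Score: 65/100 (based on repository quality metrics)

Check	Criterion	Status	Points
SKILL.md	A SKILL.md file is included	✓	+20
LICENSE	A license is set	○	0/10
Description	The description is at least 100 characters long	✓	+10
Popularity	100 or more GitHub stars	○	0/15
Recent activity	Updated within the last 3 months	○	0/10
Forks	Forked 10 or more times	○	0/5
Issue management	Fewer than 50 open issues	✓	(not shown)
Language	A programming language is set	✓	(not shown)
Tags	At least one tag is set	✓	(not shown)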

Related Skills

orpc-contract-first

component-refactoring

web-design-guidelines

frontend-code-review

frontend-testing

vercel-react-best-practices