---
name: prd-v08-monitoring-setup
description: >
  Define monitoring strategy, metrics collection, and alerting thresholds during
  PRD v0.8 Deployment & Ops. Triggers on requests to set up monitoring, define
  alerts, or when user asks "what should we monitor?", "alerting strategy",
  "observability", "metrics", "SLOs", "dashboards", "monitoring setup". Outputs
  MON- entries with monitoring rules and alert configurations.
---

Monitoring Setup

Position in workflow: v0.8 Runbook Creation → v0.8 Monitoring Setup → v0.9 GTM Strategy

Purpose

Define what to measure, when to alert, and how to visualize system health—creating the observability foundation that enables rapid incident detection and resolution.

Core Concept: Monitoring as Early Warning

Monitoring is not about collecting data—it is about detecting problems before users do. Every metric should answer: "Is this working? If not, what's broken?"

Monitoring Layers

| Layer | What to Measure | Why It Matters |
| --- | --- | --- |
| Infrastructure | CPU, memory, disk, network | System health foundation |
| Application | Latency, errors, throughput | User-facing performance |
| Business | Signups, conversions, revenue | Product health |
| User Experience | Page load, interaction time | Real user impact |

Execution

  1. Define SLOs (Service Level Objectives)

    • What uptime do we promise?
    • What latency is acceptable?
    • What error rate is tolerable?
  2. Identify key metrics per layer

    • Infrastructure: Resource utilization
    • Application: RED metrics (Rate, Errors, Duration)
    • Business: KPI- entries from v0.3 and v0.9
    • User: Core Web Vitals, journey completion
  3. Set alert thresholds

    • Warning: Investigate soon
    • Critical: Act immediately
    • Base on SLOs and historical data (see the threshold sketch after this list)
  4. Map alerts to runbooks

    • Every critical alert → RUN- procedure
    • No alert without action path
  5. Design dashboards

    • Overview: System health at a glance
    • Deep-dive: Per-service details
    • Business: KPI tracking
  6. Create MON- entries with full traceability
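
Step 3 is where teams most often guess. A minimal sketch of one way to derive warning and critical latency thresholds from historical data, assuming a small set of daily p95 samples from staging; the multipliers are illustrative and should be reconciled with the SLO targets from step 1:

```python
import statistics

def derive_latency_thresholds(historical_p95_ms, warning_factor=1.5, critical_factor=3.0):
    """Derive alert thresholds from a historical p95 latency baseline.

    The factors are illustrative assumptions; real values should be
    checked against the SLO latency targets defined in step 1.
    """
    baseline = statistics.median(historical_p95_ms)  # median smooths one-off spikes
    return {
        "baseline_ms": baseline,
        "warning_ms": baseline * warning_factor,
        "critical_ms": baseline * critical_factor,
    }

# Hypothetical daily p95 samples (ms) collected from staging
samples = [180, 210, 195, 220, 205, 190, 230]
print(derive_latency_thresholds(samples))
# {'baseline_ms': 205, 'warning_ms': 307.5, 'critical_ms': 615.0}
```

The derived values then become the Condition fields of the corresponding MON- alert entries.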

MON- Output Template

MON-XXX: [Monitoring Rule Title]
Type: [Metric | Alert | Dashboard | SLO]
Layer: [Infrastructure | Application | Business | User Experience]
Owner: [Team responsible for this metric/alert]

For Metric Type:
  Name: [metric.name.format]
  Description: [What this measures]
  Unit: [count | ms | percentage | bytes]
  Source: [Where this comes from]
  Aggregation: [avg | sum | p50 | p95 | p99]
  Retention: [How long to keep data]

For Alert Type:
  Metric: [MON-YYY or metric name]
  Condition: [Threshold expression]
  Window: [Time window for evaluation]
  Severity: [Critical | Warning | Info]
  Runbook: [RUN-XXX to follow when fired]
  Notification:
    - Channel: [Slack, PagerDuty, Email]
    - Recipients: [Team or individuals]
  Silencing: [When to suppress, e.g., maintenance windows]

For Dashboard Type:
  Purpose: [What questions this answers]
  Audience: [Who uses this dashboard]
  Panels: [List of visualizations]
  Refresh: [How often to update]

For SLO Type:
  Objective: [What we promise]
  Target: [Percentage, e.g., 99.9%]
  Window: [Rolling 30 days]
  Error Budget: [How much downtime allowed]
  Alerting: [When error budget is at risk]

Linked IDs: [API-XXX, UJ-XXX, KPI-XXX, RUN-XXX related]

Example MON- entries:

MON-001: API Request Latency (p95)
Type: Metric
Layer: Application
Owner: Backend Team

Name: api.request.latency.p95
Description: 95th percentile response time for all API endpoints
Unit: ms
Source: Application APM (Datadog/New Relic)
Aggregation: p95
Retention: 90 days

Linked IDs: API-001 to API-020
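
In production this metric would come from the APM agent named in Source, but a minimal sketch of instrumenting a handler to emit per-request latency may help; the `emit` hook below is a hypothetical stand-in, not Datadog's or New Relic's actual client API, and the p95 aggregation itself happens in the APM backend per the Aggregation field above:

```python
import time
from functools import wraps

def observe_latency(metric_name, emit):
    """Wrap a handler and emit its duration under the given metric name.

    `emit` is a hypothetical hook standing in for the APM client; the
    p95 aggregation happens downstream, not here.
    """
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                duration_ms = (time.perf_counter() - start) * 1000
                emit(metric_name, duration_ms)
        return wrapper
    return decorator

@observe_latency("api.request.latency", emit=lambda name, ms: print(f"{name}: {ms:.1f} ms"))
def get_project(project_id):
    time.sleep(0.05)  # stand-in for real handler work
    return {"id": project_id}

get_project(42)  # emits roughly "api.request.latency: 50.x ms"
```
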
MON-002: High Latency Alert
Type: Alert
Layer: Application
Owner: Backend Team

Metric: MON-001 (api.request.latency.p95)
Condition: > 500ms
Window: 5 minutes
Severity: Warning
Runbook: RUN-006 (Performance Degradation Investigation)

Notification:
  - Channel: Slack #backend-alerts
  - Recipients: Backend on-call

Silencing: During scheduled deployments (DEP-002 windows)

Linked IDs: MON-001, RUN-006, DEP-002

MON-003: Critical Latency Alert
Type: Alert
Layer: Application
Owner: Backend Team

Metric: MON-001 (api.request.latency.p95)
Condition: > 2000ms
Window: 2 minutes
Severity: Critical
Runbook: RUN-006 (Performance Degradation Investigation)

Notification:
  - Channel: PagerDuty
  - Recipients: Backend on-call, Tech Lead

Silencing: None (always alert on critical)

Linked IDs: MON-001, RUN-006
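
MON-002 and MON-003 share a metric but differ in threshold, window, and severity. A minimal evaluation sketch, assuming per-minute p95 values are already available; the data source and windowing are simplified, not a specific alerting product's semantics:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    mon_id: str
    threshold_ms: float
    window_minutes: int
    severity: str
    runbook: str

RULES = [
    AlertRule("MON-002", 500, 5, "Warning", "RUN-006"),
    AlertRule("MON-003", 2000, 2, "Critical", "RUN-006"),
]

def evaluate(p95_by_minute, rules=RULES):
    """Return the rules whose threshold is breached for every minute of their window.

    p95_by_minute: most-recent-last list of per-minute p95 latencies (ms).
    """
    fired = []
    for rule in rules:
        window = p95_by_minute[-rule.window_minutes:]
        if len(window) == rule.window_minutes and all(v > rule.threshold_ms for v in window):
            fired.append(rule)
    return fired

# Hypothetical last five minutes of MON-001 (api.request.latency.p95)
recent = [620, 700, 2100, 2400, 2300]
for rule in evaluate(recent):
    print(f"{rule.mon_id} {rule.severity}: follow {rule.runbook}")
```
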
MON-004: API Availability SLO
Type: SLO
Layer: Application
Owner: Platform Team

Objective: API endpoints return non-5xx response
Target: 99.9%
Window: Rolling 30 days
Error Budget: 43.2 minutes/month

Alerting:
  - 50% budget consumed → Warning to engineering
  - 75% budget consumed → Critical, freeze non-essential deploys
  - 100% budget consumed → Incident review required

Linked IDs: API-001 to API-020, DEP-003
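
The 43.2-minute error budget follows directly from the target and window, and the budget-consumption alerts can be derived the same way. A minimal sketch of the arithmetic; the consumed-downtime figure is hypothetical:

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Allowed downtime for an availability target over a rolling window."""
    return (1 - target) * window_days * 24 * 60

budget = error_budget_minutes(0.999)  # 43.2 minutes per rolling 30-day window
consumed = 25.0                       # hypothetical downtime so far this window

ratio = consumed / budget
if ratio >= 1.0:
    print("100% budget consumed: incident review required")
elif ratio >= 0.75:
    print("75% budget consumed: freeze non-essential deploys")
elif ratio >= 0.50:
    print("50% budget consumed: warn engineering")
print(f"Budget: {budget:.1f} min, consumed: {consumed:.1f} min ({ratio:.0%})")
```
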
MON-005: System Health Dashboard
Type: Dashboard
Layer: Infrastructure + Application
Owner: Platform Team

Purpose: Quick health check for on-call engineers
Audience: On-call, engineering leadership
Panels:
  - API Request Rate (last 1h)
  - API Latency (p50, p95, p99)
  - Error Rate by Endpoint
  - Active Alerts
  - Database Connection Pool
  - CPU/Memory by Service
Refresh: 30 seconds

Linked IDs: MON-001, MON-002, MON-003

The RED Method (Application Monitoring)

For each service, measure:

| Metric | What It Measures | Alert Threshold |
| --- | --- | --- |
| Rate | Requests per second | Anomaly detection |
| Errors | Failed requests / total | >1% warning, >5% critical |
| Duration | Request latency (p95, p99) | >500ms warning, >2s critical |
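
A minimal sketch of computing RED metrics from raw request records, assuming a simple (status_code, duration_ms) record format; the thresholds referenced in the comment are the ones from the table above:

```python
import statistics

def red_metrics(requests, window_seconds):
    """requests: list of (status_code, duration_ms) observed during the window."""
    total = len(requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    durations = sorted(d for _, d in requests)
    return {
        "rate_rps": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": statistics.quantiles(durations, n=20)[-1] if total >= 2 else None,
    }

# Hypothetical one-minute sample
sample = [(200, 120), (200, 180), (500, 95), (200, 240), (200, 4300), (201, 160)]
print(red_metrics(sample, window_seconds=60))
# error_ratio ~0.17 breaches the >5% critical threshold; p95 is pulled up by the 4300 ms request
```
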

The USE Method (Infrastructure Monitoring)

For each resource (CPU, memory, disk, network):

| Metric | What It Measures | Alert Threshold |
| --- | --- | --- |
| Utilization | % of capacity used | >80% warning, >95% critical |
| Saturation | Queue depth, waiting | >0 for critical resources |
| Errors | Error count/rate | Any errors = investigate |

SLO Framework

| Tier | Availability | Latency (p95) | Use For |
| --- | --- | --- | --- |
| Tier 1 | 99.99% (52 min/yr) | <100ms | Payment, auth |
| Tier 2 | 99.9% (8.7 hr/yr) | <500ms | Core features |
| Tier 3 | 99% (3.6 days/yr) | <2s | Background jobs |

Alert Severity Matrix

| Severity | User Impact | Response Time | Notification |
| --- | --- | --- | --- |
| Critical | Service unusable | <5 min | PagerDuty (wake up) |
| Warning | Degraded experience | <30 min | Slack (business hours) |
| Info | No immediate impact | Next day | Dashboard/log |
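
A minimal sketch of routing a fired alert according to this matrix; the channel names are placeholders, and the response times mirror the table:

```python
# Severity -> (notification target, expected response time), per the matrix above.
ROUTING = {
    "Critical": ("pagerduty", "5 minutes"),
    "Warning": ("slack:#backend-alerts", "30 minutes"),
    "Info": ("dashboard", "next business day"),
}

def route(alert_id: str, severity: str, runbook: str) -> str:
    target, respond_within = ROUTING[severity]
    return (f"[{severity}] {alert_id} -> notify {target}, "
            f"respond within {respond_within}, follow {runbook}")

print(route("MON-003", "Critical", "RUN-006"))
print(route("MON-002", "Warning", "RUN-006"))
```
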

Dashboard Design Principles

| Principle | Implementation |
| --- | --- |
| Answer questions | Each panel answers "Is X working?" |
| Hierarchy | Overview → Service → Component |
| Context | Show thresholds, comparisons |
| Actionable | Link to runbooks from alerts |
| Fast | Quick load, auto-refresh |

Anti-Patterns

| Pattern | Signal | Fix |
| --- | --- | --- |
| Alert fatigue | Too many alerts, team ignores | Tune thresholds, remove noise |
| No runbook link | Alert fires, no one knows what to do | Every alert → RUN- |
| Vanity metrics | "1 million requests!" without context | Focus on user-impacting metrics |
| Missing baselines | No historical comparison | Establish baselines before launch |
| Over-monitoring | 500 metrics, can't find signal | Focus on RED/USE fundamentals |
| Under-monitoring | "We'll add monitoring later" | Monitoring ships with code |

Quality Gates

Before proceeding to v0.9 GTM Strategy:

  • SLOs defined for critical services (MON- SLO type)
  • RED metrics configured for application layer
  • USE metrics configured for infrastructure layer
  • Critical alerts linked to RUN- procedures
  • Overview dashboard created for on-call
  • Alert notification channels configured
  • Baseline metrics established from staging
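
Once MON- entries exist as structured data, these gates can be checked mechanically. A minimal sketch using plain dictionaries; MON-003 and MON-004 come from the examples above, while MON-006 is a hypothetical entry with a deliberate gap:

```python
def check_gates(mon_entries):
    """Flag common gaps before v0.9: critical alerts without runbooks, missing SLOs."""
    problems = []
    for entry in mon_entries:
        if (entry["type"] == "Alert"
                and entry.get("severity") == "Critical"
                and not entry.get("runbook")):
            problems.append(f'{entry["id"]}: critical alert has no RUN- link')
    if not any(e["type"] == "SLO" for e in mon_entries):
        problems.append("no SLO defined for critical services")
    return problems

entries = [
    {"id": "MON-003", "type": "Alert", "severity": "Critical", "runbook": "RUN-006"},
    {"id": "MON-004", "type": "SLO"},
    {"id": "MON-006", "type": "Alert", "severity": "Critical"},  # hypothetical gap
]
print(check_gates(entries))  # ['MON-006: critical alert has no RUN- link']
```
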

Downstream Connections

| Consumer | What It Uses | Example |
| --- | --- | --- |
| On-Call Team | MON- alerts trigger response | MON-003 → page engineer |
| v0.9 Launch Metrics | MON- provides baseline data | MON-001 baseline → KPI-010 target |
| Post-Mortems | MON- data for incident analysis | "MON-005 showed spike at 14:32" |
| Capacity Planning | MON- trends inform scaling | USE metrics → infrastructure planning |
| DEP- Rollback | MON- thresholds trigger rollback | MON-002 breach → DEP-003 rollback |

Detailed References

  • Monitoring stack examples: See references/monitoring-stack.md
  • MON- entry template: See assets/mon-template.md
  • SLO calculation guide: See references/slo-guide.md
  • Dashboard best practices: See references/dashboard-guide.md

Reviews coming soon