
observability-standards

by zircote

Software Development Lifecycle standards plugin for AI coding assistants. Enforces build, quality, testing, CI/CD, security, and documentation best practices.

Jan 21, 2026

SKILL.md


name: Observability Standards
description: This skill should be used when the user asks about "observability", "logging", "metrics", "tracing", "monitoring", "structured logging", "log format", "log levels", "distributed tracing", "OpenTelemetry", "health checks", or needs guidance on implementing observability and monitoring requirements.
version: 1.0.0

Observability Standards

Guidance for implementing observability requirements including logging, metrics, tracing, and monitoring configuration.

Tooling

Available Tools: If using Claude Code, the agents:sre-engineer agent specializes in observability setup and SLO management. The agents:devops-engineer agent can help configure monitoring infrastructure.

Logging Requirements

Structured Logging (MUST)

All logging MUST use a structured format (JSON preferred):

{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "info",
  "message": "Request processed",
  "service": "api-gateway",
  "trace_id": "abc123",
  "span_id": "def456",
  "duration_ms": 45,
  "status_code": 200
}

Log Levels (MUST)

Use consistent log levels with defined semantics:

| Level | Purpose | When to Use |
| --- | --- | --- |
| ERROR | Errors requiring attention | Failures, exceptions |
| WARN | Potential issues | Degraded performance, retries |
| INFO | Significant events | Request completion, state changes |
| DEBUG | Detailed information | Development, troubleshooting |
| TRACE | Very detailed tracing | Deep debugging |
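
As one way to make the level semantics above concrete, the sketch below maps the five level names onto Python's standard `logging` module. Python has no built-in TRACE level, so the numeric value `5` and the helper name `is_enabled` are assumptions made for illustration, not part of this standard.

```python
import logging

# Python's logging module has no TRACE level; 5 is an assumed value below DEBUG (10).
TRACE = 5
logging.addLevelName(TRACE, "TRACE")

# Level names from the table mapped to numeric severities.
# Python calls WARN "WARNING" internally, so this mapping is used for lookup only.
LEVELS = {
    "ERROR": logging.ERROR,
    "WARN": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
    "TRACE": TRACE,
}


def is_enabled(configured: str, candidate: str) -> bool:
    """Return True if a message at `candidate` level passes a `configured` threshold."""
    return LEVELS[candidate.upper()] >= LEVELS[configured.upper()]
```

With a threshold of `info`, for example, `is_enabled("info", "warn")` is true while `is_enabled("info", "debug")` is false.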

Required Log Fields (MUST)

All log entries MUST include:

| Field | Description |
| --- | --- |
| timestamp | ISO 8601 format with timezone |
| level | Log severity level |
| message | Human-readable description |
| service | Service/application name |

Log entries SHOULD include when applicable:

| Field | Description |
| --- | --- |
| trace_id | Distributed trace identifier |
| span_id | Span identifier |
| user_id | User identifier (if authenticated) |
| request_id | Request correlation ID |
| duration_ms | Operation duration |
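
The required and optional fields above could be produced with nothing more than the Python standard library. The following is a minimal sketch, not a prescribed implementation: `JsonFormatter`, `build_logger`, and the `SERVICE_NAME` constant are hypothetical names, and the service value is borrowed from the sample entry earlier in this document.

```python
import json
import logging
from datetime import datetime, timezone

SERVICE_NAME = "api-gateway"  # assumed; taken from the sample log entry
OPTIONAL_FIELDS = ("trace_id", "span_id", "user_id", "request_id", "duration_ms")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object containing the required fields."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": SERVICE_NAME,
        }
        # Optional fields arrive through logging's `extra=` argument,
        # which sets them as attributes on the record.
        for name in OPTIONAL_FIELDS:
            value = getattr(record, name, None)
            if value is not None:
                entry[name] = value
        return json.dumps(entry)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().info("Request processed", extra={"trace_id": "abc123", "duration_ms": 45})` would then emit a single JSON line carrying both the required and the supplied optional fields.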

Sensitive Data (MUST NOT)

Logs MUST NOT contain:

  • Passwords or secrets
  • API keys or tokens
  • Personally identifiable information (PII)
  • Credit card numbers
  • Session tokens
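
One possible way to enforce the exclusions above is to redact entries before they reach the log writer. The sketch below is illustrative only: the key patterns and the card-number expression are assumptions, the card pattern is deliberately loose (it will also hit other long digit runs), and general PII detection is out of its scope.

```python
import re
from typing import Any

# Assumed key patterns; extend them for the fields your services actually use.
SENSITIVE_KEY = re.compile(
    r"password|secret|api[_-]?key|token|authorization|card", re.IGNORECASE
)
# Loose match for 13 to 19 digit numbers, optionally separated by spaces or dashes.
CARD_NUMBER = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

REDACTED = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of `value` with sensitive keys and card-like numbers masked."""
    if isinstance(value, dict):
        return {
            key: REDACTED if SENSITIVE_KEY.search(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return CARD_NUMBER.sub(REDACTED, value)
    return value
```

Applying `redact` to the structured payload, rather than to the rendered string, keeps field names intact so downstream queries still work.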

Metrics Requirements

Metric Types (MUST)

Implement appropriate metric types:

| Type | Purpose | Examples |
| --- | --- | --- |
| Counter | Cumulative values | Request count, errors |
| Gauge | Current values | Queue size, connections |
| Histogram | Value distributions | Response time, payload size |
| Summary | Quantile calculations | P50, P95, P99 latencies |

Required Metrics (MUST)

Services MUST expose:

| Metric | Type | Description |
| --- | --- | --- |
| requests_total | Counter | Total requests by endpoint/status |
| request_duration_seconds | Histogram | Request latency |
| errors_total | Counter | Error count by type |
| active_connections | Gauge | Current connections |
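
To show how the four required metrics relate to the metric types, here is a small in-memory sketch that renders them in a Prometheus-style text layout. It is a stand-in for a real client library, not a replacement for one; the `Metrics` class, its method names, and the bucket boundaries are all assumptions.

```python
from collections import defaultdict

# Assumed bucket upper bounds in seconds; tune them to your latency profile.
BUCKETS = (0.05, 0.1, 0.5, 1.0, 5.0)


class Metrics:
    def __init__(self) -> None:
        self.requests_total = defaultdict(int)   # counter, keyed by (endpoint, status)
        self.errors_total = defaultdict(int)     # counter, keyed by error type
        self.active_connections = 0              # gauge: may go up or down
        self.bucket_counts = [0] * len(BUCKETS)  # cumulative histogram buckets
        self.duration_sum = 0.0
        self.duration_count = 0

    def observe_request(self, endpoint: str, status: int, seconds: float) -> None:
        self.requests_total[(endpoint, str(status))] += 1
        self.duration_sum += seconds
        self.duration_count += 1
        # Buckets are cumulative: an observation counts in every bucket it fits under.
        for i, upper in enumerate(BUCKETS):
            if seconds <= upper:
                self.bucket_counts[i] += 1

    def record_error(self, error_type: str) -> None:
        self.errors_total[error_type] += 1

    def render(self) -> str:
        lines = []
        for (endpoint, status), n in sorted(self.requests_total.items()):
            lines.append(f'requests_total{{endpoint="{endpoint}",status="{status}"}} {n}')
        for upper, n in zip(BUCKETS, self.bucket_counts):
            lines.append(f'request_duration_seconds_bucket{{le="{upper}"}} {n}')
        lines.append(f'request_duration_seconds_bucket{{le="+Inf"}} {self.duration_count}')
        lines.append(f"request_duration_seconds_sum {self.duration_sum}")
        lines.append(f"request_duration_seconds_count {self.duration_count}")
        for error_type, n in sorted(self.errors_total.items()):
            lines.append(f'errors_total{{error_type="{error_type}"}} {n}')
        lines.append(f"active_connections {self.active_connections}")
        return "\n".join(lines) + "\n"
```

In practice an established metrics client would handle label escaping, concurrency, and exposition details that this sketch ignores.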

Metric Naming (MUST)

Follow naming conventions:

# Format: <namespace>_<name>_<unit>, with a _total suffix on counters
http_requests_total
http_request_duration_seconds
db_connections_active
cache_hits_total
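
A naming convention like this is easy to check mechanically, for instance in a unit test or CI step. The sketch below is one hedged way to do that; the regular expression and the list of accepted suffixes are assumptions chosen so that the four example names above pass, not rules defined by this standard.

```python
import re

# Lowercase snake_case with at least one underscore, so a namespace prefix is present.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")
# Assumed allow-list of unit or suffix tokens; extend as needed.
KNOWN_SUFFIXES = ("total", "seconds", "bytes", "ratio", "active")


def check_metric_name(name: str) -> list[str]:
    """Return a list of problems with `name`; an empty list means it passes."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("not lowercase snake_case with a namespace prefix")
    if name.rsplit("_", 1)[-1] not in KNOWN_SUFFIXES:
        problems.append("missing a recognised unit or suffix")
    return problems
```

Under these assumptions `check_metric_name("http_request_duration_ms")` reports a suffix problem, since the convention here favours base units such as seconds.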

Metric Labels (SHOULD)

Use consistent label naming:

| Label | Description |
| --- | --- |
| service | Service name |
| endpoint | API endpoint |
| method | HTTP method |
| status | Response status |
| error_type | Error classification |

Distributed Tracing

Tracing Implementation (SHOULD)

Implement distributed tracing using OpenTelemetry:

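# OpenTelemetry Collector configuration
# Receivers and processors referenced by a pipeline must also be declared.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:

exporters:
  otlp:
    endpoint: "collector:4317"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]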

Trace Context (MUST)

When tracing is implemented, propagate context:

| Header | Standard |
| --- | --- |
| traceparent | W3C Trace Context |
| tracestate | W3C Trace Context |
| X-Request-ID | Request correlation |
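
Propagation of the headers above can be sketched without any tracing library. The example below parses a version `00` `traceparent` value (`version-traceid-parentid-flags`, all lowercase hex) and builds the headers for an outgoing call. The function names are hypothetical, header keys are assumed to be spelled exactly as in the table, and the choice to mark brand-new traces as sampled (`01`) is an assumption.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional

TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header, returning None when it is malformed."""
    match = TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero identifiers are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, flags)


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Build headers for a downstream call, continuing the trace if one exists."""
    parent = parse_traceparent(incoming.get("traceparent", ""))
    if parent is not None:
        trace_id, flags = parent.trace_id, parent.flags
    else:
        # No usable context: start a new trace (sampled flag is an assumption).
        trace_id, flags = secrets.token_hex(16), "01"
    # Each hop gets a fresh span ID while the trace ID stays the same.
    headers = {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}
    # tracestate is only meaningful alongside a valid traceparent.
    if parent is not None and "tracestate" in incoming:
        headers["tracestate"] = incoming["tracestate"]
    headers["X-Request-ID"] = incoming.get("X-Request-ID") or secrets.token_hex(8)
    return headers
```

An OpenTelemetry SDK propagator would normally do this work; the sketch only illustrates what "propagate context" means at the header level.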

Span Requirements (SHOULD)

Spans SHOULD include:

  • Operation name
  • Start/end timestamps
  • Status (OK, ERROR)
  • Relevant attributes
  • Error details (if applicable)
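
The span fields listed above can be illustrated with a small context manager. This is a hypothetical stand-in for a tracing SDK, written only to show which data a span carries; `Span` and `start_span` are invented names and nothing here exports to a collector.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, Optional


@dataclass
class Span:
    name: str                                   # operation name
    attributes: Dict[str, Any] = field(default_factory=dict)
    start_time: float = 0.0                     # seconds since the epoch
    end_time: float = 0.0
    status: str = "OK"                          # OK or ERROR
    error: Optional[str] = None                 # error details, if any


@contextmanager
def start_span(name: str, **attributes: Any) -> Iterator[Span]:
    """Record timing and status for the wrapped block."""
    span = Span(name=name, attributes=dict(attributes), start_time=time.time())
    try:
        yield span
    except Exception as exc:
        span.status = "ERROR"
        span.error = f"{type(exc).__name__}: {exc}"
        raise  # the span records the failure but does not swallow it
    finally:
        span.end_time = time.time()
```

Usage would look like `with start_span("db.query", table="users") as span: ...`, after which `span` holds the name, both timestamps, the status, and any error text.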

Health Checks

Health Endpoints (MUST)

Services MUST expose health endpoints:

| Endpoint | Purpose | Response |
| --- | --- | --- |
| /health | Basic health | 200 OK or 503 |
| /health/live | Liveness probe | 200 if running |
| /health/ready | Readiness probe | 200 if ready to serve |

Health Response Format (MUST)

{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 5
    },
    "cache": {
      "status": "healthy",
      "latency_ms": 1
    }
  },
  "version": "1.2.3",
  "uptime_seconds": 3600
}
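
A response in this shape could be assembled by running a set of named checks and aggregating the results, as in the sketch below. The `run_checks` helper, the convention that a check fails by raising, the `"unhealthy"` status value, and the hard-coded version are all assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple

START_TIME = time.monotonic()
VERSION = "1.2.3"  # assumed; normally injected at build time


def run_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run each check and return (HTTP status code, response body)."""
    results = {}
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            check()  # a check signals failure by raising an exception
            status = "healthy"
        except Exception:
            status = "unhealthy"
        latency_ms = round((time.perf_counter() - started) * 1000)
        results[name] = {"status": status, "latency_ms": latency_ms}

    healthy = all(result["status"] == "healthy" for result in results.values())
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "checks": results,
        "version": VERSION,
        "uptime_seconds": int(time.monotonic() - START_TIME),
    }
    # 200 when every dependency passes, 503 otherwise.
    return (200 if healthy else 503), body
```

A web framework handler for `/health` would then serialise `body` as JSON and use the returned status code directly.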

Dependency Checks (SHOULD)

Health checks SHOULD verify:

  • Database connectivity
  • Cache availability
  • External service reachability
  • Disk space adequacy
  • Memory availability
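
Two of the dependency checks above can be sketched with the standard library alone: disk space and TCP reachability of a database, cache, or external service. The function names and default thresholds are assumptions (the 80% disk limit mirrors the alert threshold used elsewhere in this document), and memory checks are left out because the standard library offers no portable way to measure them.

```python
import shutil
import socket


def check_disk(path: str = ".", max_used_ratio: float = 0.8) -> None:
    """Raise RuntimeError if the filesystem holding `path` is too full."""
    usage = shutil.disk_usage(path)
    used_ratio = usage.used / usage.total
    if used_ratio > max_used_ratio:
        raise RuntimeError(f"disk {used_ratio:.0%} used on {path}")


def check_tcp(host: str, port: int, timeout: float = 1.0) -> None:
    """Raise OSError if a TCP connection to host:port cannot be opened."""
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connecting successfully is the whole check
```

A successful TCP connect only shows that something is listening; a stricter check would issue a cheap query or ping against the dependency itself.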

Alerting

Alert Configuration (MUST)

Define alerts for critical conditions:

| Condition | Severity | Response |
| --- | --- | --- |
| Service down | Critical | Immediate page |
| Error rate > 5% | High | Page within 5 min |
| Latency P95 > 1s | Medium | Notify team |
| Disk > 80% | Warning | Create ticket |
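
The thresholds in the table translate directly into rules. The sketch below evaluates them against a single snapshot of measurements; `Snapshot`, `Alert`, and `evaluate` are hypothetical names, and a real deployment would express the same conditions in its alerting system rather than in application code.

```python
from typing import List, NamedTuple


class Snapshot(NamedTuple):
    service_up: bool
    error_rate: float           # fraction of failed requests, 0.0 to 1.0
    latency_p95_seconds: float
    disk_used_ratio: float      # 0.0 to 1.0


class Alert(NamedTuple):
    condition: str
    severity: str
    response: str


def evaluate(snapshot: Snapshot) -> List[Alert]:
    """Return the alerts that fire for this snapshot, most severe first."""
    alerts = []
    if not snapshot.service_up:
        alerts.append(Alert("Service down", "Critical", "Immediate page"))
    if snapshot.error_rate > 0.05:
        alerts.append(Alert("Error rate > 5%", "High", "Page within 5 min"))
    if snapshot.latency_p95_seconds > 1.0:
        alerts.append(Alert("Latency P95 > 1s", "Medium", "Notify team"))
    if snapshot.disk_used_ratio > 0.80:
        alerts.append(Alert("Disk > 80%", "Warning", "Create ticket"))
    return alerts
```

Note that every comparison is strict, so a value sitting exactly on a threshold does not fire.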

Alert Requirements (MUST)

Alerts MUST include:

  • Clear description of condition
  • Severity level
  • Runbook link
  • Affected service/component
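
These requirements can be checked before an alert definition is deployed. The following sketch assumes alerts are represented as dictionaries; the key names (`description`, `severity`, `runbook_url`, `service`) and the severity vocabulary are assumptions derived from the lists in this section.

```python
# Assumed key names for the four required pieces of information.
REQUIRED_ALERT_FIELDS = ("description", "severity", "runbook_url", "service")
ALLOWED_SEVERITIES = {"critical", "high", "medium", "warning"}


def validate_alert(alert: dict) -> list[str]:
    """Return a list of problems; an empty list means the alert is complete."""
    problems = [
        f"missing {name}" for name in REQUIRED_ALERT_FIELDS if not alert.get(name)
    ]
    severity = str(alert.get("severity", "")).lower()
    if severity and severity not in ALLOWED_SEVERITIES:
        problems.append(f"unknown severity {severity!r}")
    return problems
```

Running this over every rule in a CI step is one way to keep alerts without runbooks from reaching production.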

Implementation Checklist

  • Configure structured logging
  • Define log level policies
  • Implement required metrics
  • Set up health endpoints
  • Configure distributed tracing (if applicable)
  • Define alerting rules
  • Create runbooks for alerts
  • Verify sensitive data exclusion

Compliance Verification

# Verify structured log output
app_command 2>&1 | jq .

# Check health endpoint
curl -s http://localhost:8080/health | jq .

# Verify metrics endpoint
curl -s http://localhost:8080/metrics | grep -E "^(http_|app_)"

# Check for sensitive data in logs
grep -r -i -E "password|secret|token" logs/ | wc -l
# Should be 0
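
The log checks above can also be scripted so they run in CI. The sketch below audits an iterable of log lines for valid JSON, the four required fields, and the same sensitive keywords the `grep` command looks for. `audit_log_lines` is a hypothetical helper, and keyword matching of this kind will produce false positives (for example a message that merely mentions "token").

```python
import json
import re
from typing import Iterable, List

REQUIRED_FIELDS = ("timestamp", "level", "message", "service")
SENSITIVE = re.compile(r"password|secret|token", re.IGNORECASE)


def audit_log_lines(lines: Iterable[str]) -> List[str]:
    """Return human-readable findings; an empty list means the lines pass."""
    findings = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if SENSITIVE.search(line):
            findings.append(f"line {number}: possible sensitive data")
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            findings.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(entry, dict):
            findings.append(f"line {number}: not a JSON object")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            findings.append(f"line {number}: missing {', '.join(missing)}")
    return findings
```

It could be fed from a file with `audit_log_lines(open(path))`, failing the build whenever the returned list is non-empty.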

Language-Specific Logging

Rust (tracing)

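use tracing::{info, instrument};

// Emits nothing until a subscriber is installed, e.g.
// tracing_subscriber::fmt().json().init() (needs the "json" feature).
#[instrument]
fn process_request(id: &str) {
    info!(request_id = %id, "Processing request");
}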

TypeScript (pino)

import pino from "pino";

const logger = pino({
  level: "info",
  formatters: {
    level: (label) => ({ level: label }),
  },
});

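// Field names follow the snake_case log fields defined in this standard.
logger.info({ request_id: requestId, duration_ms: durationMs }, "Request processed");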

Python (structlog)

import structlog

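# structlog renders to the console by default; add
# structlog.processors.JSONRenderer() via structlog.configure() for JSON output.
logger = structlog.get_logger()
logger.info("request_processed", request_id=request_id, duration_ms=duration_ms)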

Additional Resources

Reference Files

  • references/logging-config.md - Logging configuration guide
  • references/metrics-guide.md - Metrics implementation patterns

Examples

  • examples/otel-config.yaml - OpenTelemetry configuration
  • examples/alerting-rules.yaml - Prometheus alerting rules
