observability-monitoring

by yonatangross

The Complete AI Development Toolkit for Claude Code — 159 skills, 34 agents, 20 commands, 144 hooks. Production-ready patterns for FastAPI, React 19, LangGraph, security, and testing.

29🍴 4📅 January 23, 2026
SKILL.md


name: Observability & Monitoring
description: Use when adding logging, metrics, tracing, or alerting to applications. Observability & Monitoring covers structured logging, Prometheus metrics, OpenTelemetry tracing, and alerting strategies.
tags: [observability, monitoring, metrics, logging, tracing]
context: fork
agent: metrics-architect
version: 1.0.0
category: Operations & Reliability
agents: [backend-system-architect, code-quality-reviewer, ai-ml-engineer]
keywords: [observability, monitoring, logging, metrics, tracing, alerts, Prometheus, OpenTelemetry]
author: OrchestKit
user-invocable: false

Observability & Monitoring Skill

Comprehensive frameworks for implementing observability including structured logging, metrics, distributed tracing, and alerting.

Overview

Use this skill when:

  • Setting up application monitoring
  • Implementing structured logging
  • Adding metrics and dashboards
  • Configuring distributed tracing
  • Creating alerting rules
  • Debugging production issues

Three Pillars of Observability

+-----------------+-----------------+-----------------+
|     LOGS        |     METRICS     |     TRACES      |
+-----------------+-----------------+-----------------+
| What happened   | How is system   | How do requests |
| at specific     | performing      | flow through    |
| point in time   | over time       | services        |
+-----------------+-----------------+-----------------+

References

Logging Patterns

See: references/logging-patterns.md

Key topics covered:

  • Correlation IDs for cross-service request tracking
  • Log sampling strategies for high-traffic systems
  • LogQL queries for Loki log aggregation
  • OrchestKit structlog configuration example
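
As a rough illustration of JSON logs carrying a correlation ID, the sketch below uses only the Python standard library rather than the structlog configuration in the reference file. The names `correlation_id`, `JsonFormatter`, `build_logger`, the logger name `orders`, and the JSON field names are assumptions made for this example.

```python
import contextvars
import io
import json
import logging
import uuid

# Hypothetical context variable holding the current request's correlation ID.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("orders")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
logger = build_logger(stream)

# Request middleware would normally set this from an incoming request header.
correlation_id.set(str(uuid.uuid4()))
logger.info("order created")
logger.info("payment authorized")

# Every line emitted while handling the request shares the same ID.
lines = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Because the ID lives in a context variable, code deeper in the call stack can log without passing the ID around explicitly.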

Metrics Collection

See: references/metrics-collection.md

Key topics covered:

  • Counter, Gauge, Histogram, Summary metric types
  • Cardinality management and limits
  • Custom business metrics (LLM tokens, cache hit rates)
  • LLM cost tracking with Prometheus

Distributed Tracing

See: references/distributed-tracing.md

Key topics covered:

  • OpenTelemetry setup and auto-instrumentation
  • Span relationships (parent/child, parallel)
  • Head-based and tail-based sampling strategies
  • Trace context propagation across services
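
To make trace context propagation concrete, here is a minimal sketch of reading and re-emitting a W3C `traceparent` header by hand. It assumes version `00` headers only. In practice an OpenTelemetry propagator would handle this, so treat `parse_traceparent`, `child_traceparent`, and `SpanContext` as hypothetical helpers for illustration.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "00" - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in the Trace Context format.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Keep the trace ID and sampled flag, mint a new span ID for the outgoing call."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
```

The trace ID stays constant across every hop, which is what lets a backend stitch spans from different services into one trace.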

Alerting and Dashboards

See: references/alerting-dashboards.md

Key topics covered:

  • Alert severity levels and response times
  • Alert grouping and inhibition rules
  • Escalation policies and runbook links
  • Golden Signals dashboard design
  • SLO/SLI definitions and error budgets
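
The error budget idea can be sketched with a little arithmetic. The example below assumes an availability SLO expressed as a fraction (for example 0.999) and a request-based SLI; the function names are hypothetical and the numbers are illustrative only.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 30 days tolerates roughly 43.2 minutes of downtime.
downtime_budget = error_budget_minutes(0.999, 30)

# 250 failures out of 1,000,000 requests spends about a quarter of the budget.
remaining = budget_remaining(0.999, 1_000_000, 250)
```

Alerting on how fast the budget is being spent, rather than on raw error counts, is one common way to tie alert severity to user impact.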

Quick Reference

Log Levels

| Level | Use Case                                |
|-------|-----------------------------------------|
| ERROR | Unhandled exceptions, failed operations |
| WARN  | Deprecated API, retry attempts          |
| INFO  | Business events, successful operations  |
| DEBUG | Development troubleshooting             |

RED Method (Rate, Errors, Duration)

Essential metrics for any service:

  • Rate - Requests per second
  • Errors - Failed requests per second
  • Duration - Request latency distribution
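
To show how the three RED numbers relate to raw request data, here is a small sketch that derives them from an in-memory list of requests. In a real service a metrics library would compute these continuously; `RequestRecord`, `percentile`, and `red_summary` are hypothetical names, the list is assumed to be non-empty, and 5xx responses are assumed to count as errors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestRecord:
    status: int        # HTTP status code
    duration_s: float  # request latency in seconds


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = (len(sorted_values) * pct + 99) // 100  # integer ceil(n * pct / 100)
    return sorted_values[max(rank, 1) - 1]


def red_summary(records: List[RequestRecord], window_s: float) -> dict:
    durations = sorted(r.duration_s for r in records)
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "rate_per_s": len(records) / window_s,    # Rate
        "errors_per_s": errors / window_s,        # Errors
        "p50_s": percentile(durations, 50),       # Duration (median)
        "p95_s": percentile(durations, 95),       # Duration (tail)
    }


# 100 requests observed over a 10 second window; every 20th one fails.
records = [
    RequestRecord(status=500 if i % 20 == 0 else 200, duration_s=i / 100)
    for i in range(1, 101)
]
summary = red_summary(records, window_s=10)
```

Reporting duration as percentiles rather than an average keeps slow outliers visible.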

Prometheus Buckets

// HTTP request latency (seconds)
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]

// Database query latency (seconds)
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]
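
Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts everything. The sketch below reproduces that counting in plain Python for the HTTP bucket list above. `cumulative_counts` is a hypothetical helper, not part of any client library, and the observations are made up.

```python
from bisect import bisect_left
from typing import Dict, List

HTTP_BUCKETS = [0.01, 0.05, 0.1, 0.5, 1, 2, 5]  # upper bounds in seconds


def cumulative_counts(observations: List[float], bounds: List[float]) -> Dict[float, int]:
    """Count observations per `le` bound, cumulatively, with a final +Inf bucket."""
    per_bucket = [0] * (len(bounds) + 1)  # last slot is the +Inf bucket
    for value in observations:
        # bisect_left finds the first bound >= value, i.e. the smallest
        # bucket whose upper bound still contains the observation.
        per_bucket[bisect_left(bounds, value)] += 1

    counts: Dict[float, int] = {}
    running = 0
    for bound, count in zip(bounds + [float("inf")], per_bucket):
        running += count
        counts[bound] = running
    return counts


latencies = [0.004, 0.03, 0.05, 0.2, 0.7, 3.0, 7.5]
counts = cumulative_counts(latencies, HTTP_BUCKETS)
```

Bounds should bracket the latencies you care about: if an alert threshold sits at 2 seconds, having a bucket edge at exactly 2 makes the corresponding quantile estimate more precise.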

Key Alerts

| Alert           | Condition       | Severity |
|-----------------|-----------------|----------|
| ServiceDown     | up == 0 for 1m  | Critical |
| HighErrorRate   | 5xx > 5% for 5m | Critical |
| HighLatency     | p95 > 2s for 5m | High     |
| LowCacheHitRate | < 70% for 10m   | Medium   |
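
The "for 5m" part of each condition means the threshold must be breached continuously for that long before the alert fires, which filters out brief spikes. The sketch below is a simplified model of that behavior, not Prometheus's actual rule evaluation; `is_firing` and the sample data are hypothetical.

```python
from typing import List, Tuple


def is_firing(samples: List[Tuple[float, float]], threshold: float, for_seconds: float) -> bool:
    """samples: (timestamp_seconds, value) pairs in time order.

    Fires only if every sample since some point has exceeded the threshold
    and that unbroken streak spans at least `for_seconds`.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # any recovery resets the pending timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


# Error ratio sampled once a minute; threshold 5%, must hold for 5 minutes.
sustained = [(0, 0.02), (60, 0.06), (120, 0.07), (180, 0.08),
             (240, 0.09), (300, 0.06), (360, 0.07)]
flapping = [(0, 0.06), (60, 0.06), (120, 0.01), (180, 0.06), (360, 0.06)]

sustained_fires = is_firing(sustained, threshold=0.05, for_seconds=300)
flapping_fires = is_firing(flapping, threshold=0.05, for_seconds=300)
```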

Health Checks (Kubernetes)

| Probe     | Purpose            | Endpoint |
|-----------|--------------------|----------|
| Liveness  | Is app running?    | /health  |
| Readiness | Ready for traffic? | /ready   |
| Startup   | Finished starting? | /startup |
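
The difference between the three probes is easiest to see as a decision function. The sketch below is framework-agnostic and assumes a hypothetical `AppState` with two flags; a real service would wire `probe_status` into its HTTP router and compute the flags from actual dependency checks.

```python
from dataclasses import dataclass
from http import HTTPStatus


@dataclass
class AppState:
    started: bool = False          # startup work (migrations, warmup) has finished
    dependencies_ok: bool = False  # e.g. database and cache are reachable


def probe_status(path: str, state: AppState) -> HTTPStatus:
    if path == "/health":
        # Liveness: the process can answer at all. Failing this restarts the pod,
        # so it deliberately ignores downstream dependencies.
        return HTTPStatus.OK
    if path == "/startup":
        return HTTPStatus.OK if state.started else HTTPStatus.SERVICE_UNAVAILABLE
    if path == "/ready":
        # Readiness: failing this only removes the pod from load balancing.
        ready = state.started and state.dependencies_ok
        return HTTPStatus.OK if ready else HTTPStatus.SERVICE_UNAVAILABLE
    return HTTPStatus.NOT_FOUND


booting = AppState(started=False, dependencies_ok=False)
degraded = AppState(started=True, dependencies_ok=False)
healthy = AppState(started=True, dependencies_ok=True)
```

Keeping dependency checks out of the liveness probe avoids restart loops when a shared database is briefly unavailable.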

Observability Checklist

Implementation

  • JSON structured logging
  • Request correlation IDs
  • RED metrics (Rate, Errors, Duration)
  • Business metrics
  • Distributed tracing
  • Health check endpoints

Alerting

  • Service outage alerts
  • Error rate thresholds
  • Latency thresholds
  • Resource utilization alerts

Dashboards

  • Service overview
  • Error analysis
  • Performance metrics

Templates Reference

| Template                 | Purpose                                 |
|--------------------------|-----------------------------------------|
| structured-logging.ts    | Winston logger with request middleware  |
| prometheus-metrics.ts    | HTTP, DB, cache metrics with middleware |
| opentelemetry-tracing.ts | Distributed tracing setup               |
| alerting-rules.yml       | Prometheus alerting rules               |
| health-checks.ts         | Liveness, readiness, startup probes     |

Langfuse Integration

For LLM observability, use Langfuse decorators:

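from langfuse.decorators import observe, langfuse_context

# AnalysisResult is the application's own return type, defined elsewhere.
@observe(name="analyze_content")
async def analyze_content(url: str) -> AnalysisResult:
    # Attach a name, user, and metadata to the trace opened by @observe.
    langfuse_context.update_current_trace(
        name="content_analysis",
        user_id="system",
        metadata={"url": url}
    )
    # ... workflow implementation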

See examples/orchestkit-monitoring-dashboard.md for real-world examples.

Extended Thinking Triggers

Use Opus 4.5 extended thinking for:

  • Incident investigation - Correlating logs, metrics, traces
  • Alert tuning - Reducing noise, catching real issues
  • Architecture decisions - Choosing monitoring solutions
  • Performance debugging - Cross-service latency analysis

Related Skills

  • defense-in-depth - Layer 8 observability as part of security architecture
  • devops-deployment - Observability integration with CI/CD and Kubernetes
  • resilience-patterns - Monitoring circuit breakers and failure scenarios
  • fastapi-advanced - FastAPI-specific middleware for logging and metrics

Key Decisions

| Decision          | Choice                                 | Rationale                                                     |
|-------------------|----------------------------------------|---------------------------------------------------------------|
| Log format        | Structured JSON                        | Machine-parseable, supports log aggregation, enables queries  |
| Metric types      | RED method (Rate, Errors, Duration)    | Industry standard, covers essential service health indicators |
| Tracing           | OpenTelemetry                          | Vendor-neutral, auto-instrumentation, broad ecosystem support |
| Alerting severity | 4 levels (Critical, High, Medium, Low) | Clear escalation paths, appropriate response times            |

Capability Details

structured-logging

Keywords: logging, structured log, json log, correlation id, log level, winston, pino, structlog

Solves:

  • How do I set up structured logging?
  • Implement correlation IDs across services
  • JSON logging best practices

correlation-tracking

Keywords: correlation id, request tracking, trace context, distributed logs

Solves:

  • How do I track requests across services?
  • Implement correlation IDs in middleware
  • Find all logs for a single request

log-sampling

Keywords: log sampling, high traffic logging, sampling rate, log volume

Solves:

  • How do I reduce log volume in production?
  • Sample INFO logs while keeping all errors
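
One way to sample INFO logs while keeping all errors is a logging filter that always passes WARNING and above and passes lower levels with a fixed probability. This is a minimal standard-library sketch; `SamplingFilter` and its `sample_rate` parameter are hypothetical names.

```python
import logging
import random
from typing import Optional


class SamplingFilter(logging.Filter):
    """Keep every WARNING+ record; keep lower levels with probability `sample_rate`."""

    def __init__(self, sample_rate: float, rng: Optional[random.Random] = None):
        super().__init__()
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        # random() is in [0.0, 1.0), so rate 0.0 drops all and 1.0 keeps all.
        return self._rng.random() < self.sample_rate


info_record = logging.makeLogRecord({"levelno": logging.INFO, "msg": "cache hit"})
error_record = logging.makeLogRecord({"levelno": logging.ERROR, "msg": "db timeout"})

drop_all_info = SamplingFilter(sample_rate=0.0)
keep_all_info = SamplingFilter(sample_rate=1.0)
```

Attach the filter to a handler with `handler.addFilter(SamplingFilter(0.1))` to keep roughly one in ten INFO lines. If complete request timelines matter, sampling per correlation ID instead of per line keeps all lines of a sampled request together.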

prometheus-metrics

Keywords: metrics, prometheus, counter, histogram, gauge, summary, red method

Solves:

  • How do I collect application metrics?
  • Implement RED method (Rate, Errors, Duration)
  • Choose between Counter, Gauge, Histogram

metric-types

Keywords: counter, gauge, histogram, summary, bucket, quantile

Solves:

  • When to use Counter vs Gauge?
  • Histogram vs Summary for latency
  • Configure histogram buckets

cardinality-management

Keywords: cardinality, label explosion, time series, prometheus performance

Solves:

  • How do I prevent label cardinality explosions?
  • Fix unbounded labels (user IDs, request IDs)
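
A common fix for unbounded labels is to normalize values before they reach the metric, for example by collapsing IDs in a URL path into placeholders so every user shares one time series. The sketch below assumes numeric and UUID path segments are the only unbounded parts; `route_label` and the placeholder names are hypothetical.

```python
import re

_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def route_label(path: str) -> str:
    """Replace per-entity path segments with fixed placeholders."""
    parts = []
    for segment in path.split("/"):
        if segment.isdigit():
            parts.append(":id")
        elif _UUID_RE.match(segment):
            parts.append(":uuid")
        else:
            parts.append(segment)
    return "/".join(parts)


label = route_label("/users/123/orders/9f1c2d3e-0000-4000-8000-123456789abc")

# A thousand distinct users still produce a single label value.
distinct_labels = {route_label(f"/users/{user_id}/profile") for user_id in range(1000)}
```

If the web framework exposes the matched route template, using that as the label is usually more robust than pattern matching on the raw path.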

distributed-tracing

Keywords: tracing, distributed tracing, opentelemetry, span, trace id, waterfall

Solves:

  • How do I implement distributed tracing?
  • OpenTelemetry setup with auto-instrumentation
  • Create manual spans for custom operations

trace-sampling

Keywords: trace sampling, head-based sampling, tail-based sampling

Solves:

  • How do I reduce trace volume?
  • Sample 10% of traces but keep all errors
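
Keeping all errors while sampling 10% of everything else implies the decision is made after the trace outcome is known, which is the tail-based approach. The sketch below shows only the decision rule, under the assumption that a complete trace and its error status are available; `sampling_score` and `keep_trace` are hypothetical, and production setups typically delegate this to the tracing SDK or collector.

```python
import hashlib


def sampling_score(trace_id: str) -> float:
    """Map a trace ID to a stable, roughly uniform value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def keep_trace(trace_id: str, had_error: bool, rate: float = 0.10) -> bool:
    if had_error:
        return True  # never drop a trace that contains an error
    # Hashing the trace ID makes every service reach the same decision
    # for the same trace without coordination.
    return sampling_score(trace_id) < rate


trace_ids = [f"{i:032x}" for i in range(10_000)]
kept_fraction = sum(keep_trace(t, had_error=False) for t in trace_ids) / len(trace_ids)
```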

alerting-strategy

Keywords: alert, alerting, notification, threshold, pagerduty, slack, severity

Solves:

  • How do I set up effective alerts?
  • Define alert severity levels (P1-P4)

alert-fatigue-prevention

Keywords: alert fatigue, alert grouping, inhibition, escalation

Solves:

  • How do I reduce alert noise?
  • Group related alerts together

dashboards

Keywords: dashboard, visualization, grafana, golden signals, red method

Solves:

  • How do I create monitoring dashboards?
  • Design Golden Signals dashboard layout
  • Build SLO/SLI dashboards

health-checks

Keywords: health check, liveness, readiness, startup probe, kubernetes

Solves:

  • How do I implement health check endpoints?
  • Difference between liveness and readiness

langfuse-observability

Keywords: langfuse, llm observability, llm tracing, token usage, llm cost tracking

Solves:

  • How do I monitor LLM calls with Langfuse?
  • Track LLM token usage and cost

llm-cost-tracking

Keywords: llm cost, token tracking, cost optimization, prometheus llm metrics

Solves:

  • How do I track LLM costs with Prometheus?
  • Measure token usage by model and operation
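
The bookkeeping behind token and cost metrics can be sketched without any metrics library: accumulate totals keyed by model and operation, which is the same shape a labeled counter would have. The model names and per-token prices below are made up for illustration and are not real pricing; `record_llm_call` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical USD prices per 1,000 tokens as (input, output). Not real price data.
PRICES_PER_1K: Dict[str, Tuple[float, float]] = {
    "model-small": (0.0005, 0.0015),
    "model-large": (0.003, 0.015),
}

# Keys mirror the labels a counter metric would carry.
tokens_total: Dict[Tuple[str, str, str], int] = defaultdict(int)
cost_usd_total: Dict[Tuple[str, str], float] = defaultdict(float)


def record_llm_call(model: str, operation: str, input_tokens: int, output_tokens: int) -> float:
    """Accumulate token and cost totals for one call and return that call's cost."""
    in_price, out_price = PRICES_PER_1K[model]
    cost = input_tokens / 1000 * in_price + output_tokens / 1000 * out_price
    tokens_total[(model, operation, "input")] += input_tokens
    tokens_total[(model, operation, "output")] += output_tokens
    cost_usd_total[(model, operation)] += cost
    return cost


call_cost = record_llm_call("model-large", "summarize", input_tokens=2000, output_tokens=500)
record_llm_call("model-small", "classify", input_tokens=400, output_tokens=20)
```

Keeping `model` and `operation` as the only labels keeps cardinality bounded; per-user or per-request cost is better sent to logs or traces.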
