SKILL.md


---
name: langfuse-observability
description: LLM observability platform for tracing, evaluation, prompt management, and cost tracking. Use when setting up Langfuse, monitoring LLM costs, tracking token usage, or implementing prompt versioning.
context: fork
agent: metrics-architect
version: 1.0.0
author: OrchestKit AI Agent Hub
tags: [langfuse, llm, observability, tracing, evaluation, prompts, 2026]
user-invocable: false
---

Langfuse Observability

Overview

Langfuse is the open-source LLM observability platform that OrchestKit uses for tracing, monitoring, evaluation, and prompt management. Unlike LangSmith (which OrchestKit deprecated and migrated away from in Dec 2025), Langfuse can be self-hosted, is free as open source, and is designed for production LLM applications.

When to use this skill:

  • Setting up LLM observability from scratch
  • Debugging slow or incorrect LLM responses
  • Tracking token usage and costs
  • Managing prompts in production
  • Evaluating LLM output quality
  • Migrating from LangSmith to Langfuse

OrchestKit Integration:

  • Status: Migrated from LangSmith (Dec 2025)
  • Location: backend/app/shared/services/langfuse/
  • MCP Server: orchestkit-langfuse (optional)

Quick Start

Setup

# backend/app/shared/services/langfuse/client.py
from langfuse import Langfuse
from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST  # Self-hosted or cloud
)
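
The `settings` import above is project-specific. A minimal sketch of the corresponding config fields, assuming a pydantic-settings based `Settings` class (field names chosen to match the client code above):

# backend/app/core/config.py (sketch -- assumes pydantic-settings)
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    LANGFUSE_PUBLIC_KEY: str = ""
    LANGFUSE_SECRET_KEY: str = ""
    LANGFUSE_HOST: str = "https://cloud.langfuse.com"  # or your self-hosted URL

settings = Settings()  # values are read from environment variables / .env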

Basic Tracing with @observe

from langfuse.decorators import observe, langfuse_context

@observe()  # Automatic tracing
async def analyze_content(content: str):
    langfuse_context.update_current_observation(
        metadata={"content_length": len(content)}
    )
    return await llm.generate(content)
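
Nested calls produce parent-child spans automatically, and a function can be typed as a generation. A minimal sketch, assuming the SDK v2 decorator API used above (`llm` is the same assumed helper as in the snippet above):

from langfuse.decorators import observe

@observe()  # parent span for the whole pipeline
async def run_pipeline(content: str):
    cleaned = await preprocess(content)   # recorded as a child span
    return await summarize(cleaned)       # recorded as a child generation

@observe()
async def preprocess(content: str) -> str:
    return content.strip()

@observe(as_type="generation")  # mark this observation as an LLM generation
async def summarize(content: str) -> str:
    return await llm.generate(content)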

Session & User Tracking

langfuse_client.trace(
    name="analysis",
    user_id="user_123",
    session_id="session_abc",
    metadata={"content_type": "article", "agent_count": 8},
    tags=["production", "orchestkit"]
)
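
When using the @observe decorator instead of the low-level client, the same attributes can be attached from inside the traced function. A sketch, assuming the SDK v2 API shown above:

from langfuse.decorators import observe, langfuse_context

@observe()
async def analyze_for_user(content: str, user_id: str, session_id: str):
    # Attach user, session, tags, and metadata to the current trace
    langfuse_context.update_current_trace(
        user_id=user_id,
        session_id=session_id,
        metadata={"content_type": "article"},
        tags=["production", "orchestkit"],
    )
    return await llm.generate(content)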

Core Features Summary

| Feature | Description | Reference |
|---|---|---|
| Distributed Tracing | Track LLM calls with parent-child spans | references/tracing-setup.md |
| Cost Tracking | Automatic token & cost calculation | references/cost-tracking.md |
| Prompt Management | Version control for prompts | references/prompt-management.md |
| LLM Evaluation | Custom scoring with G-Eval | references/evaluation-scores.md |
| Session Tracking | Group related traces | references/session-tracking.md |
| Experiments API | A/B testing & benchmarks | references/experiments-api.md |
| Multi-Judge Eval | Ensemble LLM evaluation | references/multi-judge-evaluation.md |

References

Tracing Setup

See: references/tracing-setup.md

Key topics covered:

  • Initializing Langfuse client with @observe decorator
  • Creating nested traces and spans
  • Tracking LLM generations with metadata
  • LangChain/LangGraph CallbackHandler integration
  • Workflow integration patterns
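
A hedged sketch of the CallbackHandler integration listed above, assuming the SDK v2 import path and a compiled LangGraph graph named `graph`:

from langfuse.callback import CallbackHandler
from app.core.config import settings

handler = CallbackHandler(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST,
)

# Every chain / graph node invocation is traced as a span under one trace
result = graph.invoke({"input": "..."}, config={"callbacks": [handler]})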

Cost Tracking

See: references/cost-tracking.md

Key topics covered:

  • Automatic cost calculation from token usage
  • Custom model pricing configuration
  • Monitoring dashboard SQL queries
  • Cost tracking per analysis/user
  • Daily cost trend analysis
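
Cost is derived from the model name plus reported token usage. A minimal sketch using the low-level client from the setup above (model name and token counts are illustrative):

trace = langfuse_client.trace(name="analysis", user_id="user_123")

generation = trace.generation(
    name="summarize",
    model="gpt-4o-mini",  # must match a model defined in Langfuse's pricing table
    input=[{"role": "user", "content": "..."}],
)

# ...call the LLM, then close the generation with output and token usage...
generation.end(
    output="...",
    usage={"input": 812, "output": 153},  # token counts; cost is computed from model pricing
)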

Prompt Management

See: references/prompt-management.md

Key topics covered:

  • Prompt versioning and labels (production/staging/draft)
  • Template variables with Jinja2 syntax
  • A/B testing prompt versions
  • OrchestKit 4-level caching architecture (L1-L4)
  • Linking prompts to generation spans
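
A sketch of fetching and compiling a managed prompt, then linking it to a generation (the prompt name `content-analysis` and its variables are hypothetical):

# Fetch the version currently labeled "production" and render its variables
prompt = langfuse_client.get_prompt("content-analysis", label="production")
rendered = prompt.compile(content_type="article", content="Example article body")

# Passing the prompt object attributes this generation to that prompt version
trace = langfuse_client.trace(name="analysis")
trace.generation(
    name="analyze",
    model="gpt-4o-mini",
    input=rendered,
    prompt=prompt,
)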

LLM Evaluation

See: references/evaluation-scores.md

Key topics covered:

  • Custom scoring with numeric/categorical values
  • G-Eval automated quality assessment
  • Score trends and comparisons
  • Filtering traces by score thresholds
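
A minimal scoring sketch using the client from the setup section (score name and value are illustrative; `trace_id` is assumed to come from an existing trace):

langfuse_client.score(
    trace_id=trace_id,  # id of the trace being evaluated
    name="relevance",
    value=0.87,         # numeric score; categorical string values are also supported
    comment="G-Eval relevance judgment",
)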

Session Tracking

See: references/session-tracking.md

Key topics covered:

  • Grouping traces by session_id
  • Multi-turn conversation tracking
  • User and metadata analytics

Experiments API

See: references/experiments-api.md

Key topics covered:

  • Creating test datasets in Langfuse
  • Running automated evaluations
  • Regression testing for LLMs
  • Benchmarking prompt versions
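
A sketch of the dataset workflow listed above, assuming the SDK v2 client methods (dataset and run names are hypothetical):

# One-time: create a dataset and add items with expected outputs
langfuse_client.create_dataset(name="regression-v1")
langfuse_client.create_dataset_item(
    dataset_name="regression-v1",
    input={"content": "Example article text"},
    expected_output={"summary": "Expected summary"},
)

# Per experiment run: execute the app on each item and link the resulting traces
dataset = langfuse_client.get_dataset("regression-v1")
for item in dataset.items:
    trace = langfuse_client.trace(name="regression-run", input=item.input)
    # ...run the pipeline on item.input, score the output against item.expected_output...
    item.link(trace, run_name="prompt-v2-baseline")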

Multi-Judge Evaluation

See: references/multi-judge-evaluation.md

Key topics covered:

  • Multiple LLM judges for quality assessment
  • Weighted scoring across judges
  • OrchestKit langfuse_evaluators.py integration
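
A hypothetical sketch of weighted multi-judge aggregation (judge names, weights, and the score name are illustrative; OrchestKit's own langfuse_evaluators.py may differ):

# Hypothetical judge outputs (0-1 quality ratings from separate evaluator LLMs)
judge_scores = {"gpt-4o": 0.90, "claude-sonnet": 0.84, "gemini-pro": 0.78}
weights = {"gpt-4o": 0.5, "claude-sonnet": 0.3, "gemini-pro": 0.2}

# Weighted ensemble score
weighted = sum(judge_scores[j] * weights[j] for j in judge_scores)

# Record the aggregate on the evaluated trace
langfuse_client.score(
    trace_id=trace_id,
    name="multi_judge_quality",
    value=round(weighted, 3),
    comment=f"Weighted ensemble of {len(judge_scores)} judges",
)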

Best Practices

  1. Always use @observe decorator for automatic tracing
  2. Set user_id and session_id for better analytics
  3. Add meaningful metadata (content_type, analysis_id, etc.)
  4. Score all production traces for quality monitoring
  5. Use prompt management instead of hardcoded prompts
  6. Monitor costs daily to catch spikes early
  7. Create datasets for regression testing
  8. Tag production vs staging traces

LangSmith Migration Notes

Key Differences:

| Aspect | Langfuse | LangSmith |
|---|---|---|
| Hosting | Self-hosted, open-source | Cloud-only, proprietary |
| Cost | Free | Paid |
| Prompts | Built-in management | External storage needed |
| Decorator | @observe | @traceable |
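
The decorator swap is usually the bulk of the code change; a minimal before/after sketch:

# Before (LangSmith)
from langsmith import traceable

@traceable
async def analyze_content(content: str):
    ...

# After (Langfuse)
from langfuse.decorators import observe

@observe()
async def analyze_content(content: str):
    ...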

External References


  • observability-monitoring - General observability patterns for metrics, logging, and alerting
  • llm-evaluation - Evaluation patterns that integrate with Langfuse scoring
  • llm-streaming - Streaming response patterns with trace instrumentation
  • prompt-caching - Caching strategies that reduce costs tracked by Langfuse

Key Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Observability platform | Langfuse (not LangSmith) | Open-source, self-hosted, free, built-in prompt management |
| Tracing approach | @observe decorator | Automatic, low-overhead instrumentation |
| Cost tracking | Automatic token counting | Built-in model pricing with custom overrides |
| Prompt management | Langfuse native | Version control, A/B testing, labels in one place |

Capability Details

distributed-tracing

Keywords: trace, tracing, observability, span, nested, parent-child, observe

Solves:

  • How do I trace LLM calls across my application?
  • How to debug slow LLM responses?
  • Track execution flow in multi-agent workflows
  • Create nested trace spans

cost-tracking

Keywords: cost, token usage, pricing, budget, spend, expense

Solves:

  • How do I track LLM costs?
  • Calculate token usage and pricing
  • Monitor AI budget and spending
  • Track cost per user or session

prompt-management

Keywords: prompt version, prompt template, prompt control, prompt registry

Solves:

  • How do I version control prompts?
  • Manage prompts in production
  • A/B test different prompt versions
  • Link prompts to traces

llm-evaluation

Keywords: score, quality, evaluation, rating, assessment, g-eval

Solves:

  • How do I evaluate LLM output quality?
  • Score responses with custom metrics
  • Track quality trends over time
  • Compare prompt versions by quality

session-tracking

Keywords: session, user tracking, conversation, group traces

Solves:

  • How do I group related traces?
  • Track multi-turn conversations
  • Monitor per-user performance
  • Organize traces by session

langchain-integration

Keywords: langchain, callback, handler, langgraph integration

Solves:

  • How do I integrate Langfuse with LangChain?
  • Use CallbackHandler for tracing
  • Automatic LangGraph workflow tracing
  • LangChain observability setup

datasets-evaluation

Keywords: dataset, test set, evaluation dataset, benchmark

Solves:

  • How do I create test datasets in Langfuse?
  • Run automated evaluations
  • Regression testing for LLMs
  • Benchmark prompt versions

ab-testing

Keywords: a/b test, experiment, compare prompts, variant testing

Solves:

  • How do I A/B test prompts?
  • Compare two prompt versions
  • Experimental prompt evaluation
  • Statistical prompt testing

monitoring-dashboard

Keywords: dashboard, analytics, metrics, monitoring, queries

Solves:

  • What are the most expensive traces?
  • Average cost by agent type
  • Quality score trends
  • Custom monitoring queries

orchestkit-integration

Keywords: orchestkit, migration, setup, workflow integration

Solves:

  • How does OrchestKit use Langfuse?
  • Migrate from LangSmith to Langfuse
  • OrchestKit workflow tracing patterns
  • Cost tracking per analysis

multi-judge-evaluation

Keywords: multi judge, g-eval, multiple evaluators, ensemble evaluation, weighted scoring

Solves:

  • How do I use multiple LLM judges to evaluate quality?
  • Set up G-Eval criteria evaluation
  • Configure weighted scoring across judges
  • Wire OrchestKit's existing langfuse_evaluators.py

experiments-api

Keywords: experiment, dataset, benchmark, regression test, prompt testing

Solves:

  • How do I run experiments across datasets?
  • A/B test models and prompts systematically
  • Track quality regression over time
  • Compare experiment results
