
---
name: langfuse-observability
description: LLM observability platform for tracing, evaluation, prompt management, and cost tracking. Use when setting up Langfuse, monitoring LLM costs, tracking token usage, or implementing prompt versioning.
context: fork
agent: metrics-architect
version: 1.0.0
author: OrchestKit AI Agent Hub
tags: [langfuse, llm, observability, tracing, evaluation, prompts, 2026]
user-invocable: false
---
# Langfuse Observability

## Overview

Langfuse is the open-source LLM observability platform that OrchestKit uses for tracing, monitoring, evaluation, and prompt management. Unlike LangSmith (now deprecated in OrchestKit), Langfuse is open-source, free to self-host, and designed for production LLM applications.
When to use this skill:
- Setting up LLM observability from scratch
- Debugging slow or incorrect LLM responses
- Tracking token usage and costs
- Managing prompts in production
- Evaluating LLM output quality
- Migrating from LangSmith to Langfuse
OrchestKit Integration:
- Status: Migrated from LangSmith (Dec 2025)
- Location: `backend/app/shared/services/langfuse/`
- MCP Server: `orchestkit-langfuse` (optional)
## Quick Start

### Setup

```python
# backend/app/shared/services/langfuse/client.py
from langfuse import Langfuse

from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST,  # Self-hosted or cloud
)
```
### Basic Tracing with `@observe`

```python
from langfuse.decorators import observe, langfuse_context


@observe()  # Automatic tracing
async def analyze_content(content: str):
    langfuse_context.update_current_observation(
        metadata={"content_length": len(content)}
    )
    return await llm.generate(content)
```
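
Functions decorated with `@observe` that call each other produce nested parent-child spans automatically. A hedged sketch of the pattern; `retrieve_context`, `generate_summary`, and the `llm` client are illustrative, not OrchestKit code:

```python
from langfuse.decorators import observe, langfuse_context


@observe()  # Child span, nested under the caller's trace
async def retrieve_context(query: str) -> str:
    return "retrieved context"


@observe(as_type="generation")  # Recorded as an LLM generation rather than a plain span
async def generate_summary(prompt: str) -> str:
    langfuse_context.update_current_observation(model="gpt-4o-mini")
    return await llm.generate(prompt)


@observe()  # Root trace
async def analyze_content(content: str) -> str:
    context = await retrieve_context(content)
    return await generate_summary(f"{context}\n\n{content}")
```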
### Session & User Tracking

```python
langfuse_client.trace(
    name="analysis",
    user_id="user_123",
    session_id="session_abc",
    metadata={"content_type": "article", "agent_count": 8},
    tags=["production", "orchestkit"],
)
```
## Core Features Summary

| Feature | Description | Reference |
|---|---|---|
| Distributed Tracing | Track LLM calls with parent-child spans | references/tracing-setup.md |
| Cost Tracking | Automatic token & cost calculation | references/cost-tracking.md |
| Prompt Management | Version control for prompts | references/prompt-management.md |
| LLM Evaluation | Custom scoring with G-Eval | references/evaluation-scores.md |
| Session Tracking | Group related traces | references/session-tracking.md |
| Experiments API | A/B testing & benchmarks | references/experiments-api.md |
| Multi-Judge Eval | Ensemble LLM evaluation | references/multi-judge-evaluation.md |
## References

### Tracing Setup

See: `references/tracing-setup.md`
Key topics covered:
- Initializing Langfuse client with @observe decorator
- Creating nested traces and spans
- Tracking LLM generations with metadata
- LangChain/LangGraph CallbackHandler integration
- Workflow integration patterns
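
For the LangChain/LangGraph integration mentioned above, a hedged sketch of CallbackHandler wiring with the v2 SDK; the `chain` object stands in for any LangChain runnable or compiled LangGraph graph:

```python
from langfuse.callback import CallbackHandler

# Per-request handler; carrying user_id/session_id keeps chain runs grouped
# with the rest of the session's traces.
langfuse_handler = CallbackHandler(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST,
    user_id="user_123",
    session_id="session_abc",
)

# Any LangChain runnable (including a compiled LangGraph graph) accepts the
# handler through the standard callbacks config.
result = chain.invoke(
    {"content": "article text"},
    config={"callbacks": [langfuse_handler]},
)
```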
### Cost Tracking

See: `references/cost-tracking.md`
Key topics covered:
- Automatic cost calculation from token usage
- Custom model pricing configuration
- Monitoring dashboard SQL queries
- Cost tracking per analysis/user
- Daily cost trend analysis
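
Langfuse prices a call from the generation's model name plus its reported token usage. A hedged v2-style sketch of reporting both explicitly when the call isn't auto-instrumented; names and numbers are illustrative:

```python
trace = langfuse_client.trace(name="analysis", user_id="user_123")

generation = trace.generation(
    name="summarize",
    model="gpt-4o-mini",  # must match a built-in or custom model price in Langfuse
    input=[{"role": "user", "content": "Summarize this article..."}],
)

# ... call the LLM ...

generation.end(
    output="The article argues that...",
    usage={"input": 512, "output": 128, "unit": "TOKENS"},
)
```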
### Prompt Management

See: `references/prompt-management.md`
Key topics covered:
- Prompt versioning and labels (production/staging/draft)
- Template variables with Jinja2 syntax
- A/B testing prompt versions
- OrchestKit 4-level caching architecture (L1-L4)
- Linking prompts to generation spans
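
A hedged sketch of fetching a managed prompt by label, compiling its variables, and linking it to a generation (v2 SDK; the prompt name and variables are illustrative):

```python
# Fetch the version currently labeled "production" (client-side caching applies).
prompt = langfuse_client.get_prompt("content-analysis", label="production")

# Fill template variables.
compiled = prompt.compile(content_type="article", tone="neutral")

trace = langfuse_client.trace(name="analysis")
generation = trace.generation(
    name="analyze",
    model="gpt-4o-mini",
    input=compiled,
    prompt=prompt,  # links this generation to the prompt version in the UI
)
```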
### LLM Evaluation

See: `references/evaluation-scores.md`
Key topics covered:
- Custom scoring with numeric/categorical values
- G-Eval automated quality assessment
- Score trends and comparisons
- Filtering traces by score thresholds
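
A hedged sketch of attaching scores, both from inside an `@observe`-decorated function and after the fact against a known trace id (v2 SDK; score names and values are illustrative):

```python
from langfuse.decorators import langfuse_context, observe


@observe()
async def analyze_content(content: str):
    result = await llm.generate(content)
    # Score the currently active trace from inside the decorated function.
    langfuse_context.score_current_trace(name="relevance", value=0.9)
    return result


# Or score later, e.g. from an evaluation job, given a trace id.
langfuse_client.score(
    trace_id="trace_abc",
    name="quality",
    value=0.8,
    comment="G-Eval composite score",
)
```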
### Session Tracking

See: `references/session-tracking.md`
Key topics covered:
- Grouping traces by session_id
- Multi-turn conversation tracking
- User and metadata analytics
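
With the decorator API, session and user can be attached to the current trace so multi-turn conversations group together; a hedged sketch (`handle_turn` and `llm` are illustrative):

```python
from langfuse.decorators import langfuse_context, observe


@observe()
async def handle_turn(session_id: str, user_id: str, message: str):
    # Every turn tagged with the same session_id shows up as one conversation
    # in the Langfuse session view.
    langfuse_context.update_current_trace(
        session_id=session_id,
        user_id=user_id,
        tags=["production"],
    )
    return await llm.generate(message)
```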
### Experiments API

See: `references/experiments-api.md`
Key topics covered:
- Creating test datasets in Langfuse
- Running automated evaluations
- Regression testing for LLMs
- Benchmarking prompt versions
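
A hedged sketch of creating a dataset and running an experiment over its items with the v2 SDK; the dataset name, run name, `run_analysis` call, and the exact-match scoring are illustrative:

```python
# One-time: create a dataset and add items.
langfuse_client.create_dataset(name="analysis-regression")
langfuse_client.create_dataset_item(
    dataset_name="analysis-regression",
    input={"content": "example article"},
    expected_output={"category": "tech"},
)

# Experiment run: execute the app against each item and link traces to the run.
dataset = langfuse_client.get_dataset("analysis-regression")
for item in dataset.items:
    with item.observe(run_name="prompt-v2") as trace_id:
        output = run_analysis(item.input)  # illustrative application call
        langfuse_client.score(
            trace_id=trace_id,
            name="exact_match",
            value=float(output == item.expected_output),
        )
```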
### Multi-Judge Evaluation

See: `references/multi-judge-evaluation.md`
Key topics covered:
- Multiple LLM judges for quality assessment
- Weighted scoring across judges
- OrchestKit langfuse_evaluators.py integration
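
A hedged sketch of weighted ensemble scoring: each judge returns a 0-1 score, and both the individual scores and the weighted aggregate are posted to the trace. The judge callables are illustrative stand-ins for OrchestKit's `langfuse_evaluators.py`:

```python
# Illustrative judge callables: each takes (output, reference) and returns 0-1.
judges = {
    "relevance_judge": (judge_relevance, 0.5),
    "faithfulness_judge": (judge_faithfulness, 0.3),
    "style_judge": (judge_style, 0.2),
}


def multi_judge_score(trace_id: str, output: str, reference: str) -> float:
    weighted_total = 0.0
    for name, (judge, weight) in judges.items():
        score = judge(output, reference)  # individual LLM-as-judge call
        langfuse_client.score(trace_id=trace_id, name=name, value=score)
        weighted_total += weight * score
    # Post the aggregate so dashboards can filter on a single number.
    langfuse_client.score(trace_id=trace_id, name="ensemble_quality", value=weighted_total)
    return weighted_total
```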
## Best Practices

- Always use the `@observe` decorator for automatic tracing
- Set `user_id` and `session_id` for better analytics
- Add meaningful metadata (`content_type`, `analysis_id`, etc.)
- Score all production traces for quality monitoring
- Use prompt management instead of hardcoded prompts
- Monitor costs daily to catch spikes early
- Create datasets for regression testing
- Tag production vs. staging traces
## LangSmith Migration Notes

Key Differences:
| Aspect | Langfuse | LangSmith |
|---|---|---|
| Hosting | Self-hosted, open-source | Cloud-only, proprietary |
| Cost | Free | Paid |
| Prompts | Built-in management | External storage needed |
| Decorator | @observe | @traceable |
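
In practice the code change is mostly an import and decorator swap; a hedged before/after sketch:

```python
# Before (LangSmith)
from langsmith import traceable


@traceable(name="analyze_content")
async def analyze_content(content: str): ...


# After (Langfuse)
from langfuse.decorators import observe


@observe(name="analyze_content")
async def analyze_content(content: str): ...
```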
## External References

### Related Skills

- `observability-monitoring` - General observability patterns for metrics, logging, and alerting
- `llm-evaluation` - Evaluation patterns that integrate with Langfuse scoring
- `llm-streaming` - Streaming response patterns with trace instrumentation
- `prompt-caching` - Caching strategies that reduce costs tracked by Langfuse
## Key Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Observability platform | Langfuse (not LangSmith) | Open-source, self-hosted, free, built-in prompt management |
| Tracing approach | @observe decorator | Automatic, low-overhead instrumentation |
| Cost tracking | Automatic token counting | Built-in model pricing with custom overrides |
| Prompt management | Langfuse native | Version control, A/B testing, labels in one place |
## Capability Details

### distributed-tracing

Keywords: trace, tracing, observability, span, nested, parent-child, observe

Solves:
- How do I trace LLM calls across my application?
- How to debug slow LLM responses?
- Track execution flow in multi-agent workflows
- Create nested trace spans
### cost-tracking

Keywords: cost, token usage, pricing, budget, spend, expense

Solves:
- How do I track LLM costs?
- Calculate token usage and pricing
- Monitor AI budget and spending
- Track cost per user or session
### prompt-management

Keywords: prompt version, prompt template, prompt control, prompt registry

Solves:
- How do I version control prompts?
- Manage prompts in production
- A/B test different prompt versions
- Link prompts to traces
### llm-evaluation

Keywords: score, quality, evaluation, rating, assessment, g-eval

Solves:
- How do I evaluate LLM output quality?
- Score responses with custom metrics
- Track quality trends over time
- Compare prompt versions by quality
### session-tracking

Keywords: session, user tracking, conversation, group traces

Solves:
- How do I group related traces?
- Track multi-turn conversations
- Monitor per-user performance
- Organize traces by session
### langchain-integration

Keywords: langchain, callback, handler, langgraph integration

Solves:
- How do I integrate Langfuse with LangChain?
- Use CallbackHandler for tracing
- Automatic LangGraph workflow tracing
- LangChain observability setup
### datasets-evaluation

Keywords: dataset, test set, evaluation dataset, benchmark

Solves:
- How do I create test datasets in Langfuse?
- Run automated evaluations
- Regression testing for LLMs
- Benchmark prompt versions
### ab-testing

Keywords: a/b test, experiment, compare prompts, variant testing

Solves:
- How do I A/B test prompts?
- Compare two prompt versions
- Experimental prompt evaluation
- Statistical prompt testing
### monitoring-dashboard

Keywords: dashboard, analytics, metrics, monitoring, queries

Solves:
- What are the most expensive traces?
- Average cost by agent type
- Quality score trends
- Custom monitoring queries
### orchestkit-integration

Keywords: orchestkit, migration, setup, workflow integration

Solves:
- How does OrchestKit use Langfuse?
- Migrate from LangSmith to Langfuse
- OrchestKit workflow tracing patterns
- Cost tracking per analysis
### multi-judge-evaluation

Keywords: multi judge, g-eval, multiple evaluators, ensemble evaluation, weighted scoring

Solves:
- How do I use multiple LLM judges to evaluate quality?
- Set up G-Eval criteria evaluation
- Configure weighted scoring across judges
- Wire OrchestKit's existing langfuse_evaluators.py
### experiments-api

Keywords: experiment, dataset, benchmark, regression test, prompt testing

Solves:
- How do I run experiments across datasets?
- A/B test models and prompts systematically
- Track quality regression over time
- Compare experiment results
