
---
name: high-performance-inference
description: High-performance LLM inference with vLLM, quantization (AWQ, GPTQ, FP8), speculative decoding, and edge deployment. Use when optimizing inference latency, throughput, or memory.
version: 1.0.0
tags: [vllm, quantization, inference, performance, edge, speculative, 2026]
context: fork
agent: llm-integrator
author: OrchestKit
user-invocable: false
---
High-Performance Inference
Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.
vLLM 0.14.0 (Jan 2026): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.
Overview
Use this skill when:
- Deploying LLMs with low latency requirements
- Reducing GPU memory for larger models
- Maximizing throughput for batch inference
- Edge/mobile deployment with constrained resources
- Cost optimization through efficient hardware utilization
Quick Reference
# Basic vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192
# With quantization + speculative decoding
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--quantization awq \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 5}' \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9
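Once the server is up, it exposes an OpenAI-compatible API. A minimal client sketch, assuming the openai Python package and the default local endpoint on port 8000:

```python
# Sketch: query a vLLM server through its OpenAI-compatible endpoint.
# Assumes the serve command above is running locally on the default port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key unless --api-key is set

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```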
vLLM 0.14.x Key Features
| Feature | Benefit |
|---|---|
| PagedAttention | Up to 24x throughput via efficient KV cache |
| Continuous Batching | Dynamic request batching for max utilization |
| CUDA Graphs | Fast model execution with graph capture |
| Tensor Parallelism | Scale across multiple GPUs |
| Prefix Caching | Reuse KV cache for shared prefixes |
| AttentionConfig | New API replacing the VLLM_ATTENTION_BACKEND environment variable |
| Semantic Router | vLLM SR v0.1 "Iris" for intelligent LLM routing |
Python vLLM Integration
from vllm import LLM, SamplingParams

# Initialize with optimization flags
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="awq",  # requires an AWQ-quantized checkpoint; point at an AWQ export of the model
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

# Generate
prompts = ["Explain PagedAttention in one paragraph."]  # example prompt
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
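The enable_prefix_caching flag above pays off when many requests share a long identical prefix, such as a fixed system preamble. A minimal sketch of that pattern, with the preamble and questions purely illustrative:

```python
# Sketch: shared prefix + varying questions lets vLLM reuse the prefix's KV cache.
# Reuse happens automatically when enable_prefix_caching=True and prompts share
# a long identical prefix; the text below is illustrative.
preamble = "You are a support assistant for ACME Corp. Answer using company policy.\n\n"

questions = [
    "How do I reset my password?",
    "What is the refund window?",
    "How do I reach a human agent?",
]

prompts = [preamble + q for q in questions]
outputs = llm.generate(prompts, sampling_params)  # llm and sampling_params from the block above
for output in outputs:
    print(output.outputs[0].text)
```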
Quantization Methods
| Method | Bits | Memory Savings | Speed | Quality |
|---|---|---|---|---|
| FP16 | 16 | Baseline | Baseline | Best |
| INT8 | 8 | 50% | +10-20% | Very Good |
| AWQ | 4 | 75% | +20-40% | Good |
| GPTQ | 4 | 75% | +15-30% | Good |
| FP8 | 8 | 50% | +30-50% | Very Good |
When to Use Each:
- FP16: Maximum quality when memory is sufficient
- INT8/FP8: Balance of quality and efficiency
- AWQ: Best 4-bit quality, activation-aware
- GPTQ: Faster quantization, good quality
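In vLLM the method is selected via the quantization argument (or --quantization on the CLI). A hedged sketch under the assumption that AWQ/GPTQ need a pre-quantized checkpoint, while FP8 can be applied dynamically to an unquantized model on H100/H200-class GPUs; the repository names are illustrative and the two engines are alternatives, not meant to coexist in one process:

```python
from vllm import LLM

# AWQ/GPTQ: load a checkpoint that was already quantized with that method.
awq_llm = LLM(
    model="your-org/Meta-Llama-3.1-8B-Instruct-AWQ",  # illustrative AWQ export
    quantization="awq",
)

# FP8: on Hopper-class GPUs vLLM can apply dynamic FP8 quantization to a
# standard FP16/BF16 checkpoint.
fp8_llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="fp8",
)
```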
Speculative Decoding
Accelerate generation by predicting multiple tokens:
# N-gram based (no extra model)
speculative_config = {
"method": "ngram",
"num_speculative_tokens": 5,
"prompt_lookup_max": 5,
"prompt_lookup_min": 2,
}
# Draft model (higher quality)
speculative_config = {
"method": "draft_model",
"draft_model": "meta-llama/Llama-3.2-1B-Instruct",
"num_speculative_tokens": 3,
}
Expected Gains: 1.5-2.5x throughput for autoregressive tasks.
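A sketch of wiring one of the configs above into the engine. Recent vLLM releases accept a speculative_config dict on the LLM constructor, but exact keys can shift between versions, so verify against your installed release:

```python
from vllm import LLM, SamplingParams

# N-gram speculation needs no draft model; config keys as in the example above.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2,
    },
)
outputs = llm.generate(
    ["Draft a short product announcement."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```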
Key Decisions
| Decision | Recommendation |
|---|---|
| Quantization | AWQ for 4-bit, FP8 for H100/H200 |
| Batch size | Dynamic via continuous batching |
| GPU memory | 0.85-0.95 utilization |
| Parallelism | Tensor parallel across GPUs |
| KV cache | Enable prefix caching for shared contexts |
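Taken together, the table maps onto an engine configuration roughly like the following sketch; the model name and parallelism degree are illustrative, and quantization="awq" assumes an AWQ checkpoint:

```python
from vllm import LLM

llm = LLM(
    model="your-org/Meta-Llama-3.1-70B-Instruct-AWQ",  # illustrative AWQ export
    quantization="awq",             # AWQ for 4-bit (or "fp8" on H100/H200)
    tensor_parallel_size=4,         # tensor parallel across GPUs
    gpu_memory_utilization=0.9,     # within the 0.85-0.95 band
    enable_prefix_caching=True,     # reuse KV cache for shared contexts
    max_model_len=8192,
)
```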
Common Mistakes
- Using GPTQ without calibration data (poor quality)
- Over-allocating GPU memory (OOM on peak loads)
- Ignoring warmup requests (cold start latency; see the warmup sketch after this list)
- Not benchmarking actual workload patterns
- Mixing quantization with incompatible features
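A minimal warmup sketch against the OpenAI-compatible server, using the same endpoint and request shape as the client example earlier; the warmup prompt and request count are arbitrary:

```python
# Send a few short requests before serving or benchmarking real traffic so that
# CUDA graphs, caches, and lazy initialization are already warm.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for _ in range(3):
    client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # must match the served model
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=8,
    )
```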
Performance Benchmarking
from vllm import LLM, SamplingParams
import time

def benchmark_throughput(llm, prompts, sampling_params, num_runs=3):
    """Benchmark tokens per second."""
    total_tokens = 0
    total_time = 0.0
    for _ in range(num_runs):
        start = time.perf_counter()
        outputs = llm.generate(prompts, sampling_params)
        elapsed = time.perf_counter() - start
        tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        total_tokens += tokens
        total_time += elapsed
    return total_tokens / total_time  # tokens/sec
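One way to call the helper above; the model, prompts, and run count are illustrative:

```python
# Smoke-test run of the benchmark helper defined above.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Explain continuous batching."] * 32  # small synthetic batch

tps = benchmark_throughput(llm, prompts, sampling_params, num_runs=3)
print(f"Throughput: {tps:.1f} tokens/sec")
```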
Advanced Patterns
See references/ for:
- vLLM Deployment: PagedAttention, batching, production config
- Quantization Guide: AWQ, GPTQ, INT8, FP8 comparison
- Speculative Decoding: Draft models, n-gram, throughput tuning
- Edge Deployment: Mobile, resource-constrained optimization
Related Skills
- llm-streaming - Streaming token responses
- function-calling - Tool use with inference
- ollama-local - Local inference with Ollama
- prompt-caching - Reduce redundant computation
- semantic-caching - Cache full responses
Capability Details
vllm-deployment
Keywords: vllm, inference server, deploy, serve, production
Solves:
- Deploy LLMs with vLLM for production
- Configure tensor parallelism and batching
- Optimize GPU memory utilization
quantization
Keywords: quantize, AWQ, GPTQ, INT8, FP8, compress, reduce memory
Solves:
- Reduce model memory footprint
- Choose appropriate quantization method
- Maintain quality with lower precision
speculative-decoding
Keywords: speculative, draft model, faster generation, predict tokens
Solves:
- Accelerate autoregressive generation
- Configure draft models or n-gram speculation
- Tune speculative token count
edge-inference
Keywords: edge, mobile, embedded, constrained, optimization
Solves:
- Deploy on resource-constrained devices
- Optimize for mobile/edge hardware
- Balance quality and resource usage
throughput-optimization
Keywords: throughput, latency, performance, benchmark, optimize
Solves:
- Maximize requests per second
- Reduce time to first token
- Benchmark and tune performance
