
high-performance-inference

by yonatangross

The Complete AI Development Toolkit for Claude Code — 159 skills, 34 agents, 20 commands, 144 hooks. Production-ready patterns for FastAPI, React 19, LangGraph, security, and testing.


SKILL.md


---
name: high-performance-inference
description: High-performance LLM inference with vLLM, quantization (AWQ, GPTQ, FP8), speculative decoding, and edge deployment. Use when optimizing inference latency, throughput, or memory.
version: 1.0.0
tags: [vllm, quantization, inference, performance, edge, speculative, 2026]
context: fork
agent: llm-integrator
author: OrchestKit
user-invocable: false
---

High-Performance Inference

Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.

vLLM 0.14.0 (Jan 2026): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.

Overview

Use this skill when:

  • Deploying LLMs with low latency requirements
  • Reducing GPU memory for larger models
  • Maximizing throughput for batch inference
  • Edge/mobile deployment with constrained resources
  • Cost optimization through efficient hardware utilization

Quick Reference

# Basic vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192

# With quantization + speculative decoding
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --quantization awq \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5}' \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9
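
Once the server is running, clients talk to its OpenAI-compatible endpoint. A minimal client sketch using the openai package, assuming the default port 8000 and no API key configured:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is a placeholder unless --api-key is set
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)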

vLLM 0.14.x Key Features

| Feature | Benefit |
|---|---|
| PagedAttention | Up to 24x throughput via efficient KV cache |
| Continuous Batching | Dynamic request batching for max utilization |
| CUDA Graphs | Fast model execution with graph capture |
| Tensor Parallelism | Scale across multiple GPUs |
| Prefix Caching | Reuse KV cache for shared prefixes |
| AttentionConfig | New API replacing VLLM_ATTENTION_BACKEND env |
| Semantic Router | vLLM SR v0.1 "Iris" for intelligent LLM routing |

Python vLLM Integration

from vllm import LLM, SamplingParams

# Initialize with optimization flags
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

# Generate
prompts = ["Explain PagedAttention in one sentence."]  # example prompts
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Quantization Methods

| Method | Bits | Memory Savings | Speed | Quality |
|---|---|---|---|---|
| FP16 | 16 | Baseline | Baseline | Best |
| INT8 | 8 | 50% | +10-20% | Very Good |
| AWQ | 4 | 75% | +20-40% | Good |
| GPTQ | 4 | 75% | +15-30% | Good |
| FP8 | 8 | 50% | +30-50% | Very Good |

When to Use Each:

  • FP16: Maximum quality, sufficient memory
  • INT8/FP8: Balance of quality and efficiency
  • AWQ: Best 4-bit quality, activation-aware
  • GPTQ: Faster quantization, good quality
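
If a pre-quantized checkpoint is not available, a 4-bit AWQ model can be produced offline and then served with --quantization awq. The sketch below uses the third-party AutoAWQ library (not part of vLLM); the output path and quant_config values are illustrative assumptions:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-instruct-awq"  # hypothetical output directory

# Typical AutoAWQ settings: 4-bit weights, group size 128
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs calibration internally
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

The resulting directory can then be passed to vllm serve in place of the original model name.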

Speculative Decoding

Accelerate generation by having a cheap predictor propose several tokens ahead, which the target model then verifies in a single forward pass:

# N-gram based (no extra model)
speculative_config = {
    "method": "ngram",
    "num_speculative_tokens": 5,
    "prompt_lookup_max": 5,
    "prompt_lookup_min": 2,
}

# Draft model (higher quality)
speculative_config = {
    "method": "draft_model",
    "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 3,
}

Expected Gains: 1.5-2.5x throughput for autoregressive tasks.
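
Offline, the same dicts can be passed to the LLM constructor, mirroring the --speculative-config flag from the Quick Reference. A sketch, assuming your vLLM version accepts the speculative_config parameter shown here:

from vllm import LLM, SamplingParams

# N-gram speculation needs no extra model; proposed tokens are verified by the target model
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2,
    },
)

outputs = llm.generate(
    ["List three benefits of speculative decoding."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)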

Key Decisions

| Decision | Recommendation |
|---|---|
| Quantization | AWQ for 4-bit, FP8 for H100/H200 |
| Batch size | Dynamic via continuous batching |
| GPU memory | 0.85-0.95 utilization |
| Parallelism | Tensor parallel across GPUs |
| KV cache | Enable prefix caching for shared contexts |

Common Mistakes

  • Using GPTQ without calibration data (poor quality)
  • Over-allocating GPU memory (OOM on peak loads)
  • Ignoring warmup requests (cold start latency; see the warmup sketch after this list)
  • Not benchmarking actual workload patterns
  • Mixing quantization with incompatible features
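
For the warmup point above, a few throwaway generations after engine startup let CUDA graphs and caches settle before real traffic arrives. A minimal sketch; the prompt text and request count are arbitrary:

from vllm import LLM, SamplingParams

def warm_up(llm: LLM, num_requests: int = 4) -> None:
    """Issue throwaway generations so the first real requests avoid cold-start latency."""
    params = SamplingParams(temperature=0.0, max_tokens=16)
    llm.generate(["Warmup request."] * num_requests, params)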

Performance Benchmarking

from vllm import LLM, SamplingParams
import time

def benchmark_throughput(llm, prompts, sampling_params, num_runs=3):
    """Benchmark tokens per second."""
    total_tokens = 0
    total_time = 0

    for _ in range(num_runs):
        start = time.perf_counter()
        outputs = llm.generate(prompts, sampling_params)
        elapsed = time.perf_counter() - start

        tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        total_tokens += tokens
        total_time += elapsed

    return total_tokens / total_time  # tokens/sec
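
For example, the helper can be driven like this; the model name and prompt set are placeholders, and each configuration you compare should run in its own process so GPU memory is fully released between runs:

from vllm import LLM, SamplingParams

prompts = ["Explain continuous batching in vLLM."] * 32  # placeholder workload
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", enable_prefix_caching=True)
tps = benchmark_throughput(llm, prompts, sampling_params)
print(f"Throughput: {tps:.1f} tokens/sec")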

Advanced Patterns

See references/ for:

  • vLLM Deployment: PagedAttention, batching, production config
  • Quantization Guide: AWQ, GPTQ, INT8, FP8 comparison
  • Speculative Decoding: Draft models, n-gram, throughput tuning
  • Edge Deployment: Mobile, resource-constrained optimization

Related skills:

  • llm-streaming: Streaming token responses
  • function-calling: Tool use with inference
  • ollama-local: Local inference with Ollama
  • prompt-caching: Reduce redundant computation
  • semantic-caching: Cache full responses

Capability Details

vllm-deployment

Keywords: vllm, inference server, deploy, serve, production
Solves:

  • Deploy LLMs with vLLM for production
  • Configure tensor parallelism and batching
  • Optimize GPU memory utilization

quantization

Keywords: quantize, AWQ, GPTQ, INT8, FP8, compress, reduce memory
Solves:

  • Reduce model memory footprint
  • Choose appropriate quantization method
  • Maintain quality with lower precision

speculative-decoding

Keywords: speculative, draft model, faster generation, predict tokens
Solves:

  • Accelerate autoregressive generation
  • Configure draft models or n-gram speculation
  • Tune speculative token count

edge-inference

Keywords: edge, mobile, embedded, constrained, optimization
Solves:

  • Deploy on resource-constrained devices
  • Optimize for mobile/edge hardware
  • Balance quality and resource usage

throughput-optimization

Keywords: throughput, latency, performance, benchmark, optimize
Solves:

  • Maximize requests per second
  • Reduce time to first token
  • Benchmark and tune performance

