
---
name: high-performance-inference
description: High-performance LLM inference with vLLM, quantization (AWQ, GPTQ, FP8), speculative decoding, and edge deployment. Use when optimizing inference latency, throughput, or memory.
version: 1.0.0
tags: [vllm, quantization, inference, performance, edge, speculative, 2026]
context: fork
agent: llm-integrator
author: OrchestKit
user-invocable: false
---
High-Performance Inference
Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.
vLLM 0.14.0 (Jan 2026): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.
Overview
Use this skill when:
- Deploying LLMs with low latency requirements
- Reducing GPU memory for larger models
- Maximizing throughput for batch inference
- Edge/mobile deployment with constrained resources
- Cost optimization through efficient hardware utilization
Quick Reference
# Basic vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192
# With quantization + speculative decoding
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--quantization awq \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 5}' \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9
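Once the server is up, it exposes an OpenAI-compatible API. A minimal client sketch, assuming the openai Python package and the default local endpoint on port 8000:

```python
# Sketch: query a vLLM server through its OpenAI-compatible endpoint.
# Assumes the serve command above is running locally on the default port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key unless --api-key is set

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```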
vLLM 0.14.x Key Features
| Feature | Benefit |
|---|---|
| PagedAttention | Up to 24x throughput via efficient KV cache |
| Continuous Batching | Dynamic request batching for max utilization |
| CUDA Graphs | Fast model execution with graph capture |
| Tensor Parallelism | Scale across multiple GPUs |
| Prefix Caching | Reuse KV cache for shared prefixes |
| AttentionConfig | New API replacing the VLLM_ATTENTION_BACKEND environment variable |
| Semantic Router | vLLM SR v0.1 "Iris" for intelligent LLM routing |
Python vLLM Integration
from vllm import LLM, SamplingParams

# Initialize with optimization flags
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="awq",  # requires an AWQ-quantized checkpoint; point at an AWQ export of the model
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

# Generate
prompts = ["Explain PagedAttention in one paragraph."]  # example prompt
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
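The enable_prefix_caching flag above pays off when many requests share a long identical prefix, such as a fixed system preamble. A minimal sketch of that pattern, with the preamble and questions purely illustrative:

```python
# Sketch: shared prefix + varying questions lets vLLM reuse the prefix's KV cache.
# Reuse happens automatically when enable_prefix_caching=True and prompts share
# a long identical prefix; the text below is illustrative.
preamble = "You are a support assistant for ACME Corp. Answer using company policy.\n\n"

questions = [
    "How do I reset my password?",
    "What is the refund window?",
    "How do I reach a human agent?",
]

prompts = [preamble + q for q in questions]
outputs = llm.generate(prompts, sampling_params)  # llm and sampling_params from the block above
for output in outputs:
    print(output.outputs[0].text)
```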
Quantization Methods
| Method | Bits | Memory Savings | Speed | Quality |
|---|---|---|---|---|
| FP16 | 16 | Baseline | Baseline | Best |
| INT8 | 8 | 50% | +10-20% | Very Good |
| AWQ | 4 | 75% | +20-40% | Good |
| GPTQ | 4 | 75% | +15-30% | Good |
| FP8 | 8 | 50% | +30-50% | Very Good |
When to Use Each:
- FP16: Maximum quality when memory is sufficient
- INT8/FP8: Balance of quality and efficiency
- AWQ: Best 4-bit quality, activation-aware
- GPTQ: Faster quantization, good quality
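In vLLM the method is selected via the quantization argument (or --quantization on the CLI). A hedged sketch under the assumption that AWQ/GPTQ need a pre-quantized checkpoint, while FP8 can be applied dynamically to an unquantized model on H100/H200-class GPUs; the repository names are illustrative and the two engines are alternatives, not meant to coexist in one process:

```python
from vllm import LLM

# AWQ/GPTQ: load a checkpoint that was already quantized with that method.
awq_llm = LLM(
    model="your-org/Meta-Llama-3.1-8B-Instruct-AWQ",  # illustrative AWQ export
    quantization="awq",
)

# FP8: on Hopper-class GPUs vLLM can apply dynamic FP8 quantization to a
# standard FP16/BF16 checkpoint.
fp8_llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="fp8",
)
```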
Speculative Decoding
Accelerate generation by predicting multiple tokens:
# N-gram based (no extra model)
speculative_config = {
"method": "ngram",
"num_speculative_tokens": 5,
"prompt_lookup_max": 5,
"prompt_lookup_min": 2,
}
# Draft model (higher quality)
speculative_config = {
"method": "draft_model",
"draft_model": "meta-llama/Llama-3.2-1B-Instruct",
"num_speculative_tokens": 3,
}
Expected Gains: 1.5-2.5x throughput for autoregressive tasks.
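A sketch of wiring one of the configs above into the engine. Recent vLLM releases accept a speculative_config dict on the LLM constructor, but exact keys can shift between versions, so verify against your installed release:

```python
from vllm import LLM, SamplingParams

# N-gram speculation needs no draft model; config keys as in the example above.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2,
    },
)
outputs = llm.generate(
    ["Draft a short product announcement."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```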
Key Decisions
| Decision | Recommendation |
|---|---|
| Quantization | AWQ for 4-bit, FP8 for H100/H200 |
| Batch size | Dynamic via continuous batching |
| GPU memory | 0.85-0.95 utilization |
| Parallelism | Tensor parallel across GPUs |
| KV cache | Enable prefix caching for shared contexts |
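Taken together, the table maps onto an engine configuration roughly like the following sketch; the model name and parallelism degree are illustrative, and quantization="awq" assumes an AWQ checkpoint:

```python
from vllm import LLM

llm = LLM(
    model="your-org/Meta-Llama-3.1-70B-Instruct-AWQ",  # illustrative AWQ export
    quantization="awq",             # AWQ for 4-bit (or "fp8" on H100/H200)
    tensor_parallel_size=4,         # tensor parallel across GPUs
    gpu_memory_utilization=0.9,     # within the 0.85-0.95 band
    enable_prefix_caching=True,     # reuse KV cache for shared contexts
    max_model_len=8192,
)
```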
Common Mistakes
- Using GPTQ without calibration data (poor quality)
- Over-allocating GPU memory (OOM on peak loads)
- Ignoring warmup requests (cold start latency; see the warmup sketch after this list)
- Not benchmarking actual workload patterns
- Mixing quantization with incompatible features
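A minimal warmup sketch against the OpenAI-compatible server, using the same endpoint and request shape as the client example earlier; the warmup prompt and request count are arbitrary:

```python
# Send a few short requests before serving or benchmarking real traffic so that
# CUDA graphs, caches, and lazy initialization are already warm.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for _ in range(3):
    client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # must match the served model
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=8,
    )
```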
Performance Benchmarking
from vllm import LLM, SamplingParams
import time

def benchmark_throughput(llm, prompts, sampling_params, num_runs=3):
    """Benchmark tokens per second."""
    total_tokens = 0
    total_time = 0.0
    for _ in range(num_runs):
        start = time.perf_counter()
        outputs = llm.generate(prompts, sampling_params)
        elapsed = time.perf_counter() - start
        tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        total_tokens += tokens
        total_time += elapsed
    return total_tokens / total_time  # tokens/sec
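One way to call the helper above; the model, prompts, and run count are illustrative:

```python
# Smoke-test run of the benchmark helper defined above.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Explain continuous batching."] * 32  # small synthetic batch

tps = benchmark_throughput(llm, prompts, sampling_params, num_runs=3)
print(f"Throughput: {tps:.1f} tokens/sec")
```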
Advanced Patterns
See references/ for:
- vLLM Deployment: PagedAttention, batching, production config
- Quantization Guide: AWQ, GPTQ, INT8, FP8 comparison
- Speculative Decoding: Draft models, n-gram, throughput tuning
- Edge Deployment: Mobile, resource-constrained optimization
Related Skills
- llm-streaming - Streaming token responses
- function-calling - Tool use with inference
- ollama-local - Local inference with Ollama
- prompt-caching - Reduce redundant computation
- semantic-caching - Cache full responses
Capability Details
vllm-deployment
Keywords: vllm, inference server, deploy, serve, production
Solves:
- Deploy LLMs with vLLM for production
- Configure tensor parallelism and batching
- Optimize GPU memory utilization
quantization
Keywords: quantize, AWQ, GPTQ, INT8, FP8, compress, reduce memory
Solves:
- Reduce model memory footprint
- Choose appropriate quantization method
- Maintain quality with lower precision
speculative-decoding
Keywords: speculative, draft model, faster generation, predict tokens
Solves:
- Accelerate autoregressive generation
- Configure draft models or n-gram speculation
- Tune speculative token count
edge-inference
Keywords: edge, mobile, embedded, constrained, optimization
Solves:
- Deploy on resource-constrained devices
- Optimize for mobile/edge hardware
- Balance quality and resource usage
throughput-optimization
Keywords: throughput, latency, performance, benchmark, optimize
Solves:
- Maximize requests per second
- Reduce time to first token
- Benchmark and tune performance
