
mlx
by itsmostafa
SKILL.md
---
name: mlx
description: Running and fine-tuning LLMs on Apple Silicon with MLX. Use when working with models locally on Mac, converting Hugging Face models to MLX format, fine-tuning with LoRA/QLoRA on Apple Silicon, or serving models via HTTP API.
---
Using MLX for LLMs on Apple Silicon
MLX-LM is a Python package for running large language models on Apple Silicon, leveraging the MLX framework and Apple's unified memory architecture for optimized on-device performance.
Table of Contents
- Core Concepts
- Installation
- Text Generation
- Interactive Chat
- Model Conversion
- Quantization
- Fine-tuning with LoRA
- Serving Models
- Best Practices
- References
Core Concepts
Why MLX
| Aspect | PyTorch on Mac | MLX |
|---|---|---|
| Memory | Separate CPU/GPU copies | Unified memory, no copies |
| Optimization | Generic Metal backend | Apple Silicon native |
| Model loading | Slower, more memory | Lazy loading, efficient |
| Quantization | Limited support | Built-in 4/8-bit |
MLX arrays live in shared memory, accessible by both CPU and GPU without data transfer overhead.
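As a minimal sketch of what that means in practice (using only mlx.core, which mlx-lm installs as a dependency), the same array can feed computations on either device without an explicit copy:
import mlx.core as mx

a = mx.random.normal((1024, 1024))   # allocated once in unified memory
b = mx.matmul(a, a)                  # runs on the default device (the GPU)
c = mx.matmul(a, a, stream=mx.cpu)   # same array, CPU stream, no transfer
mx.eval(b, c)                        # force the lazy computations to run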
Supported Models
MLX-LM supports most popular architectures: Llama, Mistral, Qwen, Phi, Gemma, Cohere, and many more. Check the mlx-community organization on Hugging Face for pre-converted models.
Installation
pip install mlx-lm
Requires macOS 13.5+ and Apple Silicon (M1/M2/M3/M4).
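A quick sanity check after installing (the exact device string may vary by MLX version):
python -c "import mlx.core as mx; print(mx.default_device())"
# Should report the GPU, e.g. Device(gpu, 0)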
Text Generation
Python API
from mlx_lm import load, generate
# Load model (from HF hub or local path)
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
# Generate text
response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms:",
    max_tokens=256,
    temp=0.7,
)
print(response)
Streaming Generation
from mlx_lm import load, stream_generate
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
prompt = "Write a haiku about programming:"
for response in stream_generate(model, tokenizer, prompt, max_tokens=100):
    print(response.text, end="", flush=True)
print()
Batch Generation
from mlx_lm import load, batch_generate
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
prompts = [
    "What is machine learning?",
    "Explain neural networks:",
    "Define deep learning:",
]
responses = batch_generate(
    model,
    tokenizer,
    prompts,
    max_tokens=100,
)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n")
CLI Generation
# Basic generation
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "Explain recursion:" \
  --max-tokens 256
# With sampling parameters
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --prompt "Write a poem about AI:" \
  --temp 0.8 \
  --top-p 0.95
Interactive Chat
CLI Chat
# Start chat REPL (context preserved between turns)
mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit
Python Chat
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
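For multi-turn conversations in Python, context is preserved by appending each assistant reply to messages before the next turn. A minimal sketch building on the snippet above (the hard-coded user turns are illustrative):
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_input in ["What's the capital of France?", "And its population?"]:
    messages.append({"role": "user", "content": user_input})
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    reply = generate(model, tokenizer, prompt=prompt, max_tokens=256)
    messages.append({"role": "assistant", "content": reply})
    print(reply)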
Model Conversion
Convert Hugging Face models to MLX format:
CLI Conversion
# Convert with 4-bit quantization
mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct \
  -q  # Quantize to 4-bit
# With specific quantization
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
  -q \
  --q-bits 8 \
  --q-group-size 64
# Upload to Hugging Face Hub
mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct \
  -q \
  --upload-repo your-username/Llama-3.2-1B-Instruct-4bit-mlx
Python Conversion
from mlx_lm import convert
convert(
    hf_path="meta-llama/Llama-3.2-3B-Instruct",
    mlx_path="./llama-3.2-3b-mlx",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
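The output directory can then be loaded like any Hub model (path as used in the conversion above):
from mlx_lm import load, generate

model, tokenizer = load("./llama-3.2-3b-mlx")
print(generate(model, tokenizer, prompt="Hello!", max_tokens=32))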
Conversion Options
| Option | Default | Description |
|---|---|---|
| --q-bits | 4 | Quantization bits (4 or 8) |
| --q-group-size | 64 | Group size for quantization |
| --dtype | float16 | Data type for non-quantized weights |
Quantization
MLX supports multiple quantization methods for different use cases:
| Method | Best For | Command |
|---|---|---|
| Basic | Quick conversion | mlx_lm.convert -q |
| DWQ | Quality-preserving | mlx_lm.dwq |
| AWQ | Activation-aware | mlx_lm.awq |
| Dynamic | Per-layer precision | mlx_lm.dynamic_quant |
| GPTQ | Established method | mlx_lm.gptq |
Quick Quantization
# 4-bit quantization during conversion
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q
# 8-bit for higher quality
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q --q-bits 8
For detailed coverage of each method, see reference/quantization.md.
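As a rough aid for choosing a bit width, here is a back-of-envelope estimate of quantized weight size (a sketch: it ignores embeddings, unquantized layers, and the KV cache, so real footprints are somewhat larger):
def approx_weight_gb(n_params_b: float, bits: int, group_size: int = 64) -> float:
    # Each group of `group_size` weights also stores an fp16 scale and bias,
    # which adds roughly 32 extra bits per group.
    bits_per_weight = bits + 32 / group_size
    return n_params_b * bits_per_weight / 8  # billions of params -> GB

print(f"7B @ 4-bit: ~{approx_weight_gb(7, 4):.1f} GB")  # ~3.9 GB
print(f"7B @ 8-bit: ~{approx_weight_gb(7, 8):.1f} GB")  # ~7.4 GB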
Fine-tuning with LoRA
MLX supports LoRA and QLoRA fine-tuning for efficient adaptation on Apple Silicon.
Quick Start
# Prepare training data (JSONL format)
# {"text": "Your training text here"}
# or
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
# Fine-tune with LoRA
mlx_lm.lora --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --train \
  --data ./data \
  --iters 1000
# Generate with adapter
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path ./adapters \
  --prompt "Your prompt here"
Fuse Adapter into Model
# Merge LoRA weights into base model
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path ./adapters \
  --save-path ./fused-model
# Or export to GGUF
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path ./adapters \
  --export-gguf
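The fused directory behaves like any other MLX model, so it can be used directly for generation or serving (paths follow the commands above):
mlx_lm.generate --model ./fused-model --prompt "Your prompt here"
mlx_lm.server --model ./fused-model --port 8080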
For detailed LoRA configuration and training patterns, see reference/fine-tuning.md.
Serving Models
OpenAI-Compatible Server
# Start server
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080
# Query the OpenAI-compatible endpoint with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'
Python Client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain MLX in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
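Streaming works through the same endpoint; a sketch assuming the server started above (streaming behavior may vary by mlx-lm version):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain MLX in one sentence."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()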
Best Practices
- Use pre-quantized models: Download from mlx-community on Hugging Face for immediate use
- Match quantization to your hardware: on an 8GB M1/M2, use 4-bit; on M2/M3 Pro/Max, use 8-bit for quality
- Leverage unified memory: Unlike CUDA, MLX models can exceed "GPU memory" by using swap (slower, but it works)
- Use streaming for UX: stream_generate provides responsive output for interactive applications
- Cache prompt prefixes: Use mlx_lm.cache_prompt for repeated prompts with varying suffixes (see the sketch after this list)
- Batch similar requests: batch_generate is more efficient than sequential generation
- Start with 4-bit quantization: Good quality/size tradeoff; upgrade to 8-bit if quality issues appear
- Fuse adapters for deployment: After fine-tuning, fuse adapters for faster inference without loading them separately
- Monitor memory with Activity Monitor: Watch memory pressure to avoid swap thrashing
- Use chat templates: Always apply tokenizer.apply_chat_template() for instruction-tuned models
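A sketch of the prompt-caching flow mentioned above (flag names follow recent mlx-lm releases; verify with mlx_lm.cache_prompt --help, and long_context.txt is an illustrative file):
# Cache a long shared prefix once
mlx_lm.cache_prompt --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "$(cat long_context.txt)" \
  --prompt-cache-file context.safetensors

# Reuse the cache with different suffixes
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt-cache-file context.safetensors \
  --prompt "Summarize the key points."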
References
See reference/ for detailed documentation:
- quantization.md - Detailed quantization methods and when to use each
- fine-tuning.md - Complete LoRA/QLoRA training guide with data formats and configuration


