
multimodal-rag
by yonatangross
The Complete AI Development Toolkit for Claude Code — 159 skills, 34 agents, 20 commands, 144 hooks. Production-ready patterns for FastAPI, React 19, LangGraph, security, and testing.
⭐ 29 · 🍴 4 · 📅 Jan 23, 2026
SKILL.md
---
name: multimodal-rag
description: CLIP, SigLIP 2, Voyage multimodal-3 patterns for image+text retrieval, cross-modal search, and multimodal document chunking. Use when building RAG with images, implementing visual search, or hybrid retrieval.
context: fork
agent: multimodal-specialist
version: 1.0.0
author: OrchestKit
user-invocable: false
tags: [rag, multimodal, image-retrieval, clip, embeddings, vector-search, 2026]
---
Multimodal RAG (2026)
Build retrieval-augmented generation systems that handle images, text, and mixed content.
Overview
- Image + text retrieval (product search, documentation)
- Cross-modal search (text query -> image results)
- Multimodal document processing (PDFs with charts)
- Visual question answering with context
- Image similarity and deduplication
- Hybrid search pipelines
Architecture Approaches
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Joint Embedding (CLIP) | Direct comparison | Limited context | Pure image search |
| Caption-based | Works with text LLMs | Lossy conversion | Existing text RAG |
| Hybrid | Best accuracy | More complex | Production systems |
Embedding Models (2026)
| Model | Context | Modalities | Best For |
|---|---|---|---|
| Voyage multimodal-3 | 32K tokens | Text + Image | Long documents |
| SigLIP 2 | Standard | Text + Image | Large-scale retrieval |
| CLIP ViT-L/14 | 77 tokens | Text + Image | General purpose |
| ImageBind | Standard | 6 modalities | Audio/video included |
| ColPali | Document | Text + Image | PDF/document RAG |
CLIP-Based Image Embeddings
```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> list[float]:
    """Generate CLIP embedding for an image."""
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)
    # Normalize for cosine similarity
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

def embed_text(text: str) -> list[float]:
    """Generate CLIP embedding for a text query."""
    # padding + truncation keep queries within CLIP's 77-token text limit
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        embeddings = model.get_text_features(**inputs)
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

# Cross-modal search: text -> images
def search_images(query: str, image_embeddings: list, top_k: int = 5):
    """Search images using a text query."""
    query_embedding = embed_text(query)
    # Cosine similarity reduces to a dot product because embeddings are normalized
    similarities = [
        np.dot(query_embedding, img_emb)
        for img_emb in image_embeddings
    ]
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return top_indices, [similarities[i] for i in top_indices]
```
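SigLIP 2 Embeddings (Large-Scale Retrieval)
The model table above recommends SigLIP 2 for large-scale retrieval. A minimal sketch follows, assuming a transformers release that ships SigLIP 2 support and the google/siglip2-base-patch16-224 checkpoint; like CLIP, the model exposes get_image_features / get_text_features, so the helpers mirror the ones above.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumes a recent transformers version with SigLIP 2 support (checkpoint name is an assumption)
siglip_model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
siglip_processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

def embed_image_siglip(image_path: str) -> list[float]:
    """SigLIP 2 image embedding, L2-normalized for cosine similarity."""
    inputs = siglip_processor(images=Image.open(image_path), return_tensors="pt")
    with torch.no_grad():
        features = siglip_model.get_image_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)
    return features[0].tolist()

def embed_text_siglip(text: str) -> list[float]:
    """SigLIP 2 text embedding; pad to the fixed text length the model was trained with."""
    inputs = siglip_processor(
        text=[text], return_tensors="pt", padding="max_length", truncation=True
    )
    with torch.no_grad():
        features = siglip_model.get_text_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)
    return features[0].tolist()
```
The same search_images helper works unchanged as long as queries and images come from the same model; never mix CLIP and SigLIP vectors in one index.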
Voyage Multimodal-3 (Long Context)
```python
from typing import Optional

import voyageai
from PIL import Image

client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def embed_multimodal_voyage(
    texts: Optional[list[str]] = None,
    images: Optional[list[str]] = None,  # local file paths
) -> list[list[float]]:
    """Embed text and/or images with a 32K token context."""
    # The Python client takes each input as a list mixing text strings and PIL images;
    # here every text and every image gets its own embedding.
    inputs = []
    if texts:
        inputs.extend([[t] for t in texts])
    if images:
        inputs.extend([[Image.open(path)] for path in images])
    response = client.multimodal_embed(
        inputs=inputs,
        model="voyage-multimodal-3"
    )
    return response.embeddings
```
Hybrid RAG Pipeline
```python
from typing import Optional

import numpy as np

class MultimodalRAG:
    """Production multimodal RAG with hybrid retrieval."""

    def __init__(self, vector_db, vision_model, text_model):
        self.vector_db = vector_db
        self.vision_model = vision_model
        self.text_model = text_model

    async def index_document(
        self,
        doc_id: str,
        text: Optional[str] = None,
        image_path: Optional[str] = None,
        metadata: dict = None
    ):
        """Index a document with text and/or image."""
        embeddings = []
        if text:
            text_emb = embed_text(text)
            embeddings.append(("text", text_emb))
        if image_path:
            # Option 1: Direct image embedding
            img_emb = embed_image(image_path)
            embeddings.append(("image", img_emb))
            # Option 2: Generate caption for text search
            caption = await self.generate_caption(image_path)
            caption_emb = embed_text(caption)
            embeddings.append(("caption", caption_emb))
        # Store with shared document ID
        for emb_type, emb in embeddings:
            await self.vector_db.upsert(
                id=f"{doc_id}_{emb_type}",
                embedding=emb,
                metadata={
                    "doc_id": doc_id,
                    "type": emb_type,
                    "image_url": image_path,
                    "text": text,
                    **(metadata or {})
                }
            )

    async def generate_caption(self, image_path: str) -> str:
        """Generate text caption for image indexing."""
        # Use GPT-4o or Claude for high-quality captions
        response = await self.vision_model.analyze(
            image_path,
            prompt="Describe this image in detail for search indexing. "
                   "Include objects, text, colors, and context."
        )
        return response

    async def retrieve(
        self,
        query: str,
        query_image: Optional[str] = None,
        top_k: int = 10
    ) -> list[dict]:
        """Hybrid retrieval with optional image query."""
        results = []
        # Text query embedding
        text_emb = embed_text(query)
        text_results = await self.vector_db.search(
            embedding=text_emb,
            top_k=top_k
        )
        results.extend(text_results)
        # Image query embedding (if provided)
        if query_image:
            img_emb = embed_image(query_image)
            img_results = await self.vector_db.search(
                embedding=img_emb,
                top_k=top_k
            )
            results.extend(img_results)
        # Dedupe by doc_id, keep highest score
        seen = {}
        for r in results:
            doc_id = r["metadata"]["doc_id"]
            if doc_id not in seen or r["score"] > seen[doc_id]["score"]:
                seen[doc_id] = r
        return sorted(seen.values(), key=lambda x: x["score"], reverse=True)[:top_k]
```
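The retrieve method above merges modalities by keeping the best raw score per document, which assumes text and image scores live on comparable scales. When they do not (for example CLIP cosine scores versus a separate text-embedding model's scores), reciprocal rank fusion is a common alternative. The sketch below is illustrative, not part of the original skill; it only assumes each result dict carries metadata.doc_id as in the class above.
```python
def reciprocal_rank_fusion(
    result_lists: list[list[dict]], k: int = 60, top_k: int = 10
) -> list[dict]:
    """Fuse ranked result lists by rank position rather than raw score (RRF)."""
    scores: dict[str, float] = {}
    best: dict[str, dict] = {}
    for results in result_lists:
        for rank, r in enumerate(results):
            doc_id = r["metadata"]["doc_id"]
            # Each list contributes 1 / (k + rank); lower ranks contribute more
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
            best.setdefault(doc_id, r)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [best[doc_id] for doc_id in ranked]
```
Call it with the per-modality lists, e.g. reciprocal_rank_fusion([text_results, img_results]), in place of the max-score dedup when score scales differ.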
Multimodal Document Chunking
```python
from dataclasses import dataclass
from typing import Literal, Optional

import fitz  # PyMuPDF

@dataclass
class Chunk:
    content: str
    chunk_type: Literal["text", "image", "table", "chart"]
    page: int
    image_path: Optional[str] = None
    embedding: Optional[list[float]] = None

def chunk_multimodal_document(pdf_path: str) -> list[Chunk]:
    """Chunk a PDF, preserving images and their position in the text flow."""
    doc = fitz.open(pdf_path)
    chunks = []
    for page_num, page in enumerate(doc):
        # "dict" output interleaves text blocks (type 0) and image blocks (type 1)
        blocks = page.get_text("dict")["blocks"]
        current_text = ""
        img_count = 0
        for block in blocks:
            if block["type"] == 0:  # Text block
                block_text = "".join(
                    span["text"] for line in block["lines"] for span in line["spans"]
                )
                current_text += block_text + "\n"
            else:  # Image block: raw bytes and file extension are included
                # Save the text accumulated so far as its own chunk
                if current_text.strip():
                    chunks.append(Chunk(
                        content=current_text.strip(),
                        chunk_type="text",
                        page=page_num
                    ))
                    current_text = ""
                # Extract and save the image
                img_count += 1
                img_path = f"/tmp/page{page_num}_img{img_count}.{block['ext']}"
                with open(img_path, "wb") as f:
                    f.write(block["image"])
                # Generate a caption so the image is searchable as text
                # (any VLM captioning call works, e.g. generate_caption above)
                caption = generate_image_caption(img_path)
                chunks.append(Chunk(
                    content=caption,
                    chunk_type="image",
                    page=page_num,
                    image_path=img_path
                ))
        # Final text chunk for the page
        if current_text.strip():
            chunks.append(Chunk(
                content=current_text.strip(),
                chunk_type="text",
                page=page_num
            ))
    return chunks
```
Vector Database Setup (Milvus)
```python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

def setup_multimodal_collection():
    """Create a Milvus collection for multimodal embeddings."""
    connections.connect("default", host="localhost", port="19530")
    fields = [
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=256),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=256),
        FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=32),
        FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="image_url", dtype=DataType.VARCHAR, max_length=1024),
        FieldSchema(name="page", dtype=DataType.INT64)
    ]
    schema = CollectionSchema(fields, "Multimodal document collection")
    collection = Collection("multimodal_docs", schema)
    # Create HNSW index for cosine-similarity vector search
    index_params = {
        "metric_type": "COSINE",
        "index_type": "HNSW",
        "params": {"M": 16, "efConstruction": 256}
    }
    collection.create_index("embedding", index_params)
    return collection
```
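Indexing and querying the collection then looks roughly like the sketch below. It assumes a pymilvus version that accepts row dictionaries in Collection.insert (older clients take column-ordered lists), and the ef search parameter is a starting value to tune, not a recommendation from the original skill.
```python
def index_chunk(collection: Collection, chunk_id: str, doc_id: str,
                chunk: Chunk, embedding: list[float]) -> None:
    """Insert one chunk as a row matching the schema above."""
    collection.insert([{
        "id": chunk_id,
        "embedding": embedding,
        "doc_id": doc_id,
        "chunk_type": chunk.chunk_type,
        "content": chunk.content,
        "image_url": chunk.image_path or "",
        "page": chunk.page,
    }])

def search_collection(collection: Collection, query_embedding: list[float], top_k: int = 10):
    """Cosine-similarity search over the HNSW index."""
    collection.load()
    hits = collection.search(
        data=[query_embedding],
        anns_field="embedding",
        param={"metric_type": "COSINE", "params": {"ef": 64}},
        limit=top_k,
        output_fields=["doc_id", "chunk_type", "content", "image_url", "page"],
    )
    return hits[0]  # results for the single query vector
```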
Multimodal Generation
```python
import base64
import mimetypes

import anthropic

client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Return (base64 data, media type) for an image file."""
    media_type = mimetypes.guess_type(image_path)[0] or "image/png"
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode(), media_type

async def generate_with_context(
    query: str,
    retrieved_chunks: list[Chunk],
    model: str = "claude-opus-4-5-20251124"
) -> str:
    """Generate a response using multimodal context."""
    content = []
    # Add retrieved images first (attention positioning)
    for chunk in retrieved_chunks:
        if chunk.chunk_type == "image" and chunk.image_path:
            base64_data, media_type = encode_image_base64(chunk.image_path)
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64_data
                }
            })
    # Add text context
    text_context = "\n\n".join(
        f"[Page {c.page}]: {c.content}"
        for c in retrieved_chunks if c.chunk_type == "text"
    )
    content.append({
        "type": "text",
        "text": f"""Use the following context to answer the question.

Context:
{text_context}

Question: {query}

Provide a detailed answer based on the context and images provided."""
    })
    response = await client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
```
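Tying retrieval and generation together, an end-to-end call looks roughly like the sketch below. It assumes the metadata keys written by MultimodalRAG.index_document above (text, image_url); page is not stored by that method, so it defaults to 0 here.
```python
async def answer(question: str, rag: MultimodalRAG) -> str:
    """Retrieve hybrid context, rebuild Chunk objects, and generate an answer."""
    results = await rag.retrieve(question, top_k=8)
    chunks = []
    for r in results:
        meta = r["metadata"]  # keys as stored by MultimodalRAG.index_document
        if meta.get("image_url"):
            chunks.append(Chunk(content=meta.get("text") or "", chunk_type="image",
                                page=meta.get("page", 0), image_path=meta["image_url"]))
        else:
            chunks.append(Chunk(content=meta.get("text") or "", chunk_type="text",
                                page=meta.get("page", 0)))
    return await generate_with_context(question, chunks)
```
Once documents are indexed, usage is a single call, e.g. await answer("What does the chart on page 3 show?", rag).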
Key Decisions
| Decision | Recommendation |
|---|---|
| Long documents | Voyage multimodal-3 (32K context) |
| Scale retrieval | SigLIP 2 (optimized for large-scale) |
| PDF processing | ColPali (document-native) |
| Multi-modal search | Hybrid: CLIP + text embeddings |
| Production DB | Milvus or Pinecone with hybrid |
Common Mistakes
- Embedding images without captions (limits text search)
- Not deduplicating by document ID
- Missing image URL storage (can't display results)
- Using only image OR text embeddings (use both)
- Ignoring chunk boundaries (split mid-paragraph)
- Not validating image retrieval quality (see the validation sketch below)
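A lightweight way to address the last point is a minimum-similarity floor plus a tiny labeled retrieval smoke test. This sketch is illustrative, not part of the original skill: the 0.2 floor and the labeled_pairs fixture are assumptions to tune per embedding model and dataset, and rag is the MultimodalRAG instance from above.
```python
MIN_IMAGE_SCORE = 0.2  # illustrative floor; tune per embedding model and dataset

def filter_low_confidence(results: list[dict], floor: float = MIN_IMAGE_SCORE) -> list[dict]:
    """Drop hits whose cosine score falls below the floor instead of surfacing noise."""
    return [r for r in results if r["score"] >= floor]

async def recall_at_k(rag, labeled_pairs: list[tuple[str, str]], k: int = 5) -> float:
    """Retrieval smoke test: fraction of (query, expected doc_id) pairs found in the top k."""
    hits = 0
    for query, expected_doc_id in labeled_pairs:
        results = await rag.retrieve(query, top_k=k)
        hits += int(any(r["metadata"]["doc_id"] == expected_doc_id for r in results))
    return hits / len(labeled_pairs)
```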
Related Skills
- vision-language-models - Image analysis
- embeddings - Text embedding patterns
- rag-retrieval - Text RAG patterns
- contextual-retrieval - Hybrid BM25+vector
Capability Details
image-embeddings
Keywords: CLIP, image embedding, visual features, SigLIP
Solves:
- Convert images to vector representations
- Enable image similarity search
- Cross-modal retrieval
cross-modal-search
Keywords: text to image, image to text, cross-modal
Solves:
- Find images from text queries
- Find text from image queries
- Bridge modalities
multimodal-chunking
Keywords: chunk PDF, split document, extract images
Solves:
- Process documents with mixed content
- Preserve image-text relationships
- Handle tables and charts
hybrid-retrieval
Keywords: hybrid search, fusion, multi-embedding
Solves:
- Combine text and image search
- Improve retrieval accuracy
- Handle diverse queries
Score
Total Score: 75/100
Based on repository quality metrics
- ✓ SKILL.md: includes a SKILL.md file (+20)
- ✓ LICENSE: a license is set (+10)
- ✓ Description: description is at least 100 characters (+10)
- ○ Popularity: 100+ GitHub stars (0/15)
- ✓ Recent activity: updated within the last month (+10)
- ○ Forks: forked at least 10 times (0/5)
- ✓ Issue management: fewer than 50 open issues (+5)
- ✓ Language: a programming language is set (+5)
- ✓ Tags: at least one tag is set (+5)