
multimodal-rag

by yonatangross

The Complete AI Development Toolkit for Claude Code — 159 skills, 34 agents, 20 commands, 144 hooks. Production-ready patterns for FastAPI, React 19, LangGraph, security, and testing.


SKILL.md


name: multimodal-rag
description: CLIP, SigLIP 2, Voyage multimodal-3 patterns for image+text retrieval, cross-modal search, and multimodal document chunking. Use when building RAG with images, implementing visual search, or hybrid retrieval.
context: fork
agent: multimodal-specialist
version: 1.0.0
author: OrchestKit
user-invocable: false
tags: [rag, multimodal, image-retrieval, clip, embeddings, vector-search, 2026]

Multimodal RAG (2026)

Build retrieval-augmented generation systems that handle images, text, and mixed content.

Overview

  • Image + text retrieval (product search, documentation)
  • Cross-modal search (text query -> image results)
  • Multimodal document processing (PDFs with charts)
  • Visual question answering with context
  • Image similarity and deduplication
  • Hybrid search pipelines

Architecture Approaches

| Approach | Pros | Cons | Best For |
|----------|------|------|----------|
| Joint Embedding (CLIP) | Direct comparison | Limited context | Pure image search |
| Caption-based | Works with text LLMs | Lossy conversion | Existing text RAG |
| Hybrid | Best accuracy | More complex | Production systems |

Embedding Models (2026)

| Model | Context | Modalities | Best For |
|-------|---------|------------|----------|
| Voyage multimodal-3 | 32K tokens | Text + Image | Long documents |
| SigLIP 2 | Standard | Text + Image | Large-scale retrieval |
| CLIP ViT-L/14 | 77 tokens | Text + Image | General purpose |
| ImageBind | Standard | 6 modalities | Audio/video included |
| ColPali | Document | Text + Image | PDF/document RAG |
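
SigLIP 2 ships as standard Hugging Face checkpoints, so it drops into the same embed-and-normalize flow shown for CLIP below. A minimal sketch, assuming the google/siglip2-base-patch16-224 checkpoint id (pick the SigLIP 2 variant that fits your latency budget):

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Checkpoint id is an assumption; larger/higher-resolution variants trade speed for recall
siglip = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
siglip_processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

def embed_image_siglip(image_path: str) -> list[float]:
    """SigLIP 2 image embedding, L2-normalized for cosine search."""
    inputs = siglip_processor(images=Image.open(image_path), return_tensors="pt")
    with torch.no_grad():
        feats = siglip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats[0].tolist()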

CLIP-Based Image Embeddings

import numpy as np
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> list[float]:
    """Generate CLIP embedding for an image."""
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)

    # Normalize for cosine similarity
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

def embed_text(text: str) -> list[float]:
    """Generate CLIP embedding for text query."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)

    with torch.no_grad():
        embeddings = model.get_text_features(**inputs)

    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

# Cross-modal search: text -> images
def search_images(query: str, image_embeddings: list, top_k: int = 5):
    """Search images using text query."""
    query_embedding = embed_text(query)

    # Compute similarities (cosine)
    similarities = [
        np.dot(query_embedding, img_emb)
        for img_emb in image_embeddings
    ]

    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return top_indices, [similarities[i] for i in top_indices]
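
A usage sketch for the functions above, indexing a small image folder in memory and running a text query (the folder path and query string are illustrative):

from pathlib import Path

# Embed every image in a folder (illustrative path)
image_paths = sorted(Path("./product_images").glob("*.jpg"))
image_embeddings = [embed_image(str(p)) for p in image_paths]

# Cross-modal query: text in, ranked images out
indices, scores = search_images("red running shoes", image_embeddings, top_k=3)
for idx, score in zip(indices, scores):
    print(f"{image_paths[idx].name}: {score:.3f}")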

Voyage Multimodal-3 (Long Context)

import base64

import voyageai

client = voyageai.Client()

def embed_multimodal_voyage(
    texts: list[str] | None = None,
    images: list[str] | None = None  # Local file paths
) -> list[list[float]]:
    """Embed text and/or images with 32K token context."""
    inputs = []

    if texts:
        inputs.extend([{"type": "text", "content": t} for t in texts])

    if images:
        for img_path in images:
            with open(img_path, "rb") as f:
                b64 = base64.b64encode(f.read()).decode()
            inputs.append({
                "type": "image",
                "content": f"data:image/png;base64,{b64}"
            })

    response = client.multimodal_embed(
        inputs=inputs,
        model="voyage-multimodal-3"
    )

    return response.embeddings
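
Calling the wrapper with a mixed batch (the paths are illustrative); one embedding comes back per input, in order, so keep track of which positions were text and which were images:

embeddings = embed_multimodal_voyage(
    texts=["Quarterly revenue grew 12% on strong product demand."],
    images=["./report/revenue_chart.png"]
)
# One vector per input: embeddings[0] is the text, embeddings[1] the chart image
print(len(embeddings), len(embeddings[0]))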

Hybrid RAG Pipeline

from typing import Optional
import numpy as np

class MultimodalRAG:
    """Production multimodal RAG with hybrid retrieval."""

    def __init__(self, vector_db, vision_model, text_model):
        self.vector_db = vector_db
        self.vision_model = vision_model
        self.text_model = text_model

    async def index_document(
        self,
        doc_id: str,
        text: Optional[str] = None,
        image_path: Optional[str] = None,
        metadata: Optional[dict] = None
    ):
        """Index a document with text and/or image."""
        embeddings = []

        if text:
            text_emb = embed_text(text)
            embeddings.append(("text", text_emb))

        if image_path:
            # Option 1: Direct image embedding
            img_emb = embed_image(image_path)
            embeddings.append(("image", img_emb))

            # Option 2: Generate caption for text search
            caption = await self.generate_caption(image_path)
            caption_emb = embed_text(caption)
            embeddings.append(("caption", caption_emb))

        # Store with shared document ID
        for emb_type, emb in embeddings:
            await self.vector_db.upsert(
                id=f"{doc_id}_{emb_type}",
                embedding=emb,
                metadata={
                    "doc_id": doc_id,
                    "type": emb_type,
                    "image_url": image_path,
                    "text": text,
                    **(metadata or {})
                }
            )

    async def generate_caption(self, image_path: str) -> str:
        """Generate text caption for image indexing."""
        # Use GPT-4o or Claude for high-quality captions
        response = await self.vision_model.analyze(
            image_path,
            prompt="Describe this image in detail for search indexing. "
                   "Include objects, text, colors, and context."
        )
        return response

    async def retrieve(
        self,
        query: str,
        query_image: Optional[str] = None,
        top_k: int = 10
    ) -> list[dict]:
        """Hybrid retrieval with optional image query."""
        results = []

        # Text query embedding
        text_emb = embed_text(query)
        text_results = await self.vector_db.search(
            embedding=text_emb,
            top_k=top_k
        )
        results.extend(text_results)

        # Image query embedding (if provided)
        if query_image:
            img_emb = embed_image(query_image)
            img_results = await self.vector_db.search(
                embedding=img_emb,
                top_k=top_k
            )
            results.extend(img_results)

        # Dedupe by doc_id, keep highest score
        seen = {}
        for r in results:
            doc_id = r["metadata"]["doc_id"]
            if doc_id not in seen or r["score"] > seen[doc_id]["score"]:
                seen[doc_id] = r

        return sorted(seen.values(), key=lambda x: x["score"], reverse=True)[:top_k]
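
A usage sketch, assuming vector_db, vision_model, and text_model are concrete clients wired up elsewhere (all names and values below are placeholders):

import asyncio

async def main():
    rag = MultimodalRAG(vector_db=vector_db, vision_model=vision_model, text_model=text_model)

    # Index a product page with both copy and a photo
    await rag.index_document(
        doc_id="prod-123",
        text="Trail running shoe with waterproof membrane.",
        image_path="./images/prod-123.jpg",
        metadata={"category": "footwear"}
    )

    # Text-only query; pass query_image= to add an image to the query
    hits = await rag.retrieve("waterproof trail shoes", top_k=5)
    for hit in hits:
        print(hit["metadata"]["doc_id"], hit["score"])

asyncio.run(main())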

Multimodal Document Chunking

from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class Chunk:
    content: str
    chunk_type: Literal["text", "image", "table", "chart"]
    page: int
    image_path: Optional[str] = None
    embedding: Optional[list[float]] = None

def chunk_multimodal_document(pdf_path: str) -> list[Chunk]:
    """Chunk PDF preserving images and tables."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    chunks = []

    for page_num, page in enumerate(doc):
        # "dict" output interleaves text and image blocks in reading order,
        # and image blocks carry the binary image plus its extension
        blocks = page.get_text("dict")["blocks"]
        current_text = ""

        for block in blocks:
            if block["type"] == 0:  # Text block
                block_text = "\n".join(
                    "".join(span["text"] for span in line["spans"])
                    for line in block["lines"]
                )
                current_text += block_text + "\n"
            else:  # Image block
                # Flush the accumulated text as its own chunk
                if current_text.strip():
                    chunks.append(Chunk(
                        content=current_text.strip(),
                        chunk_type="text",
                        page=page_num
                    ))
                    current_text = ""

                # Save the embedded image to disk
                img_path = f"/tmp/page{page_num}_img{len(chunks)}.{block['ext']}"
                with open(img_path, "wb") as f:
                    f.write(block["image"])

                # Generate caption for the image
                caption = generate_image_caption(img_path)

                chunks.append(Chunk(
                    content=caption,
                    chunk_type="image",
                    page=page_num,
                    image_path=img_path
                ))

        # Final text chunk for the page
        if current_text.strip():
            chunks.append(Chunk(
                content=current_text.strip(),
                chunk_type="text",
                page=page_num
            ))

    return chunks
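
The chunker calls generate_image_caption, which is not defined above. A minimal sketch using a local BLIP captioner via the transformers image-to-text pipeline; the model choice is an assumption, and the skill's own generate_caption prompt against GPT-4o/Claude is the higher-quality option:

from transformers import pipeline

# Lightweight local captioner; swap in a vision LLM for richer, search-oriented captions
_captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def generate_image_caption(image_path: str) -> str:
    """Return a short caption used as the searchable text of an image chunk."""
    result = _captioner(image_path)
    return result[0]["generated_text"]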

Vector Database Setup (Milvus)

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

def setup_multimodal_collection():
    """Create Milvus collection for multimodal embeddings."""
    connections.connect("default", host="localhost", port="19530")

    fields = [
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=256),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=256),
        FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=32),
        FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="image_url", dtype=DataType.VARCHAR, max_length=1024),
        FieldSchema(name="page", dtype=DataType.INT64)
    ]

    schema = CollectionSchema(fields, "Multimodal document collection")
    collection = Collection("multimodal_docs", schema)

    # Create index for vector search
    index_params = {
        "metric_type": "COSINE",
        "index_type": "HNSW",
        "params": {"M": 16, "efConstruction": 256}
    }
    collection.create_index("embedding", index_params)

    return collection
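
Querying the collection after indexing, a sketch using the standard pymilvus search call (ef and the output fields are tunable):

def search_collection(collection, query_embedding: list[float], top_k: int = 10) -> list[dict]:
    """Vector search over the multimodal collection."""
    collection.load()
    results = collection.search(
        data=[query_embedding],
        anns_field="embedding",
        param={"metric_type": "COSINE", "params": {"ef": 64}},
        limit=top_k,
        output_fields=["doc_id", "chunk_type", "content", "image_url"]
    )
    return [
        {
            "doc_id": hit.entity.get("doc_id"),
            "score": hit.distance,
            "content": hit.entity.get("content"),
            "image_url": hit.entity.get("image_url"),
        }
        for hit in results[0]
    ]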

Multimodal Generation

async def generate_with_context(
    query: str,
    retrieved_chunks: list[Chunk],
    model: str = "claude-opus-4-5-20251124"
) -> str:
    """Generate response using multimodal context."""
    content = []

    # Add retrieved images first (attention positioning)
    for chunk in retrieved_chunks:
        if chunk.chunk_type == "image" and chunk.image_path:
            base64_data, media_type = encode_image_base64(chunk.image_path)
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64_data
                }
            })

    # Add text context
    text_context = "\n\n".join([
        f"[Page {c.page}]: {c.content}"
        for c in retrieved_chunks if c.chunk_type == "text"
    ])

    content.append({
        "type": "text",
        "text": f"""Use the following context to answer the question.

Context:
{text_context}

Question: {query}

Provide a detailed answer based on the context and images provided."""
    })

    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": content}]
    )

    return response.content[0].text
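
The generation step above assumes an Anthropic client and an encode_image_base64 helper. A minimal sketch of both (the synchronous client blocks, so switch to AsyncAnthropic on a hot path):

import base64
import mimetypes
import anthropic

client = anthropic.Anthropic()

def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Return (base64 data, media type) for an image file."""
    media_type = mimetypes.guess_type(image_path)[0] or "image/png"
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode()
    return data, media_type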

Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Long documents | Voyage multimodal-3 (32K context) |
| Scale retrieval | SigLIP 2 (optimized for large-scale) |
| PDF processing | ColPali (document-native) |
| Multi-modal search | Hybrid: CLIP + text embeddings |
| Production DB | Milvus or Pinecone with hybrid search |

Common Mistakes

  • Embedding images without captions (limits text search)
  • Not deduplicating by document ID
  • Not storing image URLs (can't display results)
  • Using only image OR text embeddings (use both)
  • Ignoring chunk boundaries (splitting mid-paragraph)
  • Not validating image retrieval quality

Related Skills

  • vision-language-models - Image analysis
  • embeddings - Text embedding patterns
  • rag-retrieval - Text RAG patterns
  • contextual-retrieval - Hybrid BM25+vector

Capability Details

image-embeddings

Keywords: CLIP, image embedding, visual features, SigLIP
Solves:

  • Convert images to vector representations
  • Enable image similarity search
  • Cross-modal retrieval

cross-modal-search

Keywords: text to image, image to text, cross-modal
Solves:

  • Find images from text queries
  • Find text from image queries
  • Bridge modalities

multimodal-chunking

Keywords: chunk PDF, split document, extract images
Solves:

  • Process documents with mixed content
  • Preserve image-text relationships
  • Handle tables and charts

hybrid-retrieval

Keywords: hybrid search, fusion, multi-embedding
Solves:

  • Combine text and image search
  • Improve retrieval accuracy
  • Handle diverse queries

Score

Total Score: 75/100 (based on repository quality metrics)

| Criterion | Requirement | Points |
|-----------|-------------|--------|
| SKILL.md | Contains a SKILL.md file | +20 |
| LICENSE | A license is set | +10 |
| Description | Description of 100+ characters | +10 |
| Popularity | 100+ GitHub stars | 0/15 |
| Recent activity | Updated within the last month | +10 |
| Forks | Forked 10+ times | 0/5 |
| Issue management | Fewer than 50 open issues | +5 |
| Language | A programming language is set | +5 |
| Tags | At least one tag is set | +5 |
