---
name: vision-language-models
description: GPT-5/4o, Claude 4.5, Gemini 2.5/3, Grok 4 vision patterns for image analysis, document understanding, and visual QA. Use when implementing image captioning, document/chart analysis, or multi-image comparison.
context: fork
agent: multimodal-specialist
version: 1.0.0
author: OrchestKit
user-invocable: false
tags: [vision, multimodal, image, gpt-5, claude-4, gemini, grok, vlm, 2026]
---

# Vision Language Models (2026)

Integrate vision capabilities from leading multimodal models for image understanding, document analysis, and visual reasoning.

## Overview

- Image captioning and description generation
- Visual question answering (VQA)
- Document/chart/diagram analysis with OCR
- Multi-image comparison and reasoning
- Bounding box detection and region analysis
- Video frame analysis

## Model Comparison (January 2026)

| Model | Context | Strengths | Vision Input |
|---|---|---|---|
| GPT-5.2 | 128K | Best general reasoning, multimodal | Up to 10 images |
| Claude Opus 4.5 | 200K | Best coding, sustained agent tasks | Up to 100 images |
| Gemini 2.5 Pro | 1M+ | Longest context, video analysis | 3,600 images max |
| Gemini 3 Pro | 1M | Deep Think, 100% AIME 2025 | Enhanced segmentation |
| Grok 4 | 2M | Real-time X integration, DeepSearch | Images + upcoming video |

## Image Input Methods

### Base64 Encoding (All Providers)

```python
import base64
import mimetypes

def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Encode a local image to base64 with its MIME type."""
    mime_type, _ = mimetypes.guess_type(image_path)
    mime_type = mime_type or "image/png"

    with open(image_path, "rb") as f:
        base64_data = base64.standard_b64encode(f.read()).decode("utf-8")

    return base64_data, mime_type
```
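
For OpenAI-style APIs that take data URLs, the returned tuple plugs straight into the payload. A minimal usage sketch (the file path is hypothetical):

```python
# Build a data URL from the helper's output (hypothetical file path)
b64, mime = encode_image_base64("charts/revenue_q3.png")
data_url = f"data:{mime};base64,{b64}"
```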

### OpenAI GPT-5/4o Vision

```python
from openai import OpenAI

client = OpenAI()

def analyze_image_openai(image_path: str, prompt: str) -> str:
    """Analyze an image using GPT-5 or GPT-4o."""
    base64_data, mime_type = encode_image_base64(image_path)

    response = client.chat.completions.create(
        model="gpt-5",  # or "gpt-4o", "gpt-4.1"
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}",
                    "detail": "high"  # "low", "high", or "auto"
                }}
            ]
        }],
        max_tokens=4096  # Set explicitly so vision responses aren't truncated
    )
    return response.choices[0].message.content
```
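
If the image is already hosted, the Chat Completions API also accepts a plain HTTPS URL in place of the data URL, which avoids the base64 size overhead. A sketch assuming the same message shape holds for GPT-5 (the URL is an example):

```python
# Same call shape, but with a hosted image instead of base64 (example URL)
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/photo.jpg",
                "detail": "auto"
            }}
        ]
    }],
    max_tokens=4096
)
```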

### Claude 4.5 Vision (Anthropic)

```python
import anthropic

client = anthropic.Anthropic()

def analyze_image_claude(image_path: str, prompt: str) -> str:
    """Analyze an image using Claude Opus 4.5 or Sonnet 4.5."""
    base64_data, media_type = encode_image_base64(image_path)

    response = client.messages.create(
        model="claude-opus-4-5-20251124",  # or claude-sonnet-4-5
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                # Image placed before the text, per Anthropic's guidance
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": base64_data
                    }
                },
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text
```

### Gemini 2.5/3 Vision (Google)

```python
import time

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

def analyze_image_gemini(image_path: str, prompt: str) -> str:
    """Analyze an image using Gemini 2.5 Pro or Gemini 3."""
    model = genai.GenerativeModel("gemini-2.5-pro")  # or gemini-3-pro

    image = Image.open(image_path)

    response = model.generate_content([prompt, image])
    return response.text

# For video analysis (Gemini excels here)
def analyze_video_gemini(video_path: str, prompt: str) -> str:
    """Analyze a video using Gemini's native video support."""
    model = genai.GenerativeModel("gemini-2.5-pro")

    video_file = genai.upload_file(video_path)
    # Uploaded videos are processed asynchronously; poll until ready
    while video_file.state.name == "PROCESSING":
        time.sleep(2)
        video_file = genai.get_file(video_file.name)

    response = model.generate_content([prompt, video_file])
    return response.text
```

### Grok 4 Vision (xAI)

```python
from openai import OpenAI  # Grok uses an OpenAI-compatible API

client = OpenAI(
    api_key="YOUR_XAI_API_KEY",
    base_url="https://api.x.ai/v1"
)

def analyze_image_grok(image_path: str, prompt: str) -> str:
    """Analyze an image using Grok 4 with real-time capabilities."""
    base64_data, mime_type = encode_image_base64(image_path)

    response = client.chat.completions.create(
        model="grok-4",  # or grok-2-vision-1212
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}"
                }}
            ]
        }]
    )
    return response.choices[0].message.content
```

## Multi-Image Analysis

```python
import anthropic

client = anthropic.Anthropic()  # Claude accepts the largest batches here

def compare_images(images: list[str], prompt: str) -> str:
    """Compare multiple images (Claude supports up to 100 per request)."""
    content = []

    for img_path in images:
        base64_data, media_type = encode_image_base64(img_path)
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64_data
            }
        })

    content.append({"type": "text", "text": prompt})

    response = client.messages.create(
        model="claude-opus-4-5-20251124",
        max_tokens=8192,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
```
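
Anthropic's docs suggest labeling each image so the prompt can reference them unambiguously. A variant of the loop above (same helper and inputs; the closing question is an example) interleaves text labels:

```python
# Interleave "Image N:" labels so the prompt can reference images by number
content = []
for i, img_path in enumerate(images, start=1):
    base64_data, media_type = encode_image_base64(img_path)
    content.append({"type": "text", "text": f"Image {i}:"})
    content.append({
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": base64_data},
    })
content.append({"type": "text", "text": "How does Image 1 differ from Image 2?"})
```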

## Object Detection (Gemini 2.5+)

```python
import json

def detect_objects_gemini(image_path: str) -> list[dict]:
    """Detect objects with bounding boxes using Gemini 2.5+."""
    model = genai.GenerativeModel("gemini-2.5-pro")
    image = Image.open(image_path)

    response = model.generate_content([
        "Detect all objects in this image. Return bounding boxes "
        "as JSON with format: {objects: [{label, box: [x1,y1,x2,y2]}]}",
        image
    ])

    # The model may wrap its JSON in a markdown fence; strip it if present
    text = response.text.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    return json.loads(text).get("objects", [])
```
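
Note that Gemini's documented detection convention returns `box_2d` coordinates normalized to a 0-1000 grid in `[y_min, x_min, y_max, x_max]` order. If you prompt for that native format instead of the custom one above, a small helper (hypothetical, not part of any SDK) maps boxes back to pixels:

```python
def to_pixel_box(box: list[int], width: int, height: int) -> tuple[int, int, int, int]:
    """Convert a 0-1000 normalized [y_min, x_min, y_max, x_max] box to pixel (x1, y1, x2, y2)."""
    y_min, x_min, y_max, x_max = box
    return (
        int(x_min / 1000 * width),
        int(y_min / 1000 * height),
        int(x_max / 1000 * width),
        int(y_max / 1000 * height),
    )
```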

## Token Cost Optimization

| Provider | Detail Level / Token Basis | Cost Impact |
|---|---|---|
| OpenAI | `low` (65 tokens) | Use for classification |
| OpenAI | `high` (129+ tokens/tile) | Use for OCR/charts |
| Gemini | 258 tokens base | Scales with resolution |
| Claude | Per-image pricing | Batch for efficiency |

```python
# Cost-optimized simple classification
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Cheaper for simple tasks
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is there a person? Reply: yes/no"},
            {"type": "image_url", "image_url": {
                "url": image_url,
                "detail": "low"  # Minimal tokens
            }}
        ]
    }]
)
```

## Image Size Limits (2026)

| Provider | Max Size | Max Images | Notes |
|---|---|---|---|
| OpenAI | 20MB | 10/request | GPT-5 series |
| Claude | 8000x8000 px | 100/request | 2000px cap if >20 images |
| Gemini | 20MB | 3,600/request | Best for batch |
| Grok | 20MB | Limited | Grok 5 expands this |

## Key Decisions

| Decision | Recommendation |
|---|---|
| High accuracy | Claude Opus 4.5 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost efficiency | Gemini 2.5 Flash ($0.15/M tokens) |
| Real-time/X data | Grok 4 with DeepSearch |
| Video analysis | Gemini 2.5/3 Pro (native) |
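
These recommendations can be encoded as a simple routing table. A minimal sketch; the task keys and mapping are illustrative, not part of any provider SDK:

```python
# Illustrative task -> model routing derived from the table above
MODEL_FOR_TASK = {
    "high_accuracy": "claude-opus-4-5-20251124",
    "long_documents": "gemini-2.5-pro",
    "cost_efficiency": "gemini-2.5-flash",
    "realtime_x_data": "grok-4",
    "video_analysis": "gemini-2.5-pro",
}

def pick_model(task: str) -> str:
    """Return the recommended model for a task, defaulting to GPT-5."""
    return MODEL_FOR_TASK.get(task, "gpt-5")
```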

## Common Mistakes

- Not setting `max_tokens` (responses get truncated)
- Sending oversized images; resize to 2048px max (see the preprocessing sketch below)
- Using `detail: "high"` for yes/no questions
- Not validating image format before encoding
- Ignoring rate limits on vision endpoints
- Using deprecated models (GPT-4V is retired)
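
A minimal preprocessing sketch that guards against the oversized-image and format-validation mistakes, assuming Pillow is available (the 2048px cap follows the checklist above); feed the returned bytes to `base64.standard_b64encode` in place of the raw file read:

```python
from io import BytesIO

from PIL import Image

MAX_EDGE = 2048  # resize cap from the checklist above
ALLOWED_FORMATS = {"PNG", "JPEG", "WEBP", "GIF"}

def prepare_image(image_path: str) -> bytes:
    """Validate format and downscale an image before base64 encoding."""
    img = Image.open(image_path)
    fmt = img.format  # capture before any in-place operation
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"Unsupported image format: {fmt}")
    img.thumbnail((MAX_EDGE, MAX_EDGE))  # in-place, preserves aspect ratio
    buf = BytesIO()
    img.save(buf, format=fmt)
    return buf.getvalue()
```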

## Limitations

- Cannot identify specific people (privacy restriction)
- May hallucinate on low-quality or rotated images (<200px)
- GPT-4o struggles with non-Latin text and precise spatial reasoning
- No real-time video; extract frames first, except with Gemini's native video support

## Related Skills

- audio-language-models - Audio/speech processing
- multimodal-rag - Image + text retrieval
- llm-streaming - Streaming vision responses

## Capability Details

### image-captioning

Keywords: caption, describe, image description, alt text, accessibility

Solves:

- Generate descriptive captions for images
- Create accessibility alt text
- Extract visual content summary

### visual-qa

Keywords: VQA, visual question, image question, analyze image

Solves:

- Answer questions about image content
- Extract specific information from visuals
- Reason about image elements

### document-vision

Keywords: document, PDF, chart, diagram, OCR, extract, table

Solves:

- Extract text from documents and charts
- Analyze diagrams and flowcharts
- Process forms and tables with structure

### multi-image-analysis

Keywords: compare images, multiple images, image comparison, batch

Solves:

- Compare visual elements across images
- Track changes between versions
- Analyze image sequences

### object-detection

Keywords: bounding box, detect objects, locate, segmentation

Solves:

- Detect and locate objects in images
- Generate bounding box coordinates
- Segment image regions (Gemini 2.5+)
