web-scraping

by cotdp

Context-optimized MCP server for web scraping. Reduces LLM token usage by 70-90% through server-side CSS filtering and HTML-to-markdown conversion.

SKILL.md


Web Scraping Skill

Toolkit for efficient web content extraction using the scraper MCP server tools.

When to Use This Skill

  • Extracting content from web pages for analysis
  • Converting web pages to markdown for LLM consumption
  • Extracting plain text from HTML documents
  • Harvesting links from web pages
  • Batch processing multiple URLs concurrently

Available Tools

| Tool | Purpose | Best For |
| --- | --- | --- |
| mcp__scraper__scrape_url | Convert HTML to markdown | LLM-friendly content extraction |
| mcp__scraper__scrape_url_html | Raw HTML content | DOM inspection, metadata extraction |
| mcp__scraper__scrape_url_text | Plain text extraction | Clean text without formatting |
| mcp__scraper__scrape_extract_links | Link harvesting | Site mapping, crawling |

Tool Usage

1. Markdown Conversion

Convert web pages to clean markdown format:

mcp__scraper__scrape_url(
    urls=["https://example.com/article"],
    css_selector=".article-content",
    timeout=30,
    max_retries=3
)

Response includes (example below):

  • content: Markdown-formatted text
  • url: Final URL (after redirects)
  • status_code: HTTP status
  • metadata: Headers, timing, retry info
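
For illustration, a single-URL response might look like the following (field values and the exact metadata keys are hypothetical):

{
    "content": "# Article Title\n\nBody text as markdown...",
    "url": "https://example.com/article",
    "status_code": 200,
    "metadata": {"elapsed_ms": 412, "retries": 0}
}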

2. Raw HTML Extraction

Get unprocessed HTML for DOM analysis:

mcp__scraper__scrape_url_html(
    urls=["https://example.com"],
    css_selector="meta",
    timeout=30
)

Use cases:

  • Extracting meta tags and Open Graph data
  • Inspecting page structure
  • Getting specific HTML elements

3. Plain Text Extraction

Extract readable text without HTML markup:

mcp__scraper__scrape_url_text(
    urls=["https://example.com/page"],
    strip_tags=["script", "style", "nav", "footer"],
    css_selector="#main-content"
)

Parameters:

  • strip_tags: HTML elements to remove before extraction (default: script, style, meta, link, noscript)

4. Link Extraction

Harvest all links from a page:

mcp__scraper__scrape_extract_links(
    urls=["https://example.com"],
    css_selector="nav.primary"
)

Response includes (example below):

  • links: Array of {url, text, title} objects
  • count: Total links found
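
A hypothetical response (values invented for illustration):

{
    "links": [
        {"url": "https://example.com/about", "text": "About", "title": "About Us"},
        {"url": "https://example.com/docs", "text": "Docs", "title": ""}
    ],
    "count": 2
}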

CSS Selector Filtering

All tools support the css_selector parameter for targeted extraction.

Common Patterns

# By tag
css_selector="article"

# By class
css_selector=".main-content"

# By ID
css_selector="#article-body"

# By attribute
css_selector='meta[property^="og:"]'

# Multiple selectors
css_selector="h1, h2, h3"

# Nested elements
css_selector="article .content p"

# Pseudo-selectors
css_selector="p:first-of-type"

Example: Extract Open Graph Metadata

mcp__scraper__scrape_url_html(
    urls=["https://example.com"],
    css_selector='meta[property^="og:"], meta[name^="twitter:"]'
)
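
Once the matching tags come back, a small post-processing step can turn them into a dict. The sketch below assumes the response's content field carries the matched HTML and that BeautifulSoup is installed; parse_social_meta is a hypothetical helper, not part of the MCP server:

# Collect og:/twitter: meta tags from scraped HTML into a dict.
from bs4 import BeautifulSoup

def parse_social_meta(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    meta = {}
    for tag in soup.find_all("meta"):
        key = tag.get("property") or tag.get("name")
        if key and key.startswith(("og:", "twitter:")):
            meta[key] = tag.get("content", "")
    return meta

# parse_social_meta('<meta property="og:title" content="Example">')
# returns {"og:title": "Example"}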

Batch Operations

Process multiple URLs concurrently by passing a list:

mcp__scraper__scrape_url(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ],
    css_selector=".content"
)

Response structure:

{
    "results": [...],
    "total": 3,
    "successful": 3,
    "failed": 0
}

Individual failures don't stop the batch; each result includes its own success or error status.
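
A minimal pattern for splitting a batch into successes and failures; the per-result field names (url, status_code, content) follow the single-URL example above and are assumptions, not confirmed server output:

# response = parsed output of a batch scrape_url call (values illustrative)
response = {
    "results": [
        {"url": "https://example.com/page1", "status_code": 200, "content": "..."},
        {"url": "https://example.com/page2", "status_code": 503, "content": None},
    ],
    "total": 2, "successful": 1, "failed": 1,
}

succeeded = [r for r in response["results"] if r.get("status_code") == 200]
failed = [r for r in response["results"] if r.get("status_code") != 200]

for r in failed:
    # Each failed entry still identifies its URL, so it can be retried on its own.
    print("retry candidate:", r["url"])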

Retry Behavior

All tools implement exponential backoff (sketched below):

  • Default retries: 3 attempts
  • Backoff schedule: 1s → 2s → 4s
  • Retryable errors: Timeouts, connection errors, HTTP errors
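
As a minimal sketch of that schedule (this mirrors the documented 1s → 2s → 4s doubling, not the server's actual source):

import time

def with_backoff(fetch, max_retries=3, base=1.0):
    # Retry a callable with doubling delays: 1s, then 2s, then 4s by default.
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(base * 2 ** attempt)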

Override defaults when needed:

# Quick fail for time-sensitive scraping
mcp__scraper__scrape_url(
    urls=["https://api.example.com/data"],
    max_retries=1,
    timeout=10
)

# Patient scraping for unreliable sources
mcp__scraper__scrape_url(
    urls=["https://slow-site.com"],
    max_retries=5,
    timeout=60
)

Workflow Examples

Extract Article Content

# Get main article as markdown
mcp__scraper__scrape_url(
    urls=["https://blog.example.com/post"],
    css_selector="article.post-content"
)

Scrape Product Information

# Get product details
mcp__scraper__scrape_url_text(
    urls=["https://shop.example.com/product/123"],
    css_selector=".product-info, .price, .description"
)

Map Site Navigation

# Extract all navigation links
mcp__scraper__scrape_extract_links(
    urls=["https://example.com"],
    css_selector="nav, footer"
)

Batch Research

# Process multiple sources
mcp__scraper__scrape_url(
    urls=[
        "https://source1.com/article",
        "https://source2.com/report",
        "https://source3.com/analysis"
    ],
    css_selector="article, .main-content, #content"
)
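
To fold a batch like this into a single research digest, the successful results can be concatenated. A sketch, reusing the content/url field names assumed earlier:

def to_digest(response: dict) -> str:
    # Join each successful result into one markdown document, separated by rules.
    sections = [
        f"## {r['url']}\n\n{r['content']}"
        for r in response.get("results", [])
        if r.get("content")
    ]
    return "\n\n---\n\n".join(sections)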
