web-scraping

by cotdp

Context-optimized MCP server for web scraping. Reduces LLM token usage by 70-90% through server-side CSS filtering and HTML-to-markdown conversion.

SKILL.md


Web Scraping Skill

Toolkit for efficient web content extraction using the scraper MCP server tools.

When to Use This Skill

  • Extracting content from web pages for analysis
  • Converting web pages to markdown for LLM consumption
  • Extracting plain text from HTML documents
  • Harvesting links from web pages
  • Batch processing multiple URLs concurrently

Available Tools

| Tool | Purpose | Best For |
| --- | --- | --- |
| mcp__scraper__scrape_url | Convert HTML to markdown | LLM-friendly content extraction |
| mcp__scraper__scrape_url_html | Raw HTML content | DOM inspection, metadata extraction |
| mcp__scraper__scrape_url_text | Plain text extraction | Clean text without formatting |
| mcp__scraper__scrape_extract_links | Link harvesting | Site mapping, crawling |

Tool Usage

1. Markdown Conversion

Convert web pages to clean markdown format:

mcp__scraper__scrape_url(
    urls=["https://example.com/article"],
    css_selector=".article-content",
    timeout=30,
    max_retries=3
)

Response includes (example below):

  • content: Markdown-formatted text
  • url: Final URL (after redirects)
  • status_code: HTTP status
  • metadata: Headers, timing, retry info
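
For illustration, a single-URL response might look like the following (field values and the exact metadata keys are hypothetical):

{
    "content": "# Article Title\n\nBody text as markdown...",
    "url": "https://example.com/article",
    "status_code": 200,
    "metadata": {"elapsed_ms": 412, "retries": 0}
}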

2. Raw HTML Extraction

Get unprocessed HTML for DOM analysis:

mcp__scraper__scrape_url_html(
    urls=["https://example.com"],
    css_selector="meta",
    timeout=30
)

Use cases:

  • Extracting meta tags and Open Graph data
  • Inspecting page structure
  • Getting specific HTML elements

3. Plain Text Extraction

Extract readable text without HTML markup:

mcp__scraper__scrape_url_text(
    urls=["https://example.com/page"],
    strip_tags=["script", "style", "nav", "footer"],
    css_selector="#main-content"
)

Parameters:

  • strip_tags: HTML elements to remove before extraction (default: script, style, meta, link, noscript)

4. Link Extraction

Harvest all links from a page:

mcp__scraper__scrape_extract_links(
    urls=["https://example.com"],
    css_selector="nav.primary"
)

Response includes (example below):

  • links: Array of {url, text, title} objects
  • count: Total links found
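
A hypothetical response (values invented for illustration):

{
    "links": [
        {"url": "https://example.com/about", "text": "About", "title": "About Us"},
        {"url": "https://example.com/docs", "text": "Docs", "title": ""}
    ],
    "count": 2
}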

CSS Selector Filtering

All tools support the css_selector parameter for targeted extraction.

Common Patterns

# By tag
css_selector="article"

# By class
css_selector=".main-content"

# By ID
css_selector="#article-body"

# By attribute
css_selector='meta[property^="og:"]'

# Multiple selectors
css_selector="h1, h2, h3"

# Nested elements
css_selector="article .content p"

# Pseudo-selectors
css_selector="p:first-of-type"

Example: Extract Open Graph Metadata

mcp__scraper__scrape_url_html(
    urls=["https://example.com"],
    css_selector='meta[property^="og:"], meta[name^="twitter:"]'
)
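
Once the matching tags come back, a small post-processing step can turn them into a dict. The sketch below assumes the response's content field carries the matched HTML and that BeautifulSoup is installed; parse_social_meta is a hypothetical helper, not part of the MCP server:

# Collect og:/twitter: meta tags from scraped HTML into a dict.
from bs4 import BeautifulSoup

def parse_social_meta(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    meta = {}
    for tag in soup.find_all("meta"):
        key = tag.get("property") or tag.get("name")
        if key and key.startswith(("og:", "twitter:")):
            meta[key] = tag.get("content", "")
    return meta

# parse_social_meta('<meta property="og:title" content="Example">')
# returns {"og:title": "Example"}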

Batch Operations

Process multiple URLs concurrently by passing a list:

mcp__scraper__scrape_url(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ],
    css_selector=".content"
)

Response structure:

{
    "results": [...],
    "total": 3,
    "successful": 3,
    "failed": 0
}

Individual failures don't stop the batch; each result includes its own success or error status.
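
A minimal pattern for splitting a batch into successes and failures; the per-result field names (url, status_code, content) follow the single-URL example above and are assumptions, not confirmed server output:

# response = parsed output of a batch scrape_url call (values illustrative)
response = {
    "results": [
        {"url": "https://example.com/page1", "status_code": 200, "content": "..."},
        {"url": "https://example.com/page2", "status_code": 503, "content": None},
    ],
    "total": 2, "successful": 1, "failed": 1,
}

succeeded = [r for r in response["results"] if r.get("status_code") == 200]
failed = [r for r in response["results"] if r.get("status_code") != 200]

for r in failed:
    # Each failed entry still identifies its URL, so it can be retried on its own.
    print("retry candidate:", r["url"])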

Retry Behavior

All tools implement exponential backoff (sketched below):

  • Default retries: 3 attempts
  • Backoff schedule: 1s → 2s → 4s
  • Retryable errors: Timeouts, connection errors, HTTP errors
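
As a minimal sketch of that schedule (this mirrors the documented 1s → 2s → 4s doubling, not the server's actual source):

import time

def with_backoff(fetch, max_retries=3, base=1.0):
    # Retry a callable with doubling delays: 1s, then 2s, then 4s by default.
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(base * 2 ** attempt)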

Override defaults when needed:

# Quick fail for time-sensitive scraping
mcp__scraper__scrape_url(
    urls=["https://api.example.com/data"],
    max_retries=1,
    timeout=10
)

# Patient scraping for unreliable sources
mcp__scraper__scrape_url(
    urls=["https://slow-site.com"],
    max_retries=5,
    timeout=60
)

Workflow Examples

Extract Article Content

# Get main article as markdown
mcp__scraper__scrape_url(
    urls=["https://blog.example.com/post"],
    css_selector="article.post-content"
)

Scrape Product Information

# Get product details
mcp__scraper__scrape_url_text(
    urls=["https://shop.example.com/product/123"],
    css_selector=".product-info, .price, .description"
)

Map Site Navigation

# Extract all navigation links
mcp__scraper__scrape_extract_links(
    urls=["https://example.com"],
    css_selector="nav, footer"
)

Batch Research

# Process multiple sources
mcp__scraper__scrape_url(
    urls=[
        "https://source1.com/article",
        "https://source2.com/report",
        "https://source3.com/analysis"
    ],
    css_selector="article, .main-content, #content"
)
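
To fold a batch like this into a single research digest, the successful results can be concatenated. A sketch, reusing the content/url field names assumed earlier:

def to_digest(response: dict) -> str:
    # Join each successful result into one markdown document, separated by rules.
    sections = [
        f"## {r['url']}\n\n{r['content']}"
        for r in response.get("results", [])
        if r.get("content")
    ]
    return "\n\n---\n\n".join(sections)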
