web-scrape

Name: web-scrape
Rating: 60
Author: aiskillstore

Security-audited skills for Claude, Codex & Claude Code. One-click install, quality verified.

⭐ 102 🍴 3 📅 Jan 23, 2026

ai-skills claude claude-code claude-skills codex codex-skills skills utility-development

SKILL.md

name: web-scrape
description: Intelligent web scraper with content extraction, multiple output formats, and error handling
version: 3.0.0

Web Scraping Skill v3.0

Usage

/web-scrape <url> [options]

Options:

--format=markdown|json|text - Output format (default: markdown)
--full - Include full page content (skip smart extraction)
--screenshot - Also save a screenshot
--scroll - Scroll to load dynamic content (infinite scroll pages)

Examples:

/web-scrape https://example.com/article
/web-scrape https://news.site.com/story --format=json
/web-scrape https://spa-app.com/page --scroll --screenshot
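
The option syntax above can be sketched as a small parser. This is a minimal illustration in Python, not the skill's actual implementation: `ScrapeOptions` and `parse_invocation` are hypothetical names, and rejecting unknown flags is an assumption the skill does not state.

```python
from dataclasses import dataclass

VALID_FORMATS = {"markdown", "json", "text"}


@dataclass
class ScrapeOptions:
    url: str
    format: str = "markdown"  # documented default
    full: bool = False
    screenshot: bool = False
    scroll: bool = False


def parse_invocation(command: str) -> ScrapeOptions:
    """Parse '/web-scrape <url> [options]' into a ScrapeOptions record."""
    parts = command.split()
    if len(parts) < 2 or parts[0] != "/web-scrape":
        raise ValueError("expected: /web-scrape <url> [options]")
    opts = ScrapeOptions(url=parts[1])
    for flag in parts[2:]:
        if flag.startswith("--format="):
            value = flag.split("=", 1)[1]
            if value not in VALID_FORMATS:
                raise ValueError(f"unsupported format: {value}")
            opts.format = value
        elif flag in ("--full", "--screenshot", "--scroll"):
            # "--scroll" sets opts.scroll, and so on.
            setattr(opts, flag[2:], True)
        else:
            # Assumption: unknown flags are rejected rather than ignored.
            raise ValueError(f"unknown option: {flag}")
    return opts
```

For example, `parse_invocation("/web-scrape https://news.site.com/story --format=json")` yields a record with `format` set to `"json"` and every boolean flag left at `False`.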

Execution Flow

Phase 1: Navigate and Load

1. mcp__playwright__browser_navigate
   url: "<target URL>"

2. mcp__playwright__browser_wait_for
   time: 2  (allow initial render)

If --scroll is set, run a scroll sequence to trigger lazy loading:

3. mcp__playwright__browser_evaluate
   function: "async () => {
     for (let i = 0; i < 3; i++) {
       window.scrollTo(0, document.body.scrollHeight);
       await new Promise(r => setTimeout(r, 1000));
     }
     window.scrollTo(0, 0);
   }"

Phase 2: Capture Content

4. mcp__playwright__browser_snapshot
   → Returns full accessibility tree with all text content

If --screenshot is set:

5. mcp__playwright__browser_take_screenshot
   filename: "scraped_<domain>_<timestamp>.png"
   fullPage: true
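
The `scraped_<domain>_<timestamp>.png` pattern could be filled in as follows. This is a sketch only: `screenshot_filename` is a hypothetical helper, and the compact UTC timestamp format is an assumption, since the skill does not specify one.

```python
import re
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlparse


def screenshot_filename(url: str, now: Optional[datetime] = None) -> str:
    """Build 'scraped_<domain>_<timestamp>.png' for the given URL."""
    host = urlparse(url).hostname or "unknown"
    # Keep only filename-safe characters in the domain part.
    domain = re.sub(r"[^A-Za-z0-9.-]", "_", host)
    now = now or datetime.now(timezone.utc)
    # Assumption: compact UTC timestamp; the skill leaves the format open.
    return f"scraped_{domain}_{now.strftime('%Y%m%dT%H%M%SZ')}.png"
```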

Phase 3: Close Browser

6. mcp__playwright__browser_close

Smart Content Extraction

After getting the snapshot, apply intelligent extraction:

Step 1: Identify Content Type

| Page Type | Indicators | Extraction Strategy |
|-----------|------------|---------------------|
| Article/Blog | `<article>`, long paragraphs, date/author | Extract main article body |
| Product Page | Price, "Add to Cart", specs | Extract title, price, description, specs |
| Documentation | Code blocks, headings hierarchy | Preserve structure and code |
| List/Search | Repeated item patterns | Extract as structured list |
| Landing Page | Hero section, CTAs | Extract key messaging |
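
One way to turn the indicators in this table into a heuristic is sketched below. It assumes the snapshot has already been flattened into `(role, text)` nodes. The `Node` type, the role names, and every threshold are illustrative assumptions, not something the skill defines.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    role: str  # assumed role names: "heading", "paragraph", "code", "listitem", ...
    text: str


PRICE = re.compile(r"[$€£]\s?\d")


def classify_page(nodes: List[Node]) -> str:
    """Guess a page type from flattened snapshot nodes."""
    roles = [n.role for n in nodes]
    has_price = any(PRICE.search(n.text) for n in nodes)
    has_cart = any("add to cart" in n.text.lower() for n in nodes)
    if has_price and has_cart:
        return "product"
    # Thresholds below are arbitrary starting points, not tuned values.
    if roles.count("code") >= 2 and roles.count("heading") >= 3:
        return "docs"
    if roles.count("listitem") >= 10:
        return "list"
    long_paragraphs = sum(
        1 for n in nodes if n.role == "paragraph" and len(n.text) > 200
    )
    if "article" in roles or long_paragraphs >= 3:
        return "article"
    return "landing"
```

The checks run from the most specific signal (price plus purchase button) to the least, with the landing page as the fallback when nothing else matches.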

Step 2: Filter Noise

ALWAYS REMOVE these elements from output:

Navigation menus and breadcrumbs
Footer content (copyright, links)
Sidebars (ads, related articles, social links)
Cookie banners and popups
Comments section (unless specifically requested)
Share buttons and social widgets
Login/signup prompts
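
The removal list above can be approximated by pruning subtrees of the accessibility tree. The sketch assumes a simplified `TreeNode` shape (hypothetical), uses the ARIA landmark roles `navigation`, `contentinfo`, and `complementary` for menus, footers, and sidebars, and matches the remaining items by label keywords, which is a rough heuristic.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NOISE_ROLES = {"navigation", "contentinfo", "complementary"}
NOISE_LABELS = ("cookie", "breadcrumb", "comment", "share", "sign up", "log in")
# Label matching is limited to these roles so body text is never dropped.
LABELLED_ROLES = {"region", "dialog", "group", "list", "button", "link"}


@dataclass
class TreeNode:
    role: str
    name: str = ""
    children: List["TreeNode"] = field(default_factory=list)


def strip_noise(node: TreeNode, keep_comments: bool = False) -> Optional[TreeNode]:
    """Return a copy of the tree without noise subtrees, or None if node is noise."""
    labels = [k for k in NOISE_LABELS if not (keep_comments and k == "comment")]
    if node.role in NOISE_ROLES:
        return None
    if node.role in LABELLED_ROLES and any(k in node.name.lower() for k in labels):
        return None
    kept = []
    for child in node.children:
        cleaned = strip_noise(child, keep_comments)
        if cleaned is not None:
            kept.append(cleaned)
    return TreeNode(node.role, node.name, kept)
```

Passing `keep_comments=True` covers the "unless specifically requested" exception for comment sections.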

Step 3: Structure the Content

For Articles:

# [Title]

**Source:** [URL]
**Date:** [if available]
**Author:** [if available]

---

[Main content in clean markdown]

For Product Pages:

# [Product Name]

**Price:** [price]
**Availability:** [in stock/out of stock]

## Description
[product description]

## Specifications
| Spec | Value |
|------|-------|
| ... | ... |
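
Filling the product template from extracted fields might look like this. `render_product` and its parameters are hypothetical names, and escaping `|` in table cells is an added precaution rather than something the skill requires.

```python
from typing import Dict


def _cell(text: str) -> str:
    # Escape pipes so a value cannot break the markdown table row.
    return text.replace("|", "\\|").strip()


def render_product(name: str, price: str, availability: str,
                   description: str, specs: Dict[str, str]) -> str:
    """Render extracted product fields into the product-page template."""
    lines = [
        f"# {name}",
        "",
        f"**Price:** {price}",
        f"**Availability:** {availability}",
        "",
        "## Description",
        description.strip(),
        "",
        "## Specifications",
        "| Spec | Value |",
        "|------|-------|",
    ]
    for key, value in specs.items():
        lines.append(f"| {_cell(key)} | {_cell(value)} |")
    return "\n".join(lines) + "\n"
```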

Output Formats

Markdown (default)

Clean, readable markdown with proper headings, lists, and formatting.

JSON

{
  "url": "https://...",
  "title": "Page Title",
  "type": "article|product|docs|list",
  "content": {
    "main": "...",
    "metadata": {}
  },
  "extracted_at": "ISO timestamp"
}
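
A sketch of assembling this JSON shape follows. `build_json_output` is a hypothetical helper. The `type` field above lists four values and has none for landing pages, so the sketch rejects any other value, which is an assumption.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional

PAGE_TYPES = {"article", "product", "docs", "list"}


def build_json_output(url: str, title: str, page_type: str, main: str,
                      metadata: Optional[Dict[str, Any]] = None,
                      now: Optional[datetime] = None) -> str:
    """Serialize one scrape result in the documented JSON layout."""
    if page_type not in PAGE_TYPES:
        # Assumption: values outside the documented set are an error.
        raise ValueError(f"unsupported type: {page_type}")
    payload = {
        "url": url,
        "title": title,
        "type": page_type,
        "content": {"main": main, "metadata": metadata or {}},
        # ISO 8601 timestamp in UTC.
        "extracted_at": (now or datetime.now(timezone.utc)).isoformat(),
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)
```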

Text

Plain text with minimal formatting, suitable for further processing.

Error Handling

| Error | Detection | Action |
|-------|-----------|--------|
| Timeout | Page doesn't load in 30s | Report error, suggest retry |
| 404 Not Found | "404" in title/content | Report "Page not found" |
| 403 Forbidden | "403", "Access Denied" | Report access restriction |
| CAPTCHA | "captcha", "verify you're human" | Report CAPTCHA detected, cannot proceed |
| Paywall | "subscribe", "premium content" | Extract visible content, note paywall |
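
The text-based detections in this table can be sketched as an ordered keyword scan. `detect_block` and the rule order are assumptions. Plain substring matching follows the table literally and will produce false positives, for example an ordinary article that mentions "subscribe" or contains the number 404, so treat a match as a hint rather than a verdict.

```python
from typing import Optional

# Ordered so that hard blocks are reported before softer ones (an assumption).
RULES = [
    ("captcha", ("captcha", "verify you're human")),
    ("forbidden", ("403", "access denied")),
    ("not_found", ("404",)),
    ("paywall", ("subscribe", "premium content")),
]


def detect_block(title: str, content: str) -> Optional[str]:
    """Return the first matching condition label, or None if nothing matches."""
    haystack = f"{title}\n{content}".lower()
    for label, needles in RULES:
        if any(needle in haystack for needle in needles):
            return label
    return None
```

Timeouts are not covered here because they are detected by the 30-second load limit, not by page text.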

Recovery Actions

If page load fails:
1. Report the specific error to user
2. Suggest: "Try again?" or "Different URL?"
3. Close browser cleanly

If content is blocked:
1. Report what was detected (CAPTCHA/paywall/geo-block)
2. Extract any available preview content
3. Suggest alternatives if applicable

Advanced Scenarios

Single Page Applications (SPA)

1. Navigate to URL
2. Wait longer (3-5 seconds) for JS hydration
3. Use browser_wait_for with specific text if known
4. Then snapshot

Infinite Scroll Pages

1. Navigate
2. Execute scroll loop (see Phase 1)
3. Snapshot after scrolling completes

Pages with Click-to-Reveal Content

1. Snapshot first to identify clickable elements
2. Use browser_click on "Read more" / "Show all" buttons
3. Wait briefly
4. Snapshot again for full content

Multi-page Articles

1. Scrape first page
2. Identify "Next" or pagination links
3. Ask user: "Article has X pages. Scrape all?"
4. If yes, iterate through pages and combine

Performance Guidelines

| Metric | Target | How |
|--------|--------|-----|
| Speed | < 15 seconds | Minimal waits, parallel where possible |
| Token Usage | < 5000 tokens | Smart extraction, not full DOM |
| Reliability | > 95% success | Proper error handling |
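
To check output against the token target, a rough estimate is usually enough. The sketch below assumes about 4 characters per token, a common rule of thumb for English text and not a real tokenizer. `trim_to_budget` is a hypothetical fallback the skill does not prescribe.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: ~4 characters per token; real tokenizers will differ.
    return (len(text) + 3) // 4


def trim_to_budget(markdown: str, budget: int = 5000) -> str:
    """Keep whole paragraphs until the estimated token budget is reached."""
    kept = []
    used = 0
    for paragraph in markdown.split("\n\n"):
        cost = estimate_tokens(paragraph)
        # The first paragraph is always kept, even if it alone exceeds the budget.
        if kept and used + cost > budget:
            break
        kept.append(paragraph)
        used += cost
    return "\n\n".join(kept)
```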

Security Notes

Never execute arbitrary JavaScript from the page
Don't follow redirects to suspicious domains
Don't submit forms or click login buttons
Don't scrape pages that require authentication (unless the user provides a credentials flow)
Respect robots.txt when the user mentions it
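
The skill does not define what makes a redirect target suspicious. One conservative reading, sketched here as an assumption, is to flag any redirect that leaves the requested host and its subdomains. `is_same_site` is a hypothetical helper and does not consult the public suffix list, so it is stricter than a true registrable-domain comparison.

```python
from urllib.parse import urlparse


def _host(url: str) -> str:
    host = (urlparse(url).hostname or "").lower()
    # Treat "www.example.com" and "example.com" as the same host.
    return host[4:] if host.startswith("www.") else host


def is_same_site(requested: str, final: str) -> bool:
    """True when `final` stays on the requested host or a sub/parent domain of it."""
    a, b = _host(requested), _host(final)
    if not a or not b:
        return False
    return a == b or b.endswith("." + a) or a.endswith("." + b)
```

A `False` result would be a reason to stop and report the redirect to the user and not scrape the new location.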

Quick Reference

Minimum viable scrape (4 tool calls):

1. browser_navigate → 2. browser_wait_for → 3. browser_snapshot → 4. browser_close

Full-featured scrape (with scroll + screenshot):

1. browser_navigate
2. browser_wait_for
3. browser_evaluate (scroll)
4. browser_snapshot
5. browser_take_screenshot
6. browser_close
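
Both sequences above follow from the two optional flags. A minimal sketch, with `plan_tool_calls` as a hypothetical helper and the tool names taken from the lists above:

```python
from typing import List


def plan_tool_calls(scroll: bool = False, screenshot: bool = False) -> List[str]:
    """Return the ordered tool calls for a scrape with the given options."""
    calls = ["browser_navigate", "browser_wait_for"]
    if scroll:
        calls.append("browser_evaluate")  # scroll loop from Phase 1
    calls.append("browser_snapshot")
    if screenshot:
        calls.append("browser_take_screenshot")
    calls.append("browser_close")  # always close the browser last
    return calls
```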

Remember: The goal is to deliver clean, useful content to the user, not raw HTML/DOM dumps.

Score

Total Score

60/100

Based on repository quality metrics

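✓ SKILL.md

Includes a SKILL.md file

+20

○ LICENSE

A license is set

0/10

○ Description

Description is at least 100 characters long

0/10

✓ Popularity

100 or more GitHub stars

✓ Recent activity

Updated within the last month

+10

○ Forks

Forked 10 or more times

0/5

✓ Issue management

Fewer than 50 open issues

✓ Language

A programming language is set

✓ Tags

At least one tag is set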
