debug-scraping

by poodle64

Zero-infrastructure web scraping for the terminal


SKILL.md


---
name: debug-scraping
description: Diagnose and fix web scraping failures in Supacrawl. Use when scraping fails, returns empty content, times out, gets blocked by anti-bot protection, or produces unexpected results.
allowed-tools: Bash, Read, Grep, Glob, WebFetch
---

Debug Scraping Issues

Systematic diagnosis of web scraping failures in Supacrawl.

When This Skill Activates

  • Scraping returns empty or incomplete content
  • Timeout errors during page load
  • Anti-bot detection or CAPTCHA challenges
  • JavaScript content not rendering
  • Unexpected HTTP errors (403, 429, 503)
  • Content structure doesn't match expectations

Diagnostic Process

Step 1: Reproduce the Issue

First, reproduce with debug logging enabled:

SUPACRAWL_LOG_LEVEL=DEBUG supacrawl scrape "URL" --format markdown

Capture:

  • The exact error message
  • The URL being scraped
  • Any correlation ID in the error

Step 2: Categorise the Failure

Symptom                 | Likely Cause                       | Jump To
Empty markdown output   | JS not rendered, content in iframe | Step 3a
Timeout error           | Slow page, wait strategy wrong     | Step 3b
403/Access Denied       | Anti-bot detection                 | Step 3c
429 Too Many Requests   | Rate limiting                      | Step 3d
Connection refused      | Network/proxy issue                | Step 3e
Wrong content extracted | Selector/conversion issue          | Step 3f

Step 3a: JavaScript Rendering Issues

Symptoms: Empty content, "Loading..." text, missing dynamic elements

Diagnosis:

# Try with longer wait
supacrawl scrape "URL" --wait-for 5000

# Try networkidle wait strategy
supacrawl scrape "URL" --wait-until networkidle

Check:

  • Does the site require JavaScript? Compare the view-source HTML with the rendered DOM
  • Is content loaded via XHR/fetch after page load?
  • Is content in an iframe?

Fixes:

  • Increase --wait-for time for slow JS
  • Use --wait-until networkidle for XHR-heavy sites
  • Check if content is in iframe (Playwright won't cross iframe boundaries by default)
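
A quick way to check for iframes, reusing the rawHtml format from above (the grep pattern is a rough check, not exhaustive):

# List iframe tags and their src attributes in the fetched HTML
supacrawl scrape "URL" --format rawHtml | grep -io '<iframe[^>]*src="[^"]*"'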

Step 3b: Timeout Issues

Symptoms: "Timeout waiting for page", operation cancelled

Diagnosis:

# Check with extended timeout
supacrawl scrape "URL" --timeout 60000

Check:

  • Is the site actually slow or unresponsive?
  • Is there a redirect chain?
  • Is the network stable?

Fixes:

  • Increase timeout: --timeout 60000 (60 seconds)
  • Use --wait-until load instead of networkidle for sites that never go network-idle (constant polling or background requests keep connections busy)
  • Check for infinite redirect loops
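
To trace a redirect chain or spot a loop without the browser, curl can follow the hops and report a count:

# Follow redirects (up to 10) and print the hop count and final URL
curl -sIL --max-redirs 10 -o /dev/null -w 'redirects: %{num_redirects}, final: %{url_effective}\n' "URL"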

Step 3c: Anti-Bot Detection

Symptoms: 403 Forbidden, CAPTCHA page, "Access Denied", Cloudflare challenge

Diagnosis:

# Try with stealth mode
supacrawl scrape "URL" --stealth

# Check what the bot sees
supacrawl scrape "URL" --format rawHtml | head -100

Check:

  • Does the raw HTML show a CAPTCHA or challenge page? (see the grep below)
  • Is Cloudflare/Akamai/PerimeterX protection active?
  • Are browser fingerprints being detected?
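
Marker strings vary by vendor, but a grep over the raw HTML often gives a quick answer (common examples only, not a complete list):

# Look for typical challenge-page markers in the fetched HTML
supacrawl scrape "URL" --format rawHtml | grep -ioE 'captcha|cf-chl|challenge|access denied' | sort -u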

Fixes:

  • Enable stealth mode: --stealth (uses Patchright)
  • Slow down requests if scraping multiple pages
  • Some sites require human verification; these cannot be scraped automatically

Step 3d: Rate Limiting

Symptoms: 429 errors, temporary blocks, "Too Many Requests"

Diagnosis:

  • Check if error occurs on first request or after multiple
  • Check response headers for rate limit info
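
Header names vary by site, but this pattern covers the common ones:

# Inspect rate-limit and retry headers on a plain request
curl -sI "URL" | grep -iE '^(retry-after|x-ratelimit|ratelimit)'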

Fixes:

  • Add delays between requests when crawling (see the pacing sketch below)
  • Respect Retry-After headers
  • Reduce concurrency
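
A minimal pacing sketch, assuming one URL per line in urls.txt; the 2-second delay is an arbitrary starting point, not a recommendation:

# Scrape URLs sequentially with a fixed delay between requests
while read -r url; do
  supacrawl scrape "$url" --format markdown > "$(basename "$url").md"
  sleep 2  # tune to the target site's limits
done < urls.txt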

Step 3e: Network Issues

Symptoms: Connection refused, DNS resolution failed, SSL errors

Diagnosis:

# Test basic connectivity
curl -I "URL"

# Check DNS
dig domain.com

Check:

  • Is the site actually accessible?
  • Is there a proxy configuration issue?
  • Are there SSL certificate problems?

Fixes:

  • Verify URL is correct and site is up
  • Check proxy settings if using one
  • For SSL issues, check certificate validity
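
For the SSL case, openssl can print the certificate's validity window directly (replace domain.com with the target host):

# Show the certificate's notBefore/notAfter dates
openssl s_client -connect domain.com:443 -servername domain.com </dev/null 2>/dev/null | openssl x509 -noout -dates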

Step 3f: Content Extraction Issues

Symptoms: Content extracted but wrong/incomplete, formatting broken

Diagnosis:

# Get raw HTML to inspect
supacrawl scrape "URL" --format rawHtml > page.html

# Compare with markdown
supacrawl scrape "URL" --format markdown > page.md

Check:

  • Is the content present in raw HTML?
  • Is the markdown converter handling it correctly?
  • Are there encoding issues?

Fixes:

  • Check if only_main_content is excluding desired content
  • Look for unusual HTML structures that confuse the converter
  • Check for encoding issues in source
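
With page.html and page.md from the diagnosis commands, a quick check is whether a phrase you expect actually survives conversion ("expected phrase" is a placeholder):

# Count occurrences in the raw HTML vs the converted markdown
grep -c "expected phrase" page.html
grep -c "expected phrase" page.md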

Code Investigation

If the issue is in Supacrawl itself, investigate:

Component           | Location                               | Purpose
Browser management  | src/supacrawl/services/browser.py      | Playwright lifecycle, page fetching
Content extraction  | src/supacrawl/services/scrape.py       | Main scrape logic
Markdown conversion | src/supacrawl/services/converter.py    | HTML to Markdown
Stealth mode        | Uses Patchright instead of Playwright  | Anti-detection
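
The allowed Grep tool (or plain grep) can narrow down where a behaviour lives; for example, to find the wait-strategy handling (wait_until is a guess based on the --wait-until CLI flag, not a confirmed identifier):

# Search the services layer for wait-strategy references
grep -rn "wait_until" src/supacrawl/services/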

Common Patterns

Site Uses Heavy JavaScript

supacrawl scrape "URL" --wait-for 5000 --wait-until networkidle

Site Has Anti-Bot Protection

supacrawl scrape "URL" --stealth

Site Is Slow

supacrawl scrape "URL" --timeout 60000 --wait-until load

Need to Debug What the Browser Sees

supacrawl scrape "URL" --format screenshot --output debug.png

Escalation

If none of the above resolves the issue:

  1. Check GitHub Issues: Similar problem may be reported
  2. Capture debug output: run with SUPACRAWL_LOG_LEVEL=DEBUG and save the full output (see the example below)
  3. Test in real browser: Does the URL work in Chrome/Firefox?
  4. Create minimal reproduction: Single URL that demonstrates the issue
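
A capture recipe for item 2, assuming the debug logger writes to stderr (adjust if your logs appear on stdout):

# Save the scraped output and the debug log separately for the bug report
SUPACRAWL_LOG_LEVEL=DEBUG supacrawl scrape "URL" --format markdown > out.md 2> debug.log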
