pdf-reader

Name: pdf-reader
Rating: 55
Author: ngnnah


my remaining 2500 weeks

⭐ 1 · 🍴 0 · 📅 Jan 22, 2026

compound-interest consistency learn-by-teaching learn-in-public reflect review vision


SKILL.md

name: pdf-reader
description: Extract and read text content from PDF files. Use when working with PDF documents, extracting text, analyzing PDF content, or when user mentions reading PDFs. Requires PyPDF2 or pdfplumber packages.

PDF Reader

This skill helps you extract and read text content from PDF files using Python libraries.

When to use this skill

Use this skill when:

- Reading text content from PDF files
- Extracting specific pages from PDFs
- Analyzing PDF document structure
- Converting PDF text to plain text
- The user mentions "PDF", "read PDF", or "extract text from PDF"

Requirements

Install required packages:

pip install PyPDF2 pdfplumber
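
To confirm that both packages are importable before running the examples, a check along these lines may help. This is only a sketch: `missing_packages` is a hypothetical helper, not part of either library, and it assumes the import names match the package names given above.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed import names for the two packages installed above.
missing = missing_packages(["PyPDF2", "pdfplumber"])
if missing:
    print(f"Missing packages: run pip install {' '.join(missing)}")
```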

Quick start

Basic text extraction with PyPDF2

import PyPDF2

def read_pdf_pypdf2(pdf_path):
    """Extract all text from a PDF file using PyPDF2"""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        
        # Get number of pages
        num_pages = len(pdf_reader.pages)
        print(f"PDF has {num_pages} pages")
        
        # Extract text from all pages
        full_text = ""
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            text = page.extract_text()
            full_text += f"\n--- Page {page_num + 1} ---\n{text}"
        
        return full_text

# Usage
text = read_pdf_pypdf2("document.pdf")
print(text)

Advanced extraction with pdfplumber

pdfplumber often preserves text layout better than PyPDF2 and can also detect tables:

import pdfplumber

def read_pdf_pdfplumber(pdf_path):
    """Extract text with better formatting using pdfplumber"""
    with pdfplumber.open(pdf_path) as pdf:
        full_text = ""
        
        for i, page in enumerate(pdf.pages):
            # Extract text ("or" guards against a None result on pages without text)
            text = page.extract_text() or ""
            full_text += f"\n--- Page {i + 1} ---\n{text}\n"
            
            # Optionally extract tables
            tables = page.extract_tables()
            if tables:
                full_text += f"\n[Found {len(tables)} table(s) on page {i + 1}]\n"
        
        return full_text

# Usage
text = read_pdf_pdfplumber("document.pdf")
print(text)

Extract specific pages

def read_pdf_pages(pdf_path, page_numbers):
    """Extract text from specific pages only"""
    with pdfplumber.open(pdf_path) as pdf:
        text = ""
        for page_num in page_numbers:
            if 0 <= page_num < len(pdf.pages):
                page = pdf.pages[page_num]
                text += f"\n--- Page {page_num + 1} ---\n"
                # "or" avoids a TypeError if no text is returned for the page
                text += page.extract_text() or ""
            else:
                print(f"Warning: Page {page_num + 1} doesn't exist")
        return text

# Usage: Read pages 1, 3, and 5 (0-indexed: 0, 2, 4)
text = read_pdf_pages("document.pdf", [0, 2, 4])
print(text)
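
Because read_pdf_pages expects 0-indexed page numbers while people usually think in 1-indexed pages, a small converter can reduce off-by-one mistakes. The following is a minimal sketch; `parse_page_spec` and its "1,3,5-7" input format are assumptions made for illustration, not part of pdfplumber.

```python
def parse_page_spec(spec):
    """Turn a 1-indexed spec such as "1,3,5-7" into sorted 0-indexed page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. "5-7" covers pages 5, 6 and 7
            start, end = (int(n) for n in part.split("-", 1))
            if start > end:
                raise ValueError(f"Invalid range: {part}")
            pages.update(range(start - 1, end))
        else:
            pages.add(int(part) - 1)
    if any(p < 0 for p in pages):
        raise ValueError("Page numbers start at 1")
    return sorted(pages)


# Usage with the function above (requires a real PDF):
# text = read_pdf_pages("document.pdf", parse_page_spec("1,3,5"))
```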

Get PDF metadata

def get_pdf_info(pdf_path):
    """Get metadata and information about the PDF"""
    with pdfplumber.open(pdf_path) as pdf:
        info = {
            'num_pages': len(pdf.pages),
            'metadata': pdf.metadata,
        }
        
        # Get dimensions of first page
        if pdf.pages:
            first_page = pdf.pages[0]
            info['page_width'] = first_page.width
            info['page_height'] = first_page.height
    
    return info

# Usage
info = get_pdf_info("document.pdf")
print(f"Pages: {info['num_pages']}")
print(f"Title: {info['metadata'].get('Title', 'N/A')}")
print(f"Author: {info['metadata'].get('Author', 'N/A')}")

Common use cases

Search for text in PDF

def search_in_pdf(pdf_path, search_term):
    """Search for a term and return pages where it appears"""
    results = []
    
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            # "or" avoids an AttributeError if no text is returned for the page
            text = page.extract_text() or ""
            if search_term.lower() in text.lower():
                results.append({
                    'page': i + 1,
                    'text_snippet': text[:200]  # First 200 chars as preview
                })
    
    return results

# Usage
results = search_in_pdf("document.pdf", "important keyword")
for result in results:
    print(f"Found on page {result['page']}")

Extract tables from PDF

def extract_tables(pdf_path):
    """Extract all tables from PDF"""
    all_tables = []
    
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for j, table in enumerate(tables):
                all_tables.append({
                    'page': i + 1,
                    'table_number': j + 1,
                    'data': table
                })
    
    return all_tables

# Usage
tables = extract_tables("document.pdf")
for table_info in tables:
    print(f"Table {table_info['table_number']} from page {table_info['page']}")
    print(table_info['data'])
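
Each table returned by extract_tables() is a list of rows, and each row is a list of cell values, where empty cells may be None. To save a table, one option is to serialize it with the standard csv module. This is a minimal sketch; `table_to_csv` is a hypothetical helper.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table (rows of cells) to CSV text; None cells become empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Usage with the extract_tables function above (requires a real PDF):
# for table_info in extract_tables("document.pdf"):
#     print(table_to_csv(table_info['data']))
```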

Tips and best practices

Choose the right library:
- Use PyPDF2 for simple text extraction and PDF manipulation (PyPDF2 3.0.x is the final PyPDF2 release; development continues under the name pypdf)
- Use pdfplumber for better text extraction and table detection
- Use both if needed for different tasks

Handle errors gracefully:

try:
    text = read_pdf_pdfplumber("document.pdf")
except FileNotFoundError:
    print("PDF file not found")
except Exception as e:
    print(f"Error reading PDF: {e}")

Memory management: For large PDFs, process pages one at a time instead of loading all text at once.
Text quality: Some PDFs (especially scanned images) have no extractable text layer. Consider OCR tools such as pytesseract for those cases.
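
The memory-management tip can be sketched with a generator that yields one page of text at a time, so the caller never holds the whole document as a single string. The names `iter_page_text` and `write_text_incrementally` are hypothetical, and the sketch assumes only that each page object has an extract_text() method, as pdfplumber pages do.

```python
def iter_page_text(pages):
    """Yield (page_number, text) one page at a time; pages only need extract_text()."""
    for number, page in enumerate(pages, start=1):
        yield number, page.extract_text() or ""


def write_text_incrementally(pdf_path, out_path):
    """Stream a PDF's text to a file without building one large string."""
    import pdfplumber  # imported lazily so iter_page_text works without it

    with pdfplumber.open(pdf_path) as pdf, open(out_path, "w", encoding="utf-8") as out:
        for number, text in iter_page_text(pdf.pages):
            out.write(f"\n--- Page {number} ---\n{text}\n")


# Usage (requires a real PDF):
# write_text_incrementally("document.pdf", "document.txt")
```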

Troubleshooting

No text extracted: The PDF might be image-based. Use an OCR tool such as pytesseract.
Garbled text: Try pdfplumber instead of PyPDF2; it often handles formatting better.
Missing packages: Run pip install PyPDF2 pdfplumber
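
To narrow down the "No text extracted" case, it can help to list which pages yield no text at all. The check below is a heuristic sketch, not a definitive test for scanned pages: an empty result only suggests that a page may be image-based. `pages_without_text` is a hypothetical helper that works on any objects with an extract_text() method.

```python
def pages_without_text(pages):
    """Return 1-indexed numbers of pages whose extracted text is empty or whitespace."""
    return [
        number
        for number, page in enumerate(pages, start=1)
        if not (page.extract_text() or "").strip()
    ]


# Usage with pdfplumber (requires a real PDF):
# import pdfplumber
# with pdfplumber.open("document.pdf") as pdf:
#     empty = pages_without_text(pdf.pages)
#     if empty:
#         print(f"Pages with no text (possible OCR candidates): {empty}")
```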

Related skills

For PDF form filling: Consider creating a pdf-forms skill
For PDF merging/splitting: Consider creating a pdf-manipulation skill
For OCR on image PDFs: Consider using pytesseract with pdf2image

Score

Total Score

55/100

Based on repository quality metrics

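✓ SKILL.md: Includes a SKILL.md file (+20)
○ LICENSE: A license is set (0/10)
○ Description: Has a description of 100 or more characters (0/10)
○ Popularity: 100 or more GitHub stars (0/15)
✓ Recent activity: Updated within the last month (+10)
○ Forks: Forked 10 or more times (0/5)
✓ Issue management: Fewer than 50 open issues
✓ Language: A programming language is set
✓ Tags: At least one tag is set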


