xuzeyu91

office-to-md

by xuzeyu91

WebCode is a browser-based AI coding platform that lets you remotely run CLI assistants like Claude Code and Codex. Code anywhere with just a web browser. We’re also adding an office-assistant mode for planning, research/summaries, meeting notes, and drafting docs/emails—connected to your project context.

282🍴 25📅 Jan 23, 2026

SKILL.md


name: office-to-md
description: Convert Office documents (Word, Excel, PowerPoint, PDF) to Markdown format. ONLY use this skill when the user explicitly requests to CONVERT, TRANSFORM or PARSE a specific office file into Markdown. Do NOT trigger for general questions, documentation reading, or discussions about files.

Office Document to Markdown Converter

Convert various Office document formats to structured Markdown with text, table, and image extraction.

File Description

  • enhanced_parser.py - Core document parser
  • doc_converter.py - DOC to DOCX converter (requires LibreOffice)
  • requirements.txt - Python dependencies

Install Dependencies

pip install -r requirements.txt

Additional Dependencies for DOC Format

.doc format requires LibreOffice:

# Windows: install LibreOffice from the official website
# https://www.libreoffice.org/download/

# Linux (Debian/Ubuntu)
sudo apt install libreoffice

# macOS (Homebrew)
brew install --cask libreoffice

Quick Start

Python Code

from enhanced_parser import EnhancedDocumentParser

# Initialize parser
parser = EnhancedDocumentParser(
    image_base_url="http://localhost:5000",
    image_save_dir="./static/images",
    filter_headers_footers=True  # Filter headers and footers
)

# Parse document
result = parser.parse_document("document.docx")

if result["success"]:
    print(result["markdown"])
    print(f"Extracted {result['images_count']} images")

Start API Service

# Start service using app.py from project root
python app.py

# Visit http://localhost:5000/analyzer to upload files

Supported Formats

| Format | Extensions | Notes |
| --- | --- | --- |
| Word | .docx, .doc | .doc requires LibreOffice |
| Excel | .xlsx, .xls | Supports multiple worksheets and date formats |
| PowerPoint | .pptx | Extracts slide text and images |
| PDF | .pdf | Auto-detects tables and images |

Features

Word Documents

  • Automatic heading level detection
  • Convert tables to Markdown tables
  • Extract inline images
  • Filter headers and footers
  • Preserve list formatting
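
To make the table conversion concrete, here is a minimal sketch of rendering extracted rows as a Markdown pipe table. It is an illustration only: `rows_to_markdown_table` is a hypothetical helper, not a function from enhanced_parser.py, and it assumes the first row is the header.

```python
def rows_to_markdown_table(rows):
    """Render a list of rows (first row = header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Pad short rows, flatten newlines, and escape pipes so cells stay intact.
        cells = ["" if c is None else str(c) for c in row]
        cells += [""] * (width - len(cells))
        cells = [c.replace("\n", " ").replace("|", "\\|").strip() for c in cells]
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


print(rows_to_markdown_table([["Name", "Qty"], ["Widget", 3], ["A|B", None]]))
```

Escaping `|` inside cells matters because an unescaped pipe would be read as a column boundary by most Markdown renderers.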

Excel Workbooks

  • Support for multiple worksheets
  • Automatic date format detection (prevents display as numbers)
  • Convert to Markdown tables
  • Extract embedded images
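
Dates can show up as numbers because Excel stores them as serial day counts. The sketch below illustrates the underlying conversion for the 1900 date system; how enhanced_parser.py actually detects date cells is not shown in this document, and `excel_serial_to_datetime` is a hypothetical name.

```python
from datetime import datetime, timedelta

# Epoch for Excel's 1900 date system. It is correct for serials >= 61;
# earlier values are affected by Excel's fictitious 1900-02-29.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial number (days, fraction = time of day) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


print(excel_serial_to_datetime(43831).date())  # 2020-01-01
print(excel_serial_to_datetime(43831.5))       # 2020-01-01 12:00:00
```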

PowerPoint Presentations

  • Extract content by slide
  • Extract images and text boxes
  • Preserve slide order

PDF Documents

  • Auto-detect tables (line detection + text position detection)
  • Extract page images
  • Intelligently identify headings and lists
  • Output content in original order

Advanced Options

DOC Conversion

# Test LibreOffice configuration
python doc_converter.py

PDF Table Strategy

parser = EnhancedDocumentParser(
    pdf_table_strategy="lines_strict"  # Default: strict line detection, fastest
    # "lines": Normal line detection
    # "text": Based on text position, more accurate but slower
)

Image Processing

parser = EnhancedDocumentParser(
    image_base_url="https://your-domain.com",  # Image access URL
    image_save_dir="./static/images"           # Image save directory
)

Return Format

{
  "success": true,
  "markdown": "# Document Title\n\nContent...",
  "images_count": 2,
  "images": [
    {
      "filename": "uuid.png",
      "url": "http://localhost:5000/static/images/uuid.png",
      "size": 12345
    }
  ],
  "file_type": "docx",
  "file_info": {
    "name": "document.docx",
    "size": 45678,
    "paragraphs": 50,
    "tables": 3
  }
}
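
As a rough illustration of consuming this structure, the sketch below builds a one-line summary from a result dictionary shaped like the JSON above. `summarize_result` is a hypothetical helper, and the shape of a failed result is an assumption (only `success` is checked, and every other key is read defensively).

```python
def summarize_result(result):
    """Build a one-line summary from a parse result dictionary."""
    if not result.get("success"):
        return "conversion failed"
    info = result.get("file_info", {})
    # Sum the byte sizes reported for each extracted image.
    image_bytes = sum(img.get("size", 0) for img in result.get("images", []))
    return (
        f"{info.get('name', '?')} ({result.get('file_type', '?')}): "
        f"{len(result.get('markdown', ''))} chars of Markdown, "
        f"{result.get('images_count', 0)} image(s), {image_bytes} image bytes"
    )


sample = {
    "success": True,
    "markdown": "# Document Title\n\nContent...",
    "images_count": 2,
    "images": [{"filename": "uuid.png", "size": 12345}],
    "file_type": "docx",
    "file_info": {"name": "document.docx", "size": 45678},
}
print(summarize_result(sample))
```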

Common Issues

DOC Conversion Failed

  • Ensure LibreOffice is installed
  • Run python doc_converter.py to test configuration
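
If the test script is not at hand, a quick independent check is to see whether a LibreOffice launcher is on the PATH. This is a sketch under assumptions: it does not reproduce what doc_converter.py does internally, and both function names are hypothetical. The `--headless`, `--convert-to`, and `--outdir` options are standard LibreOffice command-line flags.

```python
import shutil


def find_libreoffice():
    """Return the path of a LibreOffice launcher found on PATH, or None."""
    for name in ("soffice", "libreoffice"):
        path = shutil.which(name)
        if path:
            return path
    return None


def build_convert_command(soffice, doc_path, out_dir):
    """Build the command line for a headless .doc -> .docx conversion."""
    return [soffice, "--headless", "--convert-to", "docx", "--outdir", out_dir, doc_path]


soffice = find_libreoffice()
if soffice is None:
    print("LibreOffice not found on PATH")
else:
    print(" ".join(build_convert_command(soffice, "legacy.doc", "./converted")))
```

On Windows the installer often does not add `soffice` to the PATH, so a `None` result there may only mean the install directory needs to be added manually.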

Dates Display as Numbers

  • Excel parsing automatically handles date formats
  • Ensure you're using the latest version of enhanced_parser.py

PDF Table Recognition Inaccurate

  • Try different pdf_table_strategy parameters
  • Use "lines_strict" for standard tables
  • Use "text" for complex tables
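
Trying the strategies one after another can be automated. The sketch below is a hedged example: `parse_pdf_with_fallback` is a hypothetical helper, the parser is passed in as a factory so the snippet stays self-contained, and "the Markdown contains a line starting with `|`" is only a heuristic for "a table was found".

```python
def parse_pdf_with_fallback(path, make_parser,
                            strategies=("lines_strict", "lines", "text")):
    """Try each strategy in order and return (strategy, result) for the first
    result whose Markdown appears to contain a table, else the last attempt."""
    last = (None, {"success": False})
    for strategy in strategies:
        result = make_parser(strategy).parse_document(path)
        last = (strategy, result)
        markdown = result.get("markdown", "")
        # Heuristic (assumption): Markdown table rows start with a pipe.
        has_table = any(line.lstrip().startswith("|") for line in markdown.splitlines())
        if result.get("success") and has_table:
            return last
    return last


# Intended usage with the documented constructor parameter:
# from enhanced_parser import EnhancedDocumentParser
# strategy, result = parse_pdf_with_fallback(
#     "report.pdf", lambda s: EnhancedDocumentParser(pdf_table_strategy=s))
```

The default order goes from the fastest strategy to the slowest, so the more expensive text-position detection only runs when line detection finds nothing.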

File Limitations

  • Maximum file size: 160MB
  • Supported extensions: docx, doc, pdf, xlsx, xls, pptx
  • Automatic cleanup of temporary files
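
These limits can be checked before a file is handed to the parser or uploaded. The following is a minimal sketch: `check_file` is a hypothetical helper, and it assumes "160MB" means 160 × 1024 × 1024 bytes, which the document does not specify.

```python
import os

MAX_BYTES = 160 * 1024 * 1024  # assumption: "160MB" read as 160 MiB
SUPPORTED = {"docx", "doc", "pdf", "xlsx", "xls", "pptx"}


def check_file(path):
    """Return None if the file looks acceptable, otherwise a short reason."""
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    if ext not in SUPPORTED:
        return f"unsupported extension: {ext or '(none)'}"
    if not os.path.isfile(path):
        return "file not found"
    if os.path.getsize(path) > MAX_BYTES:
        return "file exceeds 160MB"
    return None


print(check_file("notes.txt"))  # unsupported extension: txt
```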

Score

Total Score

85/100

Based on repository quality metrics

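  • SKILL.md: includes a SKILL.md file (+20)
  • LICENSE: a license is set (+10)
  • Description: has a description of at least 100 characters (+10)
  • Popularity: 100 or more GitHub stars (+5)
  • Recent activity: updated within the last month (+10)
  • Forks: forked 10 or more times (+5)
  • Issue management: fewer than 50 open issues (+5)
  • Language: a programming language is set (+5)
  • Tags: at least one tag is set (+5)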
