Back to list
opendataloader-project

bench-debug

by opendataloader-project

PDF Parsing for RAG — Convert to Markdown & JSON, Fast, Local, No GPU

826🍴 45📅 Jan 22, 2026

SKILL.md


name: bench-debug description: Debug specific document parsing failures

/bench-debug <doc_id>

Compares parsing output with ground-truth for a specific document and analyzes failure causes.

Usage

/bench-debug 01030000000189

Execution Steps

  1. Run benchmark for the specific document

    ./scripts/bench.sh --doc-id <doc_id>
    
  2. Compare files

    • Ground-truth: tests/benchmark/ground-truth/markdown/<doc_id>.md
    • Prediction: tests/benchmark/prediction/opendataloader/markdown/<doc_id>.md
    • Original PDF: tests/benchmark/pdfs/<doc_id>.pdf
  3. Analyze differences

    • Missing/extra text locations
    • Table structure differences (TEDS score causes)
    • Heading level mismatches (MHS score causes)
    • Reading order errors (NID score causes)
  4. Identify root causes

    • Which PDF elements caused the issue
    • Which Java core components are involved
  5. Suggest improvements

    • Java classes/methods that need modification
    • Expected impact scope

Reference Files

  • ground-truth/reference.json: Per-document element info (categories, coordinates, etc.)
  • java/opendataloader-pdf-core/: Core parsing logic

Example Output

Document 01030000000189 Analysis:

Overall: 0.2763 (one of the worst performing documents)

Issues:
1. 2 of 3 tables not detected (TEDS: 0.15)
   - Table boundary detection failed
   - Related code: TableDetector.java

2. Reading order errors (NID: 0.45)
   - Multi-column layout handling failed
   - Related code: ColumnDetector.java

Recommended Actions:
- Adjust clustering threshold in TableDetector
- Improve multi-column detection logic

Score

Total Score

80/100

Based on repository quality metrics

SKILL.md

SKILL.mdファイルが含まれている

+20
LICENSE

ライセンスが設定されている

+10
説明文

100文字以上の説明がある

0/10
人気

GitHub Stars 500以上

+10
最近の活動

1ヶ月以内に更新

+10
フォーク

10回以上フォークされている

+5
Issue管理

オープンIssueが50未満

+5
言語

プログラミング言語が設定されている

+5
タグ

1つ以上のタグが設定されている

+5

Reviews

💬

Reviews coming soon