
pdf-text-extractor
by WILLOSCAR
Research pipelines as semantic execution units: each skill declares inputs/outputs, acceptance criteria, and guardrails. Evidence-first methodology prevents hollow writing through structured intermediate artifacts.
SKILL.md
name: pdf-text-extractor
description: |
Download PDFs (when available) and extract plain text to support full-text evidence, writing papers/fulltext_index.jsonl and papers/fulltext/*.txt.
Trigger: PDF download, fulltext, extract text, papers/pdfs, 全文抽取, 下载PDF.
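Use when: queries.md sets evidence_mode: fulltext (or you explicitly need full-text evidence) and you want stronger evidence for paper notes/claims.
Skip if: evidence_mode: abstract (the default), or you do not want to download/extract (cost, permissions, time).
Network: fulltext downloads usually require network access (unless you manually provide cached PDFs in papers/pdfs/).
Guardrail: downloads are cached in papers/pdfs/; existing extracted text is not overwritten by default (unless re-extraction is explicitly requested).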
PDF Text Extractor
Optionally collect full-text snippets to deepen evidence beyond abstracts.
This skill is intentionally conservative: in many survey runs, abstract/snippet mode is enough and avoids heavy downloads.
Inputs
- papers/core_set.csv (expects paper_id, title, and ideally pdf_url / arxiv_id / url)
- Optional:
  - outline/mapping.tsv (to prioritize mapped papers)
Outputs
- papers/fulltext_index.jsonl (one record per attempted paper)
- Side artifacts:
  - papers/pdfs/<paper_id>.pdf (cached downloads)
  - papers/fulltext/<paper_id>.txt (extracted text)
Decision: evidence mode
queries.md can set evidence_mode: "abstract" | "fulltext".
- abstract (default template): do not download; write an index that clearly records skipping.
- fulltext: download PDFs (when possible) and extract text to papers/fulltext/.
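The mode decision above could be read from queries.md roughly as follows. This is a minimal sketch, not the script's actual parser: the helper name read_evidence_mode and the line-based regex are assumptions, and the fallback to "abstract" simply mirrors the default described above.

```python
import re

# Hypothetical helper: the real script may parse queries.md differently.
# Accepts lines such as `- evidence_mode: "fulltext"` or `evidence_mode: abstract`.
_MODE_RE = re.compile(
    r"""^\s*(?:[-*]\s*)?evidence_mode\s*:\s*["']?(abstract|fulltext)["']?\s*$""",
    re.IGNORECASE | re.MULTILINE,
)


def read_evidence_mode(queries_text: str, default: str = "abstract") -> str:
    """Return 'abstract' or 'fulltext'; fall back to the default when unset."""
    match = _MODE_RE.search(queries_text)
    return match.group(1).lower() if match else default
```

Falling back to abstract when the key is missing keeps the conservative behavior: no downloads happen unless fulltext is requested explicitly.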
Local PDFs Mode
When you cannot/should not download PDFs (restricted network, rate limits, no permission), provide PDFs manually and run in “local PDFs only” mode.
- PDF naming convention: papers/pdfs/<paper_id>.pdf, where <paper_id> matches papers/core_set.csv.
- Set evidence_mode: "fulltext" in queries.md.
- Run: python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only
If PDFs are missing, the script writes a to-do list:
- output/MISSING_PDFS.md (human-readable summary)
- papers/missing_pdfs.csv (machine-readable list)
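Before running in local-only mode, it can help to check which PDFs are still missing. The sketch below is an illustration under assumptions, not the script's own logic: find_missing_pdfs is a hypothetical helper, and it relies only on the paper_id column and the naming convention stated above.

```python
import csv
from pathlib import Path


def find_missing_pdfs(workspace: Path) -> list:
    """Return paper_ids from papers/core_set.csv that have no papers/pdfs/<paper_id>.pdf."""
    core_set = workspace / "papers" / "core_set.csv"
    pdf_dir = workspace / "papers" / "pdfs"
    missing = []
    with core_set.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            paper_id = (row.get("paper_id") or "").strip()
            # A paper counts as missing when its expected cache file is absent.
            if paper_id and not (pdf_dir / f"{paper_id}.pdf").is_file():
                missing.append(paper_id)
    return missing
```

For example, find_missing_pdfs(Path("<ws>")) returns the IDs you would still need to place under papers/pdfs/ by hand.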
Workflow (heuristic)
- Read papers/core_set.csv.
- If outline/mapping.tsv exists, prioritize mapped papers first.
- For each selected paper (fulltext mode):
  - resolve pdf_url (use pdf_url, else derive from arxiv_id / url when possible)
  - download to papers/pdfs/<paper_id>.pdf if missing
  - extract a reasonable prefix of text to papers/fulltext/<paper_id>.txt
  - append/update a JSONL record in papers/fulltext_index.jsonl with status + stats
- Never overwrite existing extracted text unless explicitly requested (delete the .txt to re-extract).
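The "resolve pdf_url" step might look like the sketch below. The function name resolve_pdf_url and the regex are assumptions made for illustration; only the column names (pdf_url, arxiv_id, url) come from the inputs listed earlier. The derived address uses arXiv's public https://arxiv.org/pdf/<id> form.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: pulls the identifier out of an arXiv abs/pdf URL.
_ARXIV_URL_RE = re.compile(
    r"arxiv\.org/(?:abs|pdf)/([^?#\s]+?)(?:\.pdf)?/?(?:[?#]|$)", re.IGNORECASE
)


def resolve_pdf_url(row: Dict[str, str]) -> Optional[str]:
    """Prefer an explicit pdf_url, else derive one from arxiv_id or an arXiv url."""
    pdf_url = (row.get("pdf_url") or "").strip()
    if pdf_url:
        return pdf_url
    arxiv_id = (row.get("arxiv_id") or "").strip()
    if not arxiv_id:
        match = _ARXIV_URL_RE.search((row.get("url") or "").strip())
        arxiv_id = match.group(1) if match else ""
    # None signals that no PDF location could be resolved for this paper.
    return f"https://arxiv.org/pdf/{arxiv_id}" if arxiv_id else None
```

Returning None rather than raising lets the caller record an unresolved paper in the index and move on to the next one.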
Quality checklist
- papers/fulltext_index.jsonl exists and is non-empty.
- If evidence_mode: "fulltext": at least a small but non-trivial subset has extracted text (strict mode blocks if extraction coverage is near-zero).
- If evidence_mode: "abstract": the index records clearly reflect skip status (no downloads attempted).
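The coverage check in this list can be approximated with a small counter over the index. This is a sketch under assumptions: the source only says each record carries "status + stats", so the "status" key and the "ok" value used here are hypothetical and may not match the real schema.

```python
import json


def extraction_coverage(index_lines, ok_status="ok"):
    """Return (ok_count, total) over the non-empty lines of a JSONL index.

    Assumes each record has a "status" field; the "ok" value is a placeholder.
    """
    total = 0
    ok = 0
    for line in index_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        total += 1
        if record.get("status") == ok_status:
            ok += 1
    return ok, total
```

Passing an open file handle for papers/fulltext_index.jsonl works, since the function only iterates over lines.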
Script
Quick Start
- python .codex/skills/pdf-text-extractor/scripts/run.py --help
- python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <workspace_dir>
All Options
- --max-papers <n>: cap number of papers processed (can be overridden by queries.md)
- --max-pages <n>: extract at most N pages per PDF
- --min-chars <n>: minimum extracted chars to count as OK
- --sleep <sec>: delay between downloads
- --local-pdfs-only: do not download; only use papers/pdfs/<paper_id>.pdf if present
- queries.md supports: evidence_mode, fulltext_max_papers, fulltext_max_pages, fulltext_min_chars
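For orientation, the flags listed above map naturally onto an argparse parser like the one below. This is an illustrative sketch, not the contents of run.py: the argument types and the None defaults are assumptions, since the source does not state the script's real defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a CLI exposing the documented flags (types/defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--workspace", required=True, help="workspace directory")
    parser.add_argument("--max-papers", type=int, default=None,
                        help="cap number of papers processed")
    parser.add_argument("--max-pages", type=int, default=None,
                        help="extract at most N pages per PDF")
    parser.add_argument("--min-chars", type=int, default=None,
                        help="minimum extracted chars to count as OK")
    parser.add_argument("--sleep", type=float, default=None,
                        help="delay between downloads, in seconds")
    parser.add_argument("--local-pdfs-only", action="store_true",
                        help="do not download; only use cached PDFs")
    return parser
```

Leaving the numeric defaults as None makes it easy to tell "flag not given" apart from an explicit value when merging with the queries.md settings.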
Examples
- Abstract mode (no downloads):
  - Set evidence_mode: "abstract" in queries.md, then run the script (it will emit papers/fulltext_index.jsonl with skip statuses)
- Fulltext mode with local PDFs only:
  - Set evidence_mode: "fulltext" in queries.md, put PDFs under papers/pdfs/, then run: python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only
- Fulltext mode with smaller budget:
  - python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --max-papers 20 --max-pages 4 --min-chars 1200
Notes
- Downloads are cached under papers/pdfs/; extracted text is cached under papers/fulltext/.
- The script does not overwrite existing extracted text unless you delete the .txt file.
Troubleshooting
Issue: no PDFs are available to download
Fix:
- Use evidence_mode: abstract (default) or provide local PDFs under papers/pdfs/ and rerun with --local-pdfs-only.
Issue: extracted text is empty/garbled
Fix:
- Try a different extraction backend if supported; otherwise mark the paper as abstract evidence level and avoid strong fulltext claims.
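One way to flag empty or garbled output before relying on it is a rough readability heuristic. The sketch below is an assumption, not the script's acceptance rule: looks_unusable is a hypothetical helper, the 1200-character floor is borrowed from the --min-chars example above rather than a known default, and the 0.5 ratio is arbitrary.

```python
def looks_unusable(text: str, min_chars: int = 1200, min_readable_ratio: float = 0.5) -> bool:
    """Heuristic: True when text is too short or mostly non-alphanumeric debris."""
    stripped = text.strip()
    if not stripped or len(stripped) < min_chars:
        return True
    # Count letters, digits, and whitespace as "readable" characters.
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return readable / len(stripped) < min_readable_ratio
```

Papers flagged this way are candidates for the abstract evidence level described in the fix above.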

