
dedupe-rank

by WILLOSCAR

Research pipelines as semantic execution units: each skill declares inputs/outputs, acceptance criteria, and guardrails. Evidence-first methodology prevents hollow writing through structured intermediate artifacts.

⭐ 83 · 🍴 10 · 📅 Jan 24, 2026

SKILL.md


---
name: dedupe-rank
description: |
  Dedupe and rank a raw paper set (papers/papers_raw.jsonl) to produce
  papers/papers_dedup.jsonl and papers/core_set.csv.
  Trigger: dedupe, rank, core set, 去重, 排序, 精选论文, 核心集合.
  Use when: after retrieval, to converge a broad-coverage set into a
  manageable core set (for taxonomy/outline/mapping).
  Skip if: a stable papers/core_set.csv has already been curated by hand
  (no need to churn it again).
  Network: none.
  Guardrail: lean deterministic; output should be repeatable (stable
  paper_id values, normalized fields).
---

Dedupe + Rank

Turn a broad retrieved set into a smaller core set for taxonomy/outline building.

This is a deterministic “curation” step: it should be stable and repeatable.

Input

  • papers/papers_raw.jsonl

Outputs

  • papers/papers_dedup.jsonl
  • papers/core_set.csv

Workflow (high level)

  1. Dedupe by normalized (title, year) and keep the richest metadata per duplicate cluster (a sketch follows this list).
  2. Rank by relevance/recency signals (and optionally pin known classics for certain topics). For LLM-agent topics, also ensure a small quota of prior surveys/reviews is present to support a paper-like Related Work section.
  3. Write papers/core_set.csv with stable paper_id values and useful metadata columns (arxiv_id, pdf_url, categories).
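
A minimal sketch of step 1, assuming each raw JSONL row carries at least title and year fields (the exact field names and the real selection logic in run.py may differ):

```python
import json
import re

def norm_key(title, year):
    """Normalize (title, year) into a dedupe key:
    lowercase, drop punctuation, collapse whitespace."""
    t = re.sub(r"[^a-z0-9 ]+", " ", (title or "").lower())
    return (re.sub(r"\s+", " ", t).strip(), str(year or ""))

def richness(row):
    """Score metadata richness: count non-empty fields."""
    return sum(1 for v in row.values() if v not in (None, "", []))

# Keep the richest row per duplicate cluster.
clusters = {}
with open("papers/papers_raw.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        key = norm_key(row.get("title"), row.get("year"))
        if key not in clusters or richness(row) > richness(clusters[key]):
            clusters[key] = row

with open("papers/papers_dedup.jsonl", "w", encoding="utf-8") as f:
    for row in clusters.values():
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```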

Quality checklist

  • papers/papers_dedup.jsonl exists and is valid JSONL.
  • papers/core_set.csv exists and has a header row (see the check below).
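
A quick pass over both checks (a sketch; paths are relative to the workspace root, and any malformed JSONL line raises immediately):

```python
import csv
import json

# Valid JSONL means every non-blank line parses as JSON.
with open("papers/papers_dedup.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]
print(f"papers_dedup.jsonl: {len(rows)} valid rows")

# The CSV must open and yield a header row.
with open("papers/core_set.csv", newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))
print("core_set.csv header:", header)
```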

Script

Quick Start

  • python .codex/skills/dedupe-rank/scripts/run.py --help
  • python .codex/skills/dedupe-rank/scripts/run.py --workspace <workspace_dir> --core-size 50

All Options

  • --core-size <n>: target size for papers/core_set.csv
  • queries.md also supports core_size / core_set_size / dedupe_core_size, which override the default when present (example below)
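
For instance, assuming a simple key: value convention (the actual queries.md syntax may differ), a line such as core_size: 25 in queries.md would override the default target size.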

Examples

  • Smaller core set for fast iteration:
    • python .codex/skills/dedupe-rank/scripts/run.py --workspace <ws> --core-size 25

Notes

  • This step may annotate the reason column of papers/core_set.csv with tags such as pinned_classic and prior_survey (deterministic, topic-aware guards for survey writing).
  • Systematic-review default: if the active pipeline is systematic-review and core_size is not specified, the script keeps the full deduped pool in papers/core_set.csv (so screening does not silently drop candidates).
  • This step is deterministic; reruns should be stable for the same inputs (one approach is sketched below).
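
One way to get stable paper_id values across reruns (a sketch, not necessarily what run.py does) is to derive the id from content rather than from row order:

```python
import hashlib

def stable_paper_id(norm_title: str, year: str) -> str:
    """Hash the normalized dedupe key so the same paper
    always receives the same id, run after run."""
    digest = hashlib.sha1(f"{norm_title}|{year}".encode("utf-8")).hexdigest()
    return "p" + digest[:10]

print(stable_paper_id("attention is all you need", "2017"))
```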

Troubleshooting

Common Issues

Issue: papers/core_set.csv is too small / empty

Symptom:

  • Core set has very few rows.

Causes:

  • Input papers/papers_raw.jsonl is small, or many rows are missing required fields.

Solutions:

  • Broaden retrieval (or provide a richer offline export) and rerun.
  • Lower --core-size only if you intentionally want a small core set.

Issue: Duplicates still appear after dedupe

Symptom:

  • Near-identical titles remain.

Causes:

  • Title normalization is defeated by noisy exports.

Solutions:

  • Clean the title fields in the export (strip prefixes/suffixes, fix encoding) and rerun, as sketched below.
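
A hedged cleanup pass for noisy titles (the stripped prefixes/suffixes here are illustrative; match them to whatever your export actually emits):

```python
import re
import unicodedata

def clean_title(raw: str) -> str:
    """Repair common export noise before dedupe runs."""
    t = unicodedata.normalize("NFKC", raw)  # fold width/encoding variants
    t = re.sub(r"^(title:|\[pdf\])\s*", "", t, flags=re.I)   # illustrative prefix
    t = re.sub(r"\s*\(arxiv[^)]*\)\s*$", "", t, flags=re.I)  # illustrative suffix
    return re.sub(r"\s+", " ", t).strip()

print(clean_title("Title: Attention Is All You Need (arXiv 1706.03762)"))
```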

Recovery Checklist

  • papers/papers_raw.jsonl lines contain title/year/url (a quick check follows this list).
  • papers/core_set.csv has stable paper_id values.
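
To run the first check quickly (a sketch; adjust the required fields if your schema differs):

```python
import json

REQUIRED = ("title", "year", "url")
with open("papers/papers_raw.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        row = json.loads(line)
        missing = [k for k in REQUIRED if not row.get(k)]
        if missing:
            print(f"line {i}: missing {missing}")
```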

Score

Total Score: 70/100

Based on repository quality metrics

  • SKILL.md: a SKILL.md file is included (+20)
  • LICENSE: a license is set (0/10)
  • Description: the description is 100+ characters (+10)
  • Popularity: 100+ GitHub stars (0/15)
  • Recent activity: updated within the last month (+10)
  • Forks: forked 10+ times (+5)
  • Issue management: fewer than 50 open issues (+5)
  • Language: a programming language is set (+5)
  • Tags: at least one tag is set (+5)
