| name | description |
|---|---|
| read-arxiv-paper | Read an arXiv paper from an arXiv URL by downloading the /src archive, unpacking and traversing the LaTeX project recursively, extracting key experimental tables and the main pipeline figure, and writing/overwriting a single Obsidian-compatible markdown note with YAML frontmatter. Use when asked to digest an arXiv paper. |
- arXiv URL (abs, pdf, or bare ID; any equivalent form is accepted)
- Optional:
- output_note_path (if provided, MUST update that note; do not create another)
- output_dir (default: ./papers)
- tag/slug (optional; default: arxiv_{id})
- use_title_as_filename (default: true)
- filename_max_len (default: 80)
- filename_case (default: "kebab") # kebab|snake|preserve
- reserved_char_replacement (default: "-")
- folder_layout (default: "paper_dir") # always: md + assets under same folder
Sanitize rules (paper_name):
- Prefer the title from TeX `\title{...}`; fall back to `arxiv_{id}` if no title is found or `use_title_as_filename=false`.
- Strip simple LaTeX commands (`\textbf{}`, `\emph{}`, etc.) and math (`$...$`) best-effort.
- Replace forbidden characters `\/:*?"<>|` and control chars with `reserved_char_replacement`.
- Collapse whitespace/separators to a single `-` (kebab) or `_` (snake).
- Trim leading/trailing `-`, `_`, `.`, and spaces.
- Truncate to `filename_max_len` characters.
- If the result becomes empty -> `arxiv_{id}`.
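A minimal Python sketch of these sanitize rules. The function name `sanitize_title` matches the title-derivation step below, but the exact regexes and the lowercasing for kebab/snake case are assumptions, not requirements.

```python
import re

def sanitize_title(raw_title: str, arxiv_id: str,
                   max_len: int = 80, case: str = "kebab",
                   replacement: str = "-") -> str:
    """Best-effort filename from a LaTeX \\title{...} string (sketch)."""
    name = raw_title
    # Strip simple LaTeX commands (\textbf{...}, \emph{...}, ...) keeping their argument.
    name = re.sub(r"\\[a-zA-Z]+\*?\s*\{([^{}]*)\}", r"\1", name)
    # Drop remaining bare commands and inline math.
    name = re.sub(r"\\[a-zA-Z]+\*?", "", name)
    name = re.sub(r"\$[^$]*\$", "", name)
    # Replace forbidden filename characters and control chars.
    name = re.sub(r'[\\/:*?"<>|]|[\x00-\x1f]', replacement, name)
    # Collapse whitespace/separators to a single separator (assumption: lowercase unless "preserve").
    sep = "_" if case == "snake" else "-"
    if case != "preserve":
        name = name.lower()
    name = re.sub(r"[\s\-_]+", sep, name)
    # Trim leading/trailing separators, dots, and spaces; truncate.
    name = name.strip("-_. ")[:max_len].strip("-_. ")
    return name or f"arxiv_{arxiv_id}"
```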
If output_note_path is NOT provided:
- paper_dir = {output_dir}/{paper_name}
- note_path = {paper_dir}/{paper_name}.md
- assets_dir = {paper_dir}/assets
If output_note_path IS provided:
- note_path = output_note_path (MUST write/update exactly this file)
- paper_dir = dirname(output_note_path)
- assets_dir = {paper_dir}/assets
All generated assets go under `assets_dir` unless explicitly overridden.
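A short sketch of this path derivation, assuming a Python implementation with `pathlib`; the helper name `derive_paths` is illustrative.

```python
from pathlib import Path

def derive_paths(paper_name: str, output_dir: str = "./papers",
                 output_note_path: str | None = None):
    """Return (note_path, paper_dir, assets_dir) per the rules above (sketch)."""
    if output_note_path:
        note_path = Path(output_note_path)   # MUST write/update exactly this file
        paper_dir = note_path.parent
    else:
        paper_dir = Path(output_dir) / paper_name
        note_path = paper_dir / f"{paper_name}.md"
    assets_dir = paper_dir / "assets"
    return note_path, paper_dir, assets_dir
```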
- extract_pipeline_figure (default: true)
- max_pipeline_figures (default: 1)
- figure_keywords (default: ["pipeline","framework","overview","architecture","method","model","proposed", "system","approach","workflow","diagram", "框架","总体","架构","方法","流程","系统","概览"])
- figure_output_dir (default: assets_dir) # ✅ unified default
- figure_fallback_pdf (default: true)
- figure_raster_dpi (default: 250)
- figure_min_width_px (default: 800)
- Extract the arxiv_id, allowing a vN version suffix if present.
- Support both new-style (XXXX.XXXXX) and old-style (e.g., cs/0601001) IDs.
- Build canonical URLs:
- abs_url: https://arxiv.org/abs/{arxiv_id}
- pdf_url: https://arxiv.org/pdf/{arxiv_id}.pdf
- src_url: https://arxiv.org/src/{arxiv_id}
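A hedged sketch of ID extraction and canonical URL construction in Python. The regex covers new-style (XXXX.XXXXX, optional vN) and old-style (archive/NNNNNNN) IDs; the name `canonical_urls` and the exact pattern are assumptions and may need more edge-case handling.

```python
import re

ARXIV_ID_RE = re.compile(
    r"(?:arxiv\.org/(?:abs|pdf|src)/)?"          # optional URL prefix
    r"(?P<id>\d{4}\.\d{4,5}(?:v\d+)?"            # new-style: 2301.00001[v2]
    r"|[a-z\-]+(?:\.[a-z]{2})?/\d{7}(?:v\d+)?)", # old-style: cs/0601001[v1]
    re.IGNORECASE,
)

def canonical_urls(url_or_id: str) -> dict:
    """Extract the arXiv id and build abs/pdf/src URLs (sketch)."""
    m = ARXIV_ID_RE.search(url_or_id)
    if not m:
        raise ValueError(f"could not extract an arXiv id from {url_or_id!r}")
    arxiv_id = m.group("id")
    return {
        "arxiv_id": arxiv_id,
        "abs_url": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf_url": f"https://arxiv.org/pdf/{arxiv_id}.pdf",
        "src_url": f"https://arxiv.org/src/{arxiv_id}",
    }
```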
- Cache file:
- ~/.cache/arxiv/src/{arxiv_id}.tar.gz
- If it exists, reuse it.
- Otherwise, download from src_url.
- If /src is unavailable, fall back to PDF-only reading (last resort).
- Unpack to:
- ~/.cache/arxiv/src/{arxiv_id}/
- Keep directory for incremental reads.
- Use safe extraction: prevent path traversal (e.g., disallow `../` escaping the target dir).
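A minimal sketch of path-traversal-safe extraction, assuming Python's `tarfile` and a gzipped tar archive; on Python 3.12+ the built-in `filter="data"` performs this check natively, so the manual loop below is a fallback for older versions.

```python
import tarfile
from pathlib import Path

def safe_extract(archive_path: str, target_dir: str) -> None:
    """Extract a .tar.gz while refusing members that escape target_dir (sketch)."""
    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            dest = (target / member.name).resolve()
            # Reject absolute paths and ../ escapes outside the target directory.
            if not dest.is_relative_to(target):
                raise ValueError(f"unsafe member path: {member.name}")
        tar.extractall(target)  # on Python >= 3.12, prefer tar.extractall(target, filter="data")
```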
Heuristics:
- Prefer file defining \title / \author / \begin{document}
- Common names: main.tex, paper.tex, manuscript.tex
- If multiple candidates, choose the one with \begin{document} and the most \input/\include.
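A sketch of how these heuristics might be scored, assuming Python; the weights and the helper name `find_entrypoint` are illustrative, not specified by the skill.

```python
import re
from pathlib import Path

def find_entrypoint(src_dir: str) -> Path | None:
    """Pick the most likely main .tex file using the heuristics above (sketch)."""
    best, best_score = None, float("-inf")
    for tex in Path(src_dir).rglob("*.tex"):
        text = tex.read_text(errors="ignore")
        score = 0
        score += 5 if r"\begin{document}" in text else 0
        score += 3 if re.search(r"\\title\s*\{", text) else 0
        score += 2 if re.search(r"\\author\s*[\[{]", text) else 0
        score += 2 if tex.name in ("main.tex", "paper.tex", "manuscript.tex") else 0
        score += len(re.findall(r"\\(?:input|include)\s*\{", text))  # more includes -> more likely the root
        if score > best_score:
            best, best_score = tex, score
    return best
```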
- Extract raw_title from the TeX entrypoint `\title{...}` (best-effort).
- If `use_title_as_filename=true` and raw_title exists: paper_name = sanitize_title(raw_title)
- Else: paper_name = `arxiv_{arxiv_id}`
- Derive paths using Derived Paths (single source of truth) above.
- Ensure directories exist:
- mkdir -p {paper_dir}
- mkdir -p {assets_dir}
- Set:
- figure_output_dir defaults to assets_dir (unless explicitly provided)
- Parse the entrypoint, recursively inlining:
- `\input{...}`, `\include{...}` (see the sketch after this list)
- Ignore figures/binaries unless needed.
- Capture:
- problem statement, contributions
- method & training details (data, objectives, architecture, compute)
- evaluation setup (datasets, metrics, baselines)
- key results, limitations, open questions
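The inlining sketch referenced above, assuming Python; the function name `inline_tex` and the cycle guard are illustrative details.

```python
import re
from pathlib import Path

INPUT_RE = re.compile(r"\\(?:input|include)\s*\{([^}]+)\}")

def inline_tex(tex_path: Path, project_root: Path, seen: set[Path] | None = None) -> str:
    """Recursively inline \\input/\\include so the paper reads as one string (sketch)."""
    seen = seen if seen is not None else set()
    tex_path = tex_path.resolve()
    if tex_path in seen:          # guard against include cycles
        return ""
    seen.add(tex_path)
    text = tex_path.read_text(errors="ignore")

    def replace(match: re.Match) -> str:
        name = match.group(1).strip()
        if not name.endswith(".tex"):
            name += ".tex"
        # Resolve relative to the including file first, then the project root.
        for base in (tex_path.parent, project_root):
            candidate = base / name
            if candidate.exists():
                return inline_tex(candidate, project_root, seen)
        return match.group(0)     # keep the command if the file is missing

    return INPUT_RE.sub(replace, text)
```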
- Locate environments:
- \begin{table}...\end{table}
- \begin{tabular}...\end{tabular}
- Extract caption if available (\caption{...})
- Convert common patterns to Markdown table:
- columns separated by `&` and rows ended by `\\`
- strip \hline, \toprule, \midrule, \bottomrule
- If conversion fails, include a fenced block with the raw LaTeX table and a short note about why it failed (e.g., multicolumn/multirow).
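A hedged sketch of the conversion, assuming Python regexes; it deliberately refuses `\multicolumn`/`\multirow` so the caller can fall back to the raw-LaTeX fenced block described above.

```python
import re

def tabular_to_markdown(latex: str) -> str:
    """Convert a simple tabular body to a Markdown table; raise on unsupported constructs (sketch)."""
    if re.search(r"\\multicolumn|\\multirow", latex):
        raise ValueError("multicolumn/multirow not supported; keep the raw LaTeX instead")
    # Strip rules and comments, then split rows on \\ and cells on &.
    body = re.sub(r"\\(?:hline|toprule|midrule|bottomrule)", "", latex)
    body = re.sub(r"(?<!\\)%.*", "", body)
    rows = [r.strip() for r in re.split(r"\\\\", body) if r.strip()]
    cells = [[c.strip() for c in row.split("&")] for row in rows]
    if not cells:
        raise ValueError("no table rows found")
    width = max(len(r) for r in cells)
    lines = ["| " + " | ".join(r + [""] * (width - len(r))) + " |" for r in cells]
    lines.insert(1, "|" + "---|" * width)   # header separator after the first row
    return "\n".join(lines)
```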
Create these tables if the information exists:
- Datasets / Benchmarks: | Dataset | Task | Split | Metric(s) | Notes |
- Main Results: | Dataset | Metric | Baseline | Ours | Δ |
- Training / Compute (if reported): | Item | Value |
Goal: locate the paper’s main “pipeline/framework/overview” figure from TeX source assets.
- Collect candidate figures
- Scan recursively-read TeX content for `\begin{figure}` / `\begin{figure*}` blocks.
- For each block, extract:
- caption_text from `\caption{...}` (best-effort)
- label_text from `\label{...}` (optional)
- include_paths from `\includegraphics[...]{path}` (may omit extension)
- Resolve image file paths
- For each include path, attempt to resolve it to an existing file in the unpacked source tree by trying the extensions `.pdf`, `.png`, `.jpg`, `.jpeg`, `.eps`, `.svg`.
- If the include path is relative, resolve against:
- the directory of the TeX file that contains it
- project root as fallback
- Score candidates (pick pipeline-like figure)
- Score:
- +5 for each keyword match in caption_text (case-insensitive)
- +3 for each keyword match in label_text
- +2 for each keyword match in filename
- +2 if the caption contains "overview/framework/architecture" (or their Chinese equivalents)
- -5 if the caption suggests an unrelated figure (e.g., "dataset", "qualitative results", "attention map"), unless it also matches pipeline keywords
- Prefer larger images when possible:
- If you can read image width, add + (width_px / 1000) capped at +3
- Discard candidates too small (width < figure_min_width_px) unless nothing else exists.
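A sketch of this scoring in Python; the negative-hint list and the function signature are illustrative.

```python
NEGATIVE_HINTS = ("dataset", "qualitative results", "attention map")

def score_figure(caption: str, label: str, filename: str,
                 keywords: list[str], width_px: int | None = None) -> float:
    """Score one figure candidate using the weights above (sketch)."""
    cap, lab, name = caption.lower(), label.lower(), filename.lower()
    kws = [k.lower() for k in keywords]
    score = 0.0
    score += 5 * sum(k in cap for k in kws)    # keyword matches in caption
    score += 3 * sum(k in lab for k in kws)    # ... in label
    score += 2 * sum(k in name for k in kws)   # ... in filename
    if any(k in cap for k in ("overview", "framework", "architecture")):
        score += 2
    if any(h in cap for h in NEGATIVE_HINTS) and not any(k in cap for k in kws):
        score -= 5
    if width_px is not None:
        score += min(width_px / 1000, 3)       # prefer larger images, capped at +3
    return score
```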
- Export to Obsidian assets
- If `extract_pipeline_figure=false`, skip this entire section.
- Ensure `figure_output_dir` exists.
- Choose up to `max_pipeline_figures` top-scoring candidates.
- Copy into: `{figure_output_dir}/pipeline_{arxiv_id}.{ext}`
- If the target exists, do NOT overwrite; add a suffix `_2`, `_3`, etc.
- Optional: Convert vector to PNG for Obsidian
- If the chosen asset is `.pdf`/`.eps`/`.svg` and conversion tools are available, export `{figure_output_dir}/pipeline_{arxiv_id}.png` using `figure_raster_dpi`.
- If conversion is not possible, keep the original and embed it as-is.
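A hedged sketch of the PDF case of this conversion, assuming Poppler's `pdftoppm` is on PATH; `.eps`/`.svg` would need other tools (e.g., Ghostscript or rsvg-convert), which is why the original is embedded as-is when conversion is not possible.

```python
import shutil
import subprocess
from pathlib import Path

def pdf_to_png(pdf_path: Path, out_png: Path, dpi: int = 250) -> bool:
    """Rasterize page 1 of a PDF figure to PNG; return False so the caller keeps the original (sketch)."""
    if shutil.which("pdftoppm") is None:
        return False                       # no conversion tool available
    prefix = out_png.with_suffix("")       # pdftoppm appends .png to this prefix
    subprocess.run(
        ["pdftoppm", "-png", "-singlefile", "-r", str(dpi), str(pdf_path), str(prefix)],
        check=True,
    )
    return out_png.exists()
```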
- Record figure metadata
- pipeline_figure_path (relative to the note, e.g., `assets/pipeline_XXXX.XXXXX.png`)
- pipeline_figure_caption (caption_text, if any)
- pipeline_figure_source (e.g., `TeX includegraphics from <tex_file>`)
Trigger the fallback when:
- no suitable TeX figure is found OR the included image files are missing,
- AND `figure_fallback_pdf=true`.
- Ensure PDF is available
- Download `{pdf_url}` to the cache if needed.
- Try image extraction from PDF (best-effort)
- Extract embedded images; filter out tiny ones using `figure_min_width_px` (see the sketch after this section).
- Pick the largest / most pipeline-like (keyword-based if possible; otherwise size-based).
- Save as: `{figure_output_dir}/pipeline_{arxiv_id}.png` (or `.jpg` if extracted as such)
- If extraction fails, render likely page(s) to PNG
- Search the PDF text for "Figure" plus any of the `figure_keywords`.
- Render the matching page(s) (a whole page is acceptable): `{figure_output_dir}/pipeline_{arxiv_id}_p{page}.png`
- Explicitly note that this is a page render if it is not cropped.
- Record metadata
- pipeline_figure_source = `PDF fallback (extracted image or rendered page)`
- pipeline_figure_caption = best-effort
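The embedded-image extraction sketch referenced above, assuming PyMuPDF (`fitz`) is installed; if it is not, only the page-render fallback applies. The helper name and output naming are illustrative.

```python
import fitz  # PyMuPDF; an assumption of this sketch, not a requirement of the skill

def extract_largest_image(pdf_path: str, out_stem: str, min_width_px: int = 800) -> str | None:
    """Pull embedded images, drop tiny ones, and save the widest candidate (sketch)."""
    doc = fitz.open(pdf_path)
    best = None  # (width, ext, bytes)
    for page in doc:
        for img in page.get_images(full=True):
            info = doc.extract_image(img[0])          # img[0] is the xref
            if info["width"] < min_width_px:
                continue                              # filter out tiny images
            if best is None or info["width"] > best[0]:
                best = (info["width"], info["ext"], info["image"])
    if best is None:
        return None
    out_path = f"{out_stem}.{best[1]}"                # png or jpg, as extracted
    with open(out_path, "wb") as fh:
        fh.write(best[2])
    return out_path
```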
Write/update EXACTLY note_path (derived above).
Default behavior: full rewrite (strong overwrite) for stability.
Optional preservation rule (only if you implement it):
- If the existing note contains `<!-- USER_NOTES_START -->...<!-- USER_NOTES_END -->`, preserve that block verbatim when rewriting; otherwise overwrite everything.
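A sketch of this optional preservation rule, assuming Python; re-appending the carried-over block at the end of the rewritten note is one possible interpretation, not the required behavior.

```python
import re
from pathlib import Path

NOTES_RE = re.compile(r"<!-- USER_NOTES_START -->.*?<!-- USER_NOTES_END -->", re.DOTALL)

def write_note(note_path: Path, new_body: str) -> None:
    """Full rewrite, but carry over an existing USER_NOTES block verbatim if present (sketch)."""
    if note_path.exists():
        match = NOTES_RE.search(note_path.read_text())
        if match:
            new_body = new_body.rstrip() + "\n\n" + match.group(0) + "\n"
    note_path.parent.mkdir(parents=True, exist_ok=True)
    note_path.write_text(new_body)
```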
Frontmatter fields (suggested):
- title: "..."
- authors: ["...", "..."]
- arxiv_id: "..."
- arxiv_url: "..."
- pdf_url: "..."
- published: YYYY-MM-DD
- updated: YYYY-MM-DD
- categories: ["cs.CV", ...]
- tags: ["paper/arxiv", "status/read", ...]
- status: "read"
- code: ""
- datasets: ["...", ...]
Frontmatter fields for pipeline figure (best-effort):
- pipeline_figure: "assets/pipeline_{arxiv_id}.png"
- pipeline_caption: "..."
- pipeline_source: "TeX includegraphics ... / PDF fallback ..."
Body sections (strict order):
- TL;DR (3-6 bullets)
- Key Contributions
- Method (with compact pseudocode or pipeline bullets)
- Pipeline Figure
- If pipeline_figure exists:
- `![[{pipeline_figure}]]`
- Caption: {pipeline_caption}
- Source: {pipeline_source}
- Experiments
- Datasets table
- Main results table(s)
- Ablations/Analysis tables (if present)
- Limitations & Caveats
- Concrete Implementation Ideas (2-5 actionable ideas)
- Open Questions / Follow-ups
- Citation (bibtex if available or arXiv cite line)
- Do not hallucinate numbers: only include metrics that appear in the source.
- When unsure, mark as "not reported" and keep the evidence snippet.
- Keep tables faithful; avoid “cleaning” that changes meaning.
- For pipeline figure:
- Do not claim it is the pipeline figure unless caption/keywords strongly indicate it.
- If only a whole-page render is available, explicitly note it is a page render.