@ebetica
Last active March 18, 2026 22:26

ESMCFold on val_filtered.lance (600K sequences) — lance_mapper benchmark

  • Date: 2026-03-18
  • Branch: zeming/lance-mapper
  • Script: claude_scratchpad/fold_val_test.py
  • Dataset: /bio/projects/es/zlin/esmc2_datasets/260312_uniref_seqonly/val_filtered.lance
  • Model: January training run of ESMCFold hero medium (24 blocks, 12 diffusion steps, no MSA, confidence-trained)
  • Checkpoint: conf_esmcfold_hero_medium_24blk_12diffu_no_msa_bs128_ctx512_mult2_noise1.1_step1.0_nodiffcond/epoch-0000-step-7000_cleaned.ckpt

Dataset

| Property | Value |
| --- | --- |
| Input rows | 601,760 |
| Schema | `id` (string), `sequence` (string), `cluster_rep_50`, `max_seq_id_vs_train` |
| Sharding | `rows_per_shard=1000` → 602 shards |
| Max sequence length | 1024 (longer sequences skipped) |
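As a sanity check on the sharding arithmetic, the shard count is just a ceiling division of the row count by `rows_per_shard` (a sketch; the function name is illustrative, the real shard assignment lives in `fold_val_test.py`):

```python
import math

def num_shards(total_rows: int, rows_per_shard: int) -> int:
    """Number of shards needed to cover all rows (the last shard may be partial)."""
    return math.ceil(total_rows / rows_per_shard)

print(num_shards(601_760, 1_000))  # 602, matching the table above
```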

Run Configuration

| Setting | Value |
| --- | --- |
| Workers | 16 GPUs (H100, dev QOS) |
| Batch sizing | Dynamic by seqlen: 32 (≤128), 16 (≤256), 8 (≤384), 4 (≤512), 2 (≤786), 1 (≤1024) |
| OOM handling | Fallback to one-by-one folding on batch OOM |
| Preemption | SIGUSR2 handler: flush results, then `scontrol requeue` |
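The dynamic batch sizing above amounts to a bucket lookup keyed on sequence length (a sketch with the thresholds copied from the table; the function name is hypothetical):

```python
# (max_seqlen, batch_size) pairs from the run configuration table
_BUCKETS = [(128, 32), (256, 16), (384, 8), (512, 4), (786, 2), (1024, 1)]

def batch_size_for(seqlen: int) -> int:
    """Batch size for a sequence of this length; 0 means skip (> 1024 residues)."""
    for max_len, bs in _BUCKETS:
        if seqlen <= max_len:
            return bs
    return 0  # sequences longer than 1024 residues are skipped

print(batch_size_for(100), batch_size_for(500), batch_size_for(2000))  # 32 4 0
```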

Output Schema

Each folded sequence produces:

  • id (string) — key column
  • ptm (float) — predicted TM-score
  • mean_plddt (float) — mean pLDDT
  • per_residue_plddt (binary) — npz-compressed float16 array
  • structure_blob (binary) — ESM structure blob
  • pdb_str (string) — PDB format string
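One plausible round-trip for the `per_residue_plddt` column, assuming the npz container holds a single float16 array (the array key `plddt` is a guess, not confirmed by the source):

```python
import io
import numpy as np

def encode_plddt(plddt: np.ndarray) -> bytes:
    """Pack a per-residue pLDDT vector as an npz-compressed float16 blob."""
    buf = io.BytesIO()
    np.savez_compressed(buf, plddt=plddt.astype(np.float16))
    return buf.getvalue()

def decode_plddt(blob: bytes) -> np.ndarray:
    """Recover the float16 pLDDT array from the binary column."""
    with np.load(io.BytesIO(blob)) as npz:
        return npz["plddt"]

roundtrip = decode_plddt(encode_plddt(np.array([92.5, 88.0, 71.25])))
print(roundtrip.dtype, roundtrip.tolist())  # float16 [92.5, 88.0, 71.25]
```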

Throughput

| Metric | Value |
| --- | --- |
| Effective throughput | ~20 rows/s aggregate across 16 GPUs |
| Per-GPU throughput | ~1.25 rows/s (varies heavily with sequence length) |
| Output rows | 581,613 / 601,760 (20,147 skipped: sequences > 1024 residues) |
| Output size | 37 GB across 602 parquet shards (25–99 MB each), merged to `result.lance` |
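Back-of-the-envelope check on these numbers: at ~20 rows/s aggregate, the full pass works out to roughly eight hours of wall clock, and the per-GPU rate follows directly:

```python
output_rows, agg_rows_per_sec, gpus = 581_613, 20, 16

wall_hours = output_rows / agg_rows_per_sec / 3600
print(f"{wall_hours:.1f} h wall clock, {agg_rows_per_sec / gpus:.2f} rows/s per GPU")
# 8.1 h wall clock, 1.25 rows/s per GPU
```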

Observations

  1. Long-tail shards: Static shard assignment caused 14/16 workers to finish early while 2 were stuck on long-sequence shards. Fixed by adding work stealing — workers that finish their assigned shards pick up remaining unfinished shards with .lock file coordination.

  2. Work stealing contention: Multiple workers raced on the same last shards, duplicating work. Fixed with .lock files containing the SLURM job+task ID, with stale lock cleanup via squeue checks.

Merge

The merge uses multithreaded `lance.fragment.write_fragments()` plus an atomic commit (following the `lance_dataset.py` pattern). It runs on the login node; no GPU needed.
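The shape of that merge is roughly the following (a sketch, not the actual script: fragments are written in parallel with no commit, then a single commit makes them all visible atomically; the function name and thread count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import lance
import pyarrow.parquet as pq

def merge_shards(shard_paths: list[str], out_uri: str) -> None:
    """Write all parquet shards as Lance fragments in parallel, then commit once."""
    def write_one(path: str):
        # Writes fragment files under out_uri but does NOT commit them
        return lance.fragment.write_fragments(pq.read_table(path), out_uri)

    with ThreadPoolExecutor(max_workers=8) as pool:
        frag_lists = list(pool.map(write_one, shard_paths))
    fragments = [frag for frags in frag_lists for frag in frags]

    # One atomic commit: readers see either the old dataset or all new fragments
    schema = pq.read_schema(shard_paths[0])
    lance.LanceDataset.commit(out_uri, lance.LanceOperation.Overwrite(schema, fragments))
```

The key property is that a crash mid-merge leaves no partially visible dataset: uncommitted fragment files are simply orphaned.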

Indexing

BTREE scalar index on the id column:

```python
ds = lance.dataset("result.lance")
ds.create_scalar_index("id", index_type="BTREE")
```
| Operation | Time |
| --- | --- |
| Index creation (581K rows) | 1.3 s |
| Single lookup | 16 ms |
| Batch 10 | 112 ms (11 ms/row) |
| Batch 100 | 221 ms (2.2 ms/row) |
| Batch 1000 | 2.0 s (2.0 ms/row) |

Extrapolated to 1B rows: index creation ~37 min; lookup times would be somewhat higher, but the BTREE keeps them sub-linear.
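The ~37 min figure follows from assuming index creation scales roughly linearly from the measured 1.3 s over 581K rows:

```python
measured_rows, measured_secs = 581_613, 1.3
target_rows = 1_000_000_000

# Linear-scaling assumption: time grows proportionally with row count
est_secs = measured_secs * target_rows / measured_rows
print(f"~{est_secs / 60:.0f} min")  # ~37 min, matching the estimate above
```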

Result Location

/bio/projects/es/zlin/atlas-folding/test-small.lance

Commands

```bash
# Full run
python claude_scratchpad/fold_val_test.py run \
    --input-dataset /bio/projects/es/zlin/esmc2_datasets/260312_uniref_seqonly/val_filtered.lance \
    --output ~/tmp/fold-val-test/result.lance \
    --num-workers 16 --qos dev

# Check progress
python claude_scratchpad/fold_val_test.py status --output ~/tmp/fold-val-test/result.lance

# Single shard smoke test
srun --gres=gpu:1 --mem=32G -c 12 --qos dev -t 1:00:00 \
    pixi run python claude_scratchpad/fold_val_test.py run_single \
        --input-dataset /bio/projects/es/zlin/esmc2_datasets/260312_uniref_seqonly/val_filtered.lance \
        --output ~/tmp/fold-val-test/result.lance --shard-ids 0

# Create index
python -c "import lance; ds = lance.dataset('result.lance'); ds.create_scalar_index('id', index_type='BTREE')"
```