@ebetica
Last active March 18, 2026 22:26

ESMCFold on val_filtered.lance (600K sequences) — lance_mapper benchmark

  • Date: 2026-03-18
  • Branch: zeming/lance-mapper
  • Script: claude_scratchpad/fold_val_test.py
  • Dataset: /bio/projects/es/zlin/esmc2_datasets/260312_uniref_seqonly/val_filtered.lance
  • Model: January training run of ESMCFold hero medium (24 blocks, 12 diffusion steps, no MSA, confidence-trained)
  • Checkpoint: conf_esmcfold_hero_medium_24blk_12diffu_no_msa_bs128_ctx512_mult2_noise1.1_step1.0_nodiffcond/epoch-0000-step-7000_cleaned.ckpt

Dataset

| Property | Value |
| --- | --- |
| Input rows | 601,760 |
| Schema | `id` (string), `sequence` (string), `cluster_rep_50`, `max_seq_id_vs_train` |
| Sharding | `rows_per_shard=1000` → 602 shards |
| Max sequence length | 1024 (longer sequences skipped) |
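As a sanity check on the sharding arithmetic, the shard count is just a ceiling division of the row count by `rows_per_shard` (a sketch; the function name is illustrative, the real shard assignment lives in `fold_val_test.py`):

```python
import math

def num_shards(total_rows: int, rows_per_shard: int) -> int:
    """Number of shards needed to cover all rows (the last shard may be partial)."""
    return math.ceil(total_rows / rows_per_shard)

print(num_shards(601_760, 1_000))  # 602, matching the table above
```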

Run Configuration

| Setting | Value |
| --- | --- |
| Workers | 16 GPUs (H100, dev QOS) |
| Batch sizing | Dynamic by seqlen: 32 (≤128), 16 (≤256), 8 (≤384), 4 (≤512), 2 (≤786), 1 (≤1024) |
| OOM handling | Fallback to one-by-one folding on batch OOM |
| Preemption | SIGUSR2 handler: flush results, then `scontrol requeue` |
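The dynamic batch sizing above amounts to a bucket lookup keyed on sequence length (a sketch with the thresholds copied from the table; the function name is hypothetical):

```python
# (max_seqlen, batch_size) pairs from the run configuration table
_BUCKETS = [(128, 32), (256, 16), (384, 8), (512, 4), (786, 2), (1024, 1)]

def batch_size_for(seqlen: int) -> int:
    """Batch size for a sequence of this length; 0 means skip (> 1024 residues)."""
    for max_len, bs in _BUCKETS:
        if seqlen <= max_len:
            return bs
    return 0  # sequences longer than 1024 residues are skipped

print(batch_size_for(100), batch_size_for(500), batch_size_for(2000))  # 32 4 0
```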

Output Schema

Each folded sequence produces:

  • id (string) — key column
  • ptm (float) — predicted TM-score
  • mean_plddt (float) — mean pLDDT
  • per_residue_plddt (binary) — npz-compressed float16 array
  • structure_blob (binary) — ESM structure blob
  • pdb_str (string) — PDB format string
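One plausible round-trip for the `per_residue_plddt` column, assuming the npz container holds a single float16 array (the array key `plddt` is a guess, not confirmed by the source):

```python
import io
import numpy as np

def encode_plddt(plddt: np.ndarray) -> bytes:
    """Pack a per-residue pLDDT vector as an npz-compressed float16 blob."""
    buf = io.BytesIO()
    np.savez_compressed(buf, plddt=plddt.astype(np.float16))
    return buf.getvalue()

def decode_plddt(blob: bytes) -> np.ndarray:
    """Recover the float16 pLDDT array from the binary column."""
    with np.load(io.BytesIO(blob)) as npz:
        return npz["plddt"]

roundtrip = decode_plddt(encode_plddt(np.array([92.5, 88.0, 71.25])))
print(roundtrip.dtype, roundtrip.tolist())  # float16 [92.5, 88.0, 71.25]
```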

Throughput

| Metric | Value |
| --- | --- |
| Effective throughput | ~20 rows/s aggregate across 16 GPUs |
| Per-GPU throughput | ~1.25 rows/s (varies heavily with sequence length) |
| Output rows | 581,613 / 601,760 (20,147 skipped: sequences > 1024 residues) |
| Output size | 37 GB across 602 parquet shards (25–99 MB each), merged to `result.lance` |
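Back-of-the-envelope check on these numbers: at ~20 rows/s aggregate, the full pass works out to roughly eight hours of wall clock, and the per-GPU rate follows directly:

```python
output_rows, agg_rows_per_sec, gpus = 581_613, 20, 16

wall_hours = output_rows / agg_rows_per_sec / 3600
print(f"{wall_hours:.1f} h wall clock, {agg_rows_per_sec / gpus:.2f} rows/s per GPU")
# 8.1 h wall clock, 1.25 rows/s per GPU
```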

Observations

  1. Long-tail shards: Static shard assignment caused 14/16 workers to finish early while 2 were stuck on long-sequence shards. Fixed by adding work stealing — workers that finish their assigned shards pick up remaining unfinished shards with .lock file coordination.

  2. Work stealing contention: Multiple workers raced on the same last shards, duplicating work. Fixed with .lock files containing the SLURM job+task ID, with stale lock cleanup via squeue checks.

Merge

The merge uses multithreaded `lance.fragment.write_fragments()` plus an atomic commit (following the `lance_dataset.py` pattern). It runs on the login node; no GPU needed.
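The shape of that merge is roughly the following (a sketch, not the actual script: fragments are written in parallel with no commit, then a single commit makes them all visible atomically; the function name and thread count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import lance
import pyarrow.parquet as pq

def merge_shards(shard_paths: list[str], out_uri: str) -> None:
    """Write all parquet shards as Lance fragments in parallel, then commit once."""
    def write_one(path: str):
        # Writes fragment files under out_uri but does NOT commit them
        return lance.fragment.write_fragments(pq.read_table(path), out_uri)

    with ThreadPoolExecutor(max_workers=8) as pool:
        frag_lists = list(pool.map(write_one, shard_paths))
    fragments = [frag for frags in frag_lists for frag in frags]

    # One atomic commit: readers see either the old dataset or all new fragments
    schema = pq.read_schema(shard_paths[0])
    lance.LanceDataset.commit(out_uri, lance.LanceOperation.Overwrite(schema, fragments))
```

The key property is that a crash mid-merge leaves no partially visible dataset: uncommitted fragment files are simply orphaned.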

Indexing

BTREE scalar index on the id column:

```python
ds = lance.dataset("result.lance")
ds.create_scalar_index("id", index_type="BTREE")
```
| Operation | Time |
| --- | --- |
| Index creation (581K rows) | 1.3 s |
| Single lookup | 16 ms |
| Batch 10 | 112 ms (11 ms/row) |
| Batch 100 | 221 ms (2.2 ms/row) |
| Batch 1000 | 2.0 s (2.0 ms/row) |

Extrapolated to 1B rows: index creation ~37 min; lookup times would be somewhat higher, but the BTREE keeps them sub-linear.
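The ~37 min figure follows from assuming index creation scales roughly linearly from the measured 1.3 s over 581K rows:

```python
measured_rows, measured_secs = 581_613, 1.3
target_rows = 1_000_000_000

# Linear-scaling assumption: time grows proportionally with row count
est_secs = measured_secs * target_rows / measured_rows
print(f"~{est_secs / 60:.0f} min")  # ~37 min, matching the estimate above
```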

Result Location

/bio/projects/es/zlin/atlas-folding/test-small.lance

Commands

```bash
# Full run
python claude_scratchpad/fold_val_test.py run \
    --input-dataset /bio/projects/es/zlin/esmc2_datasets/260312_uniref_seqonly/val_filtered.lance \
    --output ~/tmp/fold-val-test/result.lance \
    --num-workers 16 --qos dev

# Check progress
python claude_scratchpad/fold_val_test.py status --output ~/tmp/fold-val-test/result.lance

# Single shard smoke test
srun --gres=gpu:1 --mem=32G -c 12 --qos dev -t 1:00:00 \
    pixi run python claude_scratchpad/fold_val_test.py run_single \
        --input-dataset /bio/projects/es/zlin/esmc2_datasets/260312_uniref_seqonly/val_filtered.lance \
        --output ~/tmp/fold-val-test/result.lance --shard-ids 0

# Create index
python -c "import lance; ds = lance.dataset('result.lance'); ds.create_scalar_index('id', index_type='BTREE')"
```