boru-roylu / ddp_notes.md (created January 25, 2023, forked from TengdaHan/ddp_notes.md)

Multi-node-training on slurm with PyTorch

What's this?

  • A simple note on how to start multi-node training on the slurm scheduler with PyTorch.
  • Useful especially when the scheduler is so busy that you cannot get multiple GPUs allocated, or when you need more than 4 GPUs for a single job.
  • Requirement: you have to use PyTorch DistributedDataParallel (DDP) for this purpose; a minimal launch sketch follows this list.
  • Warning: you might need to refactor your own code.
  • Warning: you might be secretly condemned by your colleagues for using too many GPUs.
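
A minimal sketch of such a multi-node job script, assuming a hypothetical `train.py` that reads the usual slurm environment variables (`SLURM_PROCID`, `SLURM_NTASKS`) and calls `torch.distributed.init_process_group`; node counts, GPU counts, and resource flags are placeholders to adjust for your cluster:

```bash
#!/bin/bash
#SBATCH --job-name=ddp_example
#SBATCH --nodes=2                   # multi-node: 2 nodes in this sketch
#SBATCH --ntasks-per-node=4         # one task (= one DDP process) per GPU
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=1-00:00

# The first node in the allocation acts as the rendezvous host for DDP.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun launches one process per task; each process derives its rank and
# world size from SLURM_PROCID / SLURM_NTASKS before init_process_group.
srun python train.py
```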

Two main workarounds for mitigating the Hyak I/O issues

  • Containerize the job environment; Apptainer is recommended by the Hyak team both for speeding up Python startup time and for reproducibility.

  • Copy frequently used data to the /tmp dir on the node; as described by the Hyak team, /tmp has around 400 GB of isolated fast SSD storage, and loading/saving data there won't affect others' jobs or slow down Hyak (see the staging sketch after this list).
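
A minimal sketch of the /tmp workaround, assuming a hypothetical shared dataset path and a `train.py` that takes a `--data_dir` flag; the per-job subdirectory under /tmp is just a convention to avoid collisions with other jobs on the same node:

```bash
# Stage the dataset onto the node-local /tmp SSD before training.
LOCAL_DIR=/tmp/$USER/$SLURM_JOB_ID
mkdir -p "$LOCAL_DIR"
rsync -a /path/to/shared/dataset/ "$LOCAL_DIR/dataset/"   # hypothetical source path

# Train against the fast local copy instead of the shared filesystem.
python train.py --data_dir "$LOCAL_DIR/dataset"

# Clean up so the node-local disk doesn't fill up for later jobs.
rm -rf "$LOCAL_DIR"
```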

Build a container image on the GPU node

Use salloc to create an interactive session, e.g. `salloc -c 8 -p ckpt --time=5-00:00 -n 1 --mem=64G --gpus=a40:1`.
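
Once the interactive session starts on the GPU node, the image can be built and tested roughly as below; `env.def` and `env.sif` are hypothetical file names, not from the original notes:

```bash
# Build the image from an Apptainer definition file on the allocated GPU node.
apptainer build env.sif env.def

# Run the training script inside the container; --nv exposes the NVIDIA GPUs.
apptainer exec --nv env.sif python train.py
```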