- A simple note on how to start multi-node training with PyTorch under the Slurm scheduler.
- Especially useful when the scheduler is so busy that you cannot get multiple GPUs allocated on one node, or when you need more than 4 GPUs for a single job.
- Requirement: you have to use PyTorch DistributedDataParallel (DDP) for this purpose; a sketch of a launch script follows this list.
- Warning: you might need to refactor your own code.
- Warning: you might be secretly condemned by your colleagues for using too many GPUs.
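For concreteness, here is a minimal sketch of what the Slurm side of such a job can look like. It assumes a hypothetical `train.py` that is already refactored for DDP (i.e. it calls `torch.distributed.init_process_group` and wraps the model in `DistributedDataParallel`); the partition, GPU type, node count, and resource sizes below are illustrative only:

```bash
#!/bin/bash
#SBATCH --job-name=ddp-multinode
#SBATCH --partition=ckpt
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gpus-per-node=a40:4
#SBATCH --time=1-00:00

# Use the first node of the allocation as the rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# srun starts one torchrun launcher per node; torchrun then spawns one
# worker process per GPU and sets RANK / LOCAL_RANK / WORLD_SIZE for each.
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$head_node:29500" \
    --rdzv_id="$SLURM_JOB_ID" \
    train.py
```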
Some more tips:

- Containerizing the job environment: Apptainer is recommended by the Hyak team, both for speeding up Python startup time and for reproducibility (example after this list).
- Copying frequently used data to the `/tmp` dir on the node: as described by the Hyak team, `/tmp` has around 400GB of isolated, fast SSD storage, and loading/saving data there won't affect others' jobs or slow Hyak down (example after this list).
- Using `salloc` to create an interactive session, e.g. `salloc -c 8 -p ckpt --time=5-00:00 -n 1 --mem=64G --gpus=a40:1`.
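For the Apptainer tip, a minimal sketch of the two commands involved; the image and definition file names are placeholders:

```bash
# Build the container image once from a definition file
# (pytorch_env.def is a placeholder; describe your Python env in it).
apptainer build pytorch_env.sif pytorch_env.def

# Run the training script inside the container on a GPU node;
# --nv exposes the host's NVIDIA driver and GPUs to the container.
apptainer exec --nv pytorch_env.sif python train.py
```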
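For the `/tmp` tip, a sketch of staging data at the beginning of a job; the `/gscratch` source path is a placeholder, and `--data-dir` is a hypothetical flag of your own training script:

```bash
# Copy the dataset from shared storage to the node-local SSD once per job.
mkdir -p "/tmp/$USER"
rsync -a /gscratch/<your-lab>/datasets/my_dataset/ "/tmp/$USER/my_dataset/"

# Read from the fast local copy during training (hypothetical flag).
python train.py --data-dir "/tmp/$USER/my_dataset"
```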