Last active
January 23, 2023 08:37
Combining LMOD with DeepSpeed. As a bonus, also add a command to automatically generate a hostfile.
# If we open a session/job that's on a host that starts with gpu* (e.g. gpu512.dodrio.os),
# load PyTorch with CUDA and pdsh
# This makes sure that deepspeed/pdsh work in multi node settings
if [[ $(hostname) == gpu* ]]; then
    module load PyTorch/1.12.0-foss-2022a-CUDA-11.7.0
    module load pdsh/2.34-GCCcore-11.3.0
fi

# Automatically generates a hostfile for the current job in the current directory,
# containing each node name with its number of available GPUs, e.g.:
# gpu512.dodrio.os slots=4
# gpu513.dodrio.os slots=4
mkhostfile() {
    if [[ -v SLURM_JOB_NODELIST ]]; then
        local arr node n_gpus
        # ">" overwrites any hostfile left over from a previous job
        echo "# Automatically generated hostfile" > hostfile
        # Split the node list on commas; IFS is changed for this read only,
        # so the rest of the shell session keeps its default word splitting
        IFS=',' read -ra arr <<< "$SLURM_JOB_NODELIST"
        for node in "${arr[@]}"; do
            # nvidia-smi -L prints one line per GPU, so the line count is the GPU count
            n_gpus=$(PDSH_RCMD_TYPE=ssh pdsh -w "$node" nvidia-smi -L | wc -l)
            echo "$node slots=$n_gpus" >> hostfile
        done
    else
        echo "Error: SLURM_JOB_NODELIST environment variable is not set so cannot automatically create hostfile" >&2
        return 1
    fi
}
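
One caveat about the node list: Slurm often reports `SLURM_JOB_NODELIST` in a compressed form such as `gpu[512-513]`, which splitting on commas does not expand. The sketch below is one possible way to handle that, not part of the original gist. The helper name `expand_nodelist` is made up for illustration; it relies on `scontrol show hostnames` when `scontrol` is on the PATH and otherwise assumes a plain comma-separated list.

```shell
# Hypothetical helper (not in the original gist): print one hostname per line.
expand_nodelist() {
    if command -v scontrol > /dev/null 2>&1; then
        # scontrol expands compressed lists such as gpu[512-513]
        scontrol show hostnames "$1"
    else
        # Fallback: assumes a plain comma-separated list without brackets
        printf '%s\n' "$1" | tr ',' '\n'
    fi
}
```

Inside `mkhostfile`, the loop could then read the node names from `expand_nodelist "$SLURM_JOB_NODELIST"` instead of splitting the variable directly.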
Add the lines above to your `.bashrc`.

The first if-statement is necessary because DeepSpeed uses `pdsh`, so upon ssh'ing into a new host, the required modules are not yet loaded in Lmod there. By adding this check to our `.bashrc`, we make sure that in every new session on a node whose hostname starts with `gpu`, the right modules are loaded, even in a `pdsh` ssh session.

The second part is a function that automatically generates a hostfile for DeepSpeed in the format `<hostname> slots=<number of GPUs>`, for example `gpu512.dodrio.os slots=4`.
This can be useful to automate your jobs. You could for instance run this function in your PBS script.
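
As a rough sketch of what such automation might look like, the helper below sums the `slots=` values in a generated hostfile so a job script can log or check the total GPU count before launching. The name `total_slots` and the script `train.py` are placeholders for illustration and do not come from the gist; `--hostfile` is the DeepSpeed launcher option for passing a hostfile.

```shell
# Hypothetical helper: sum the slots= values in a DeepSpeed hostfile.
# Comment lines (starting with #) and empty lines are skipped.
total_slots() {
    awk '!/^#/ && NF { sub(/^slots=/, "", $2); total += $2 }
         END { print total + 0 }' "${1:-hostfile}"
}

# Possible use inside a job script (train.py is a placeholder):
#   mkhostfile
#   echo "Launching on $(total_slots hostfile) GPUs"
#   deepspeed --hostfile=hostfile train.py
```

The launch lines are left as comments because they depend on the cluster setup and on your own training script.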