@albestro
albestro / 00-NCCL_NET_PLUGIN.md
Last active July 8, 2025 16:06
cuSolverMP for Alps (GH200)

Does it hang if you disable aws-ofi-nccl usage in NCCL (NCCL_NET_PLUGIN=none)? The log you shared seems to indicate NCCL is getting stuck initializing and I wonder if it is due to something getting stuck in the aws-ofi-nccl initialization specifically.

When run on a single node (4 GPUs) with the plugin disabled, it hangs.

NCCL_NET_PLUGIN=none

CAL_LOG_LEVEL=2 NCCL_NET_PLUGIN=none NCCL_DEBUG=info CUSOLVERMP_FORCE_NCCL=1 srun -n 4 -u -o without-plugin-rank%t.txt ./mp_syevd -p 2 -q 2

Makeself-generated self-extracting installers (see https://makeself.io/), commonly recognized by their .run extension, are a popular way of packaging software.

Put simply, they embed in a single file both the extraction script and the tar archive containing the package. After extraction, a custom command can be executed (e.g. the actual installer).

Use-case

In some cases you only need a single file from the archive embedded in the .run, without fully extracting or installing the package. Makeself lets you pass options to the tar command it runs internally, which makes it possible to control the extraction process.
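
As a rough illustration of the idea, the sketch below reproduces the selective extraction with a plain tar archive, since that is what tar ends up doing once Makeself hands it the embedded payload. All file and directory names here (pkg, docs/README.txt, payload.bin, installer.run) are hypothetical, and the commented invocation at the end assumes the installer was built with a Makeself version that provides the --tar option.

```shell
#!/bin/sh
set -eu

# Scratch area with a tiny stand-in "package" (hypothetical contents).
workdir=$(mktemp -d)
mkdir -p "$workdir/pkg/docs" "$workdir/out"
echo "readme contents" > "$workdir/pkg/docs/README.txt"
echo "large payload"   > "$workdir/pkg/payload.bin"

# Build a compressed tar archive, similar in spirit to the payload
# that a .run file carries after its extraction script.
tar -C "$workdir/pkg" -czf "$workdir/pkg.tar.gz" .

# Extract only one member: tar reads the archive from stdin ("-f -")
# and unpacks just the path named on the command line.
gzip -dc "$workdir/pkg.tar.gz" | tar -C "$workdir/out" -xf - ./docs/README.txt

# Assumed equivalent on a real Makeself archive (hypothetical file name):
#   sh ./installer.run --tar xf ./docs/README.txt
# Listing the members first helps to get the exact path:
#   sh ./installer.run --tar tf
```

Only docs/README.txt ends up under the output directory; payload.bin is never written to disk, which is the point when the package is large or its installer should not run.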

Reference