Does it hang if you disable aws-ofi-nccl usage in NCCL (NCCL_NET_PLUGIN=none)? The log you shared seems to indicate NCCL is getting stuck initializing and I wonder if it is due to something getting stuck in the aws-ofi-nccl initialization specifically.
Run on a single node (4 GPUs), it hangs.
CAL_LOG_LEVEL=2 NCCL_NET_PLUGIN=none NCCL_DEBUG=info CUSOLVERMP_FORCE_NCCL=1 srun -n 4 -u -o without-plugin-rank%t.txt ./mp_syevd -p 2 -q 2