Does it hang if you disable aws-ofi-nccl usage in NCCL (NCCL_NET_PLUGIN=none)? The log you shared seems to indicate NCCL is getting stuck initializing and I wonder if it is due to something getting stuck in the aws-ofi-nccl initialization specifically.
Yes, it still hangs with NCCL_NET_PLUGIN=none. I ran on a single node (4 GPUs), first without the plugin and then with it; both runs hang.
Without the plugin (NCCL_NET_PLUGIN=none):
CAL_LOG_LEVEL=2 NCCL_NET_PLUGIN=none NCCL_DEBUG=info CUSOLVERMP_FORCE_NCCL=1 srun -n 4 -u -o without-plugin-rank%t.txt ./mp_syevd -p 2 -q 2
00-nccl-net-none-without-plugin-rank0.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005418:218560:218560 [0] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:16:21][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 1, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 2, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 3, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 1, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 2, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 3, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218560][Trace][cal_allreduce] UCC allreduce in-place
nid005418:218560:218560 [0] NCCL INFO cudaDriverVersion 12040
nid005418:218560:218560 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218560:218560 [0] NCCL INFO Comm config Blocking set to 1
nid005418:218560:218716 [0] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005418:218560:218716 [0] NCCL INFO NET/IB : No device found.
nid005418:218560:218716 [0] NCCL INFO NET/Socket : Using [0]nmn0:10.100.48.106<0> [1]hsn0:172.28.52.41<0> [2]hsn2:172.28.50.173<0> [3]hsn3:172.28.50.172<0> [4]hsn1:172.28.52.40<0>
nid005418:218560:218716 [0] NCCL INFO Using network Socket
slurmstepd: error: *** STEP 1251368.8 ON nid005418 CANCELLED AT 2025-07-04T12:16:39 ***
00-nccl-net-none-without-plugin-rank1.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:16:21][cal][218561][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:21][cal][218561][Trace][cal_recv] ucc_transport::recv() 1 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218561][Trace][cal_recv] ucc_transport::recv() 1 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218561][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218561][Trace][cal_comm_split] UCC allgather in-place
nid005418:218561:218561 [1] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:16:22][cal][218561][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218561][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218561][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218561][Trace][cal_send] ucc_transport::send() 1 -> 2, 8192 bytes, tag: 77
nid005418:218561:218561 [1] NCCL INFO cudaDriverVersion 12040
nid005418:218561:218561 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218561:218561 [1] NCCL INFO Comm config Blocking set to 1
nid005418:218561:218714 [1] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005418:218561:218714 [1] NCCL INFO NET/IB : No device found.
nid005418:218561:218714 [1] NCCL INFO NET/Socket : Using [0]nmn0:10.100.48.106<0> [1]hsn0:172.28.52.41<0> [2]hsn2:172.28.50.173<0> [3]hsn3:172.28.50.172<0> [4]hsn1:172.28.52.40<0>
nid005418:218561:218714 [1] NCCL INFO Using network Socket
00-nccl-net-none-without-plugin-rank2.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:16:21][cal][218562][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:21][cal][218562][Trace][cal_recv] ucc_transport::recv() 2 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218562][Trace][cal_recv] ucc_transport::recv() 2 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218562][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218562][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218562][Trace][cal_comm_split] UCC allgather in-place
nid005418:218562:218562 [2] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:16:22][cal][218562][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218562][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218562][Trace][cal_recv] ucc_transport::recv() 2 <- 1, 8192 bytes, tag: 77
nid005418:218562:218562 [2] NCCL INFO cudaDriverVersion 12040
nid005418:218562:218562 [2] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218562:218562 [2] NCCL INFO Comm config Blocking set to 1
nid005418:218562:218715 [2] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005418:218562:218715 [2] NCCL INFO NET/IB : No device found.
nid005418:218562:218715 [2] NCCL INFO NET/Socket : Using [0]nmn0:10.100.48.106<0> [1]hsn0:172.28.52.41<0> [2]hsn2:172.28.50.173<0> [3]hsn3:172.28.50.172<0> [4]hsn1:172.28.52.40<0>
nid005418:218562:218715 [2] NCCL INFO Using network Socket
00-nccl-net-none-without-plugin-rank3.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:16:21][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:21][cal][218563][Trace][cal_recv] ucc_transport::recv() 3 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218563][Trace][cal_recv] ucc_transport::recv() 3 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218563][Trace][cal_bcast] UCC bcast
nid005418:218563:218563 [3] NCCL INFO cudaDriverVersion 12040
nid005418:218563:218563 [3] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
nid005418:218563:218563 [3] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218563:218563 [3] NCCL INFO Comm config Blocking set to 1
nid005418:218563:218717 [3] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005418:218563:218717 [3] NCCL INFO NET/IB : No device found.
nid005418:218563:218717 [3] NCCL INFO NET/Socket : Using [0]nmn0:10.100.48.106<0> [1]hsn0:172.28.52.41<0> [2]hsn2:172.28.50.173<0> [3]hsn3:172.28.50.172<0> [4]hsn1:172.28.52.40<0>
nid005418:218563:218717 [3] NCCL INFO Using network Socket
With the plugin (no NCCL_NET_PLUGIN override on the command line):
CAL_LOG_LEVEL=2 NCCL_DEBUG=info CUSOLVERMP_FORCE_NCCL=1 srun -n 4 -u -o with-plugin-rank%t.txt ./mp_syevd -p 2 -q 2
00-nccl-net-none-with-plugin-rank0.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005418:218936:218936 [0] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:17:04][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 1, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 2, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 3, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 1, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 2, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 3, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218936][Trace][cal_allreduce] UCC allreduce in-place
nid005418:218936:218936 [0] NCCL INFO cudaDriverVersion 12040
nid005418:218936:218936 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218936:218936 [0] NCCL INFO Comm config Blocking set to 1
nid005418:218936:219060 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005418:218936:219060 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005418:218936:219060 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005418:218936:219060 [0] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005418:218936:219060 [0] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060
nid005418:218936:219060 [0] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005418:218936:219060 [0] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005418:218936:219060 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005418:218936:219060 [0] NCCL INFO NET/OFI Creating one domain per process
nid005418:218936:219060 [0] NCCL INFO NET/OFI Support for global registrations: false
nid005418:218936:219060 [0] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005418:218936:219060 [0] NCCL INFO Using network Libfabric
slurmstepd: error: *** STEP 1251368.9 ON nid005418 CANCELLED AT 2025-07-04T12:17:15 ***
00-nccl-net-none-with-plugin-rank1.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:17:04][cal][218937][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:04][cal][218937][Trace][cal_recv] ucc_transport::recv() 1 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218937][Trace][cal_recv] ucc_transport::recv() 1 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218937][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218937][Trace][cal_comm_split] UCC allgather in-place
nid005418:218937:218937 [1] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:17:05][cal][218937][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218937][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218937][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218937][Trace][cal_send] ucc_transport::send() 1 -> 2, 8192 bytes, tag: 77
nid005418:218937:218937 [1] NCCL INFO cudaDriverVersion 12040
nid005418:218937:218937 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218937:218937 [1] NCCL INFO Comm config Blocking set to 1
nid005418:218937:219059 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005418:218937:219059 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005418:218937:219059 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005418:218937:219059 [1] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005418:218937:219059 [1] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060
nid005418:218937:219059 [1] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005418:218937:219059 [1] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005418:218937:219059 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005418:218937:219059 [1] NCCL INFO NET/OFI Creating one domain per process
nid005418:218937:219059 [1] NCCL INFO NET/OFI Support for global registrations: false
nid005418:218937:219059 [1] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005418:218937:219059 [1] NCCL INFO Using network Libfabric
00-nccl-net-none-with-plugin-rank2.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:17:04][cal][218938][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:04][cal][218938][Trace][cal_recv] ucc_transport::recv() 2 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218938][Trace][cal_recv] ucc_transport::recv() 2 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218938][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218938][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218938][Trace][cal_comm_split] UCC allgather in-place
nid005418:218938:218938 [2] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:17:05][cal][218938][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218938][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218938][Trace][cal_recv] ucc_transport::recv() 2 <- 1, 8192 bytes, tag: 77
nid005418:218938:218938 [2] NCCL INFO cudaDriverVersion 12040
nid005418:218938:218938 [2] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218938:218938 [2] NCCL INFO Comm config Blocking set to 1
nid005418:218938:219047 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005418:218938:219047 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005418:218938:219047 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005418:218938:219047 [2] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005418:218938:219047 [2] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060
nid005418:218938:219047 [2] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005418:218938:219047 [2] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005418:218938:219047 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005418:218938:219047 [2] NCCL INFO NET/OFI Creating one domain per process
nid005418:218938:219047 [2] NCCL INFO NET/OFI Support for global registrations: false
nid005418:218938:219047 [2] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005418:218938:219047 [2] NCCL INFO Using network Libfabric
00-nccl-net-none-with-plugin-rank3.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:17:04][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:04][cal][218939][Trace][cal_recv] ucc_transport::recv() 3 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218939][Trace][cal_recv] ucc_transport::recv() 3 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218939][Trace][cal_bcast] UCC bcast
nid005418:218939:218939 [3] NCCL INFO cudaDriverVersion 12040
nid005418:218939:218939 [3] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
nid005418:218939:218939 [3] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218939:218939 [3] NCCL INFO Comm config Blocking set to 1
nid005418:218939:219082 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005418:218939:219082 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005418:218939:219082 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005418:218939:219082 [3] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005418:218939:219082 [3] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060
nid005418:218939:219082 [3] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005418:218939:219082 [3] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005418:218939:219082 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005418:218939:219082 [3] NCCL INFO NET/OFI Creating one domain per process
nid005418:218939:219082 [3] NCCL INFO NET/OFI Support for global registrations: false
nid005418:218939:219082 [3] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005418:218939:219082 [3] NCCL INFO Using network Libfabric
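In both runs the last NCCL message on every rank is "Using network ...", which suggests the stall comes after network selection rather than inside the plugin's own initialization. To compare many per-rank files quickly, a small helper along these lines may be useful. It is only a sketch: the function name `summarize_rank_log` and the regular expression are my own, and it assumes the `host:pid:tid [dev] NCCL INFO ...` line layout seen in the logs above (the bracketed number is the CUDA device index, which coincides with the rank on this single node).

```python
import re
import sys

# Matches NCCL debug lines such as
#   nid005418:218560:218716 [0] NCCL INFO Using network Socket
# and WARN lines that carry an extra "function:line" token before "NCCL".
NCCL_LINE = re.compile(r"^\S+:\d+:\d+ \[(\d+)\] (?:\S+ )?NCCL (INFO|WARN) (.*)$")

def summarize_rank_log(text):
    """Return (device, network, last_nccl_message) for one rank's log text.

    Any field is None if no matching line was found.
    """
    device = None
    network = None
    last_message = None
    for line in text.splitlines():
        match = NCCL_LINE.match(line.strip())
        if match is None:
            continue  # CAL trace lines, parameter dumps, slurm messages
        device = int(match.group(1))
        last_message = match.group(3)
        if last_message.startswith("Using network "):
            network = last_message[len("Using network "):]
    return device, network, last_message

if __name__ == "__main__":
    # Usage: python summarize.py without-plugin-rank*.txt with-plugin-rank*.txt
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            device, network, last_message = summarize_rank_log(handle.read())
        print(f"{path}: dev={device} network={network} last={last_message!r}")
```

If some rank's last message differs from the others, that rank is the first place to look; here all four ranks stop at the same point in both configurations.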