cuSolverMP for Alps (GH200)

Does it hang if you disable aws-ofi-nccl in NCCL (NCCL_NET_PLUGIN=none)? The log you shared seems to indicate that NCCL gets stuck during initialization, and I wonder whether the culprit is the aws-ofi-nccl initialization specifically.
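
One way to test that hypothesis in isolation from cuSolverMp and CAL would be a bare NCCL collective, e.g. via nccl-tests; a minimal sketch, assuming an MPI build of all_reduce_perf (from github.com/NVIDIA/nccl-tests) is available on this system:

NCCL_DEBUG=info srun -n 4 ./all_reduce_perf -b 8 -e 32M -f 2 -g 1

If this also hangs in ncclCommInitRank, the problem would be in NCCL/plugin initialization itself rather than in the cuSolverMp stack.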

When run on a single node (4 GPUs), it hangs.

NCCL_NET_PLUGIN=none

CAL_LOG_LEVEL=2 NCCL_NET_PLUGIN=none NCCL_DEBUG=info CUSOLVERMP_FORCE_NCCL=1 srun -n 4 -u -o without-plugin-rank%t.txt ./mp_syevd -p 2 -q 2
00-nccl-net-none-without-plugin-rank0.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005418:218560:218560 [0] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:16:21][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 1, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 2, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 3, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 1, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 2, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_send] ucc_transport::send() 0 -> 3, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218560][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218560][Trace][cal_allreduce] UCC allreduce in-place
nid005418:218560:218560 [0] NCCL INFO cudaDriverVersion 12040
nid005418:218560:218560 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218560:218560 [0] NCCL INFO Comm config Blocking set to 1
nid005418:218560:218716 [0] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005418:218560:218716 [0] NCCL INFO NET/IB : No device found.
nid005418:218560:218716 [0] NCCL INFO NET/Socket : Using [0]nmn0:10.100.48.106<0> [1]hsn0:172.28.52.41<0> [2]hsn2:172.28.50.173<0> [3]hsn3:172.28.50.172<0> [4]hsn1:172.28.52.40<0>
nid005418:218560:218716 [0] NCCL INFO Using network Socket
slurmstepd: error: *** STEP 1251368.8 ON nid005418 CANCELLED AT 2025-07-04T12:16:39 ***
00-nccl-net-none-without-plugin-rank1.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:16:21][cal][218561][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:21][cal][218561][Trace][cal_recv] ucc_transport::recv() 1 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218561][Trace][cal_recv] ucc_transport::recv() 1 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218561][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218561][Trace][cal_comm_split] UCC allgather in-place
nid005418:218561:218561 [1] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:16:22][cal][218561][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218561][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218561][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218561][Trace][cal_send] ucc_transport::send() 1 -> 2, 8192 bytes, tag: 77
nid005418:218561:218561 [1] NCCL INFO cudaDriverVersion 12040
nid005418:218561:218561 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218561:218561 [1] NCCL INFO Comm config Blocking set to 1
nid005418:218561:218714 [1] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005418:218561:218714 [1] NCCL INFO NET/IB : No device found.
nid005418:218561:218714 [1] NCCL INFO NET/Socket : Using [0]nmn0:10.100.48.106<0> [1]hsn0:172.28.52.41<0> [2]hsn2:172.28.50.173<0> [3]hsn3:172.28.50.172<0> [4]hsn1:172.28.52.40<0>
nid005418:218561:218714 [1] NCCL INFO Using network Socket
00-nccl-net-none-without-plugin-rank2.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:16:21][cal][218562][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:21][cal][218562][Trace][cal_recv] ucc_transport::recv() 2 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218562][Trace][cal_recv] ucc_transport::recv() 2 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218562][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218562][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218562][Trace][cal_comm_split] UCC allgather in-place
nid005418:218562:218562 [2] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:16:22][cal][218562][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218562][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218562][Trace][cal_recv] ucc_transport::recv() 2 <- 1, 8192 bytes, tag: 77
nid005418:218562:218562 [2] NCCL INFO cudaDriverVersion 12040
nid005418:218562:218562 [2] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218562:218562 [2] NCCL INFO Comm config Blocking set to 1
nid005418:218562:218715 [2] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005418:218562:218715 [2] NCCL INFO NET/IB : No device found.
nid005418:218562:218715 [2] NCCL INFO NET/Socket : Using [0]nmn0:10.100.48.106<0> [1]hsn0:172.28.52.41<0> [2]hsn2:172.28.50.173<0> [3]hsn3:172.28.50.172<0> [4]hsn1:172.28.52.40<0>
nid005418:218562:218715 [2] NCCL INFO Using network Socket
00-nccl-net-none-without-plugin-rank3.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:16:21][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:21][cal][218563][Trace][cal_recv] ucc_transport::recv() 3 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218563][Trace][cal_recv] ucc_transport::recv() 3 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:16:21][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218563][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:16:22][cal][218563][Trace][cal_bcast] UCC bcast
nid005418:218563:218563 [3] NCCL INFO cudaDriverVersion 12040
nid005418:218563:218563 [3] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
nid005418:218563:218563 [3] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218563:218563 [3] NCCL INFO Comm config Blocking set to 1
nid005418:218563:218717 [3] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005418:218563:218717 [3] NCCL INFO NET/IB : No device found.
nid005418:218563:218717 [3] NCCL INFO NET/Socket : Using [0]nmn0:10.100.48.106<0> [1]hsn0:172.28.52.41<0> [2]hsn2:172.28.50.173<0> [3]hsn3:172.28.50.172<0> [4]hsn1:172.28.52.40<0>
nid005418:218563:218717 [3] NCCL INFO Using network Socket

unset NCCL_NET_PLUGIN

CAL_LOG_LEVEL=2 NCCL_DEBUG=info CUSOLVERMP_FORCE_NCCL=1 srun -n 4 -u -o with-plugin-rank%t.txt ./mp_syevd -p 2 -q 2
00-nccl-net-none-with-plugin-rank0.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005418:218936:218936 [0] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:17:04][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 1, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 2, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 3, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 1, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 2, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_send] ucc_transport::send() 0 -> 3, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218936][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218936][Trace][cal_allreduce] UCC allreduce in-place
nid005418:218936:218936 [0] NCCL INFO cudaDriverVersion 12040
nid005418:218936:218936 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218936:218936 [0] NCCL INFO Comm config Blocking set to 1
nid005418:218936:219060 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005418:218936:219060 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005418:218936:219060 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005418:218936:219060 [0] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005418:218936:219060 [0] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005418:218936:219060 [0] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005418:218936:219060 [0] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005418:218936:219060 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005418:218936:219060 [0] NCCL INFO NET/OFI Creating one domain per process
nid005418:218936:219060 [0] NCCL INFO NET/OFI Support for global registrations: false
nid005418:218936:219060 [0] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005418:218936:219060 [0] NCCL INFO Using network Libfabric
slurmstepd: error: *** STEP 1251368.9 ON nid005418 CANCELLED AT 2025-07-04T12:17:15 ***
00-nccl-net-none-with-plugin-rank1.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:17:04][cal][218937][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:04][cal][218937][Trace][cal_recv] ucc_transport::recv() 1 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218937][Trace][cal_recv] ucc_transport::recv() 1 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218937][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218937][Trace][cal_comm_split] UCC allgather in-place
nid005418:218937:218937 [1] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:17:05][cal][218937][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218937][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218937][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218937][Trace][cal_send] ucc_transport::send() 1 -> 2, 8192 bytes, tag: 77
nid005418:218937:218937 [1] NCCL INFO cudaDriverVersion 12040
nid005418:218937:218937 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218937:218937 [1] NCCL INFO Comm config Blocking set to 1
nid005418:218937:219059 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005418:218937:219059 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005418:218937:219059 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005418:218937:219059 [1] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005418:218937:219059 [1] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005418:218937:219059 [1] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005418:218937:219059 [1] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005418:218937:219059 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005418:218937:219059 [1] NCCL INFO NET/OFI Creating one domain per process
nid005418:218937:219059 [1] NCCL INFO NET/OFI Support for global registrations: false
nid005418:218937:219059 [1] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005418:218937:219059 [1] NCCL INFO Using network Libfabric
00-nccl-net-none-with-plugin-rank2.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:17:04][cal][218938][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:04][cal][218938][Trace][cal_recv] ucc_transport::recv() 2 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218938][Trace][cal_recv] ucc_transport::recv() 2 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218938][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218938][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218938][Trace][cal_comm_split] UCC allgather in-place
nid005418:218938:218938 [2] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
[2025-07-04 12:17:05][cal][218938][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218938][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218938][Trace][cal_recv] ucc_transport::recv() 2 <- 1, 8192 bytes, tag: 77
nid005418:218938:218938 [2] NCCL INFO cudaDriverVersion 12040
nid005418:218938:218938 [2] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218938:218938 [2] NCCL INFO Comm config Blocking set to 1
nid005418:218938:219047 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005418:218938:219047 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005418:218938:219047 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005418:218938:219047 [2] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005418:218938:219047 [2] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005418:218938:219047 [2] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005418:218938:219047 [2] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005418:218938:219047 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005418:218938:219047 [2] NCCL INFO NET/OFI Creating one domain per process
nid005418:218938:219047 [2] NCCL INFO NET/OFI Support for global registrations: false
nid005418:218938:219047 [2] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005418:218938:219047 [2] NCCL INFO Using network Libfabric
00-nccl-net-none-with-plugin-rank3.txt
Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
[2025-07-04 12:17:04][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:04][cal][218939][Trace][cal_recv] ucc_transport::recv() 3 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218939][Trace][cal_recv] ucc_transport::recv() 3 <- 0, 32768 bytes, tag: 0
[2025-07-04 12:17:04][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218939][Trace][cal_comm_split] UCC allgather in-place
[2025-07-04 12:17:05][cal][218939][Trace][cal_bcast] UCC bcast
nid005418:218939:218939 [3] NCCL INFO cudaDriverVersion 12040
nid005418:218939:218939 [3] NCCL INFO Bootstrap : Using nmn0:10.100.48.106<0>
nid005418:218939:218939 [3] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005418:218939:218939 [3] NCCL INFO Comm config Blocking set to 1
nid005418:218939:219082 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005418:218939:219082 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005418:218939:219082 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005418:218939:219082 [3] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005418:218939:219082 [3] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005418:218939:219082 [3] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005418:218939:219082 [3] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005418:218939:219082 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005418:218939:219082 [3] NCCL INFO NET/OFI Creating one domain per process
nid005418:218939:219082 [3] NCCL INFO NET/OFI Support for global registrations: false
nid005418:218939:219082 [3] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005418:218939:219082 [3] NCCL INFO Using network Libfabric

Do you have a stack trace from the hang that you can share? Would be useful to see where the workload is getting stuck.

Stack traces collected with Linaro Forge by pausing execution after a reasonable amount of time.

Single-node run (4 GPUs).

CUSOLVERMP_FORCE_NCCL=1 srun -n 4 ddt-client ./mp_syevd -p 2 -q 2

Below are the stack traces of the 4 ranks at the moment execution was paused (an equivalent Forge-free collection method is sketched after these traces).

#17 main (argc=5, argv=0xffffffffbe48) at /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/mp_syevd.c:414 (at 0x403990)
#16 cusolverMpSyevd () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002daf4eb4)
#15 cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002dafa6bc)
#14 cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002dad444c)
#13 cal_bcast () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcal-linux-sbsa-0.4.4.50_cuda12-archive/lib/libcal.so.0 (at 0x40000254ee90)
#12 UCCCollImpl::bcast(void*, unsigned long, cudaDataType_t, int, CUstream_st*, cal_memory_type_t) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcal-linux-sbsa-0.4.4.50_cuda12-archive/lib/libcal.so.0 (at 0x4000025796d0)
#11 ucc_collective_init (coll_args=0xffffffff95f8, request=0x241b6b8, team=0x20609f0) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/core/ucc_coll.c:234 (at 0x400034c80728)
#10 ucc_coll_init (map=<optimized out>, bargs=0xffffffff93a0, bargs@entry=0xffffffff93a0, task=0xffffffff9380, task@entry=0xffffffff9380) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/coll_score/ucc_coll_score_map.c:130 (at 0x400034c8825c)
#9 ucc_tl_nccl_coll_init (coll_args=0xffffffff93a0, team=<optimized out>, task_h=0xffffffff9380) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_team.c:238 (at 0x400258525aac)
#8 ucc_tl_nccl_init_task (coll_args=0xffffffff93a0, coll_args@entry=0xffffffff93a0, team=0x2064dd0, coll_task=0xffffffff92e8, coll_task@entry=0xffffffff92e8) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_coll.c:148 (at 0x400258526830)
#7 ucc_tl_nccl_comm_init (team=0x2064dd0, team@entry=0x2064dd0) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_team.c:152 (at 0x400258525f34)
#6 ncclCommInitRankConfig (newcomm=0x2064f00, newcomm@entry=0x2064f00, nranks=2, nranks@entry=2, commId={...}, myrank=1, myrank@entry=1, config=0xffffffff9240, config@entry=0xffffffff9240) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/init.cc:1798 (at 0x4002585aa2fc)
#5 ncclGroupEndInternal (simInfo=0x0, simInfo@entry=0x0) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:546 (at 0x4002585a2da0)
#4 groupLaunch (job_=0x241d020, simInfo=0x0, simInfo@entry=0x0) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:420 (at 0x4002585a1c70)
#3 asyncJobLaunch (asyncJobsMain=0x241d010, asyncJobsMain@entry=0x241d010, groupAbortFlag=0x241d0d8, groupAbortFlag@entry=0x241d0d8) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:370 (at 0x4002585a187c)
#2 usleep () from /lib64/libc.so.6 (at 0x4000342d94b4)
#1 nanosleep () from /lib64/libc.so.6 (at 0x4000342aff94)
#0 clock_nanosleep@@GLIBC_2.17 () from /lib64/libc.so.6 (at 0x4000342aa174)

#17 main (argc=5, argv=0xffffffffbe48) at /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/mp_syevd.c:414 (at 0x403990)
#16 cusolverMpSyevd () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002daf4eb4)
#15 cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002dafa6bc)
#14 cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002dad52bc)
#13 cal_allreduce () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcal-linux-sbsa-0.4.4.50_cuda12-archive/lib/libcal.so.0 (at 0x40000254fa80)
#12 UCCCollImpl::allreduce(void const*, void*, unsigned long, cudaDataType_t, cal_op_t, CUstream_st*, cal_memory_type_t) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcal-linux-sbsa-0.4.4.50_cuda12-archive/lib/libcal.so.0 (at 0x400002578c00)
#11 ucc_collective_init (coll_args=0xffffffff95d8, request=0x2405f98, team=0x1fd6e00) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/core/ucc_coll.c:234 (at 0x400034c80728)
#10 ucc_coll_init (map=<optimized out>, bargs=0xffffffff9370, bargs@entry=0xffffffff9370, task=0xffffffff9350, task@entry=0xffffffff9350) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/coll_score/ucc_coll_score_map.c:130 (at 0x400034c8825c)
#9 ucc_tl_nccl_coll_init (coll_args=0xffffffff9370, team=<optimized out>, task_h=0xffffffff9350) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_team.c:238 (at 0x400254675aac)
#8 ucc_tl_nccl_init_task (coll_args=0xffffffff9370, coll_args@entry=0xffffffff9370, team=0x1fdb1e0, coll_task=0xffffffff92b8, coll_task@entry=0xffffffff92b8) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_coll.c:148 (at 0x400254676830)
#7 ucc_tl_nccl_comm_init (team=0x1fdb1e0, team@entry=0x1fdb1e0) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_team.c:152 (at 0x400254675f34)
#6 ncclCommInitRankConfig (newcomm=0x1fdb310, newcomm@entry=0x1fdb310, nranks=2, nranks@entry=2, commId={...}, myrank=0, myrank@entry=0, config=0xffffffff9210, config@entry=0xffffffff9210) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/init.cc:1798 (at 0x4002c0aaa2fc)
#5 ncclGroupEndInternal (simInfo=0x0, simInfo@entry=0x0) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:546 (at 0x4002c0aa2da0)
#4 groupLaunch (job_=0x15e5020, simInfo=0x0, simInfo@entry=0x0) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:420 (at 0x4002c0aa1c70)
#3 asyncJobLaunch (asyncJobsMain=0x15e5010, asyncJobsMain@entry=0x15e5010, groupAbortFlag=0x15e50d8, groupAbortFlag@entry=0x15e50d8) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:370 (at 0x4002c0aa187c)
#2 usleep () from /lib64/libc.so.6 (at 0x4000342d94b4)
#1 nanosleep () from /lib64/libc.so.6 (at 0x4000342aff94)
#0 clock_nanosleep@@GLIBC_2.17 () from /lib64/libc.so.6 (at 0x4000342aa178)

#18 main (argc=5, argv=0xffffffffbe48) at /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/mp_syevd.c:414 (at 0x403990)
#17 cusolverMpSyevd () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002daf4eb4)
#16 cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002dafa6bc)
#15 cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002dad4790)
#14 cal_send () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcal-linux-sbsa-0.4.4.50_cuda12-archive/lib/libcal.so.0 (at 0x40000254cdac)
#13 ucc::p2p::send(void const*, unsigned long, int, CUstream_st*, cal_memory_type_t) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcal-linux-sbsa-0.4.4.50_cuda12-archive/lib/libcal.so.0 (at 0x40000257c990)
#12 ucc::p2p::p2p_init(void*, unsigned long, int, int, unsigned short, CUstream_st*, cal_memory_type_t, ucc::p2p::p2p_type) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcal-linux-sbsa-0.4.4.50_cuda12-archive/lib/libcal.so.0 (at 0x40000257c5e8)
#11 ucc_collective_init (coll_args=0xffffffff9588, request=0xffffffff9560, team=0x15c0aa0) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/core/ucc_coll.c:234 (at 0x400034c80728)
#10 ucc_coll_init (map=<optimized out>, bargs=0xffffffff9350, bargs@entry=0xffffffff9350, task=0xffffffff9330, task@entry=0xffffffff9330) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/coll_score/ucc_coll_score_map.c:130 (at 0x400034c8825c)
#9 ucc_tl_nccl_coll_init (coll_args=0xffffffff9350, team=<optimized out>, task_h=0xffffffff9330) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_team.c:238 (at 0x400258525aac)
#8 ucc_tl_nccl_init_task (coll_args=0xffffffff9350, coll_args@entry=0xffffffff9350, team=0x90bf20, coll_task=0xffffffff9298, coll_task@entry=0xffffffff9298) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_coll.c:148 (at 0x400258526830)
#7 ucc_tl_nccl_comm_init (team=0x90bf20, team@entry=0x90bf20) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_team.c:152 (at 0x400258525f34)
#6 ncclCommInitRankConfig (newcomm=0x90c050, newcomm@entry=0x90c050, nranks=4, nranks@entry=4, commId={...}, myrank=1, myrank@entry=1, config=0xffffffff91f0, config@entry=0xffffffff91f0) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/init.cc:1798 (at 0x4002585aa2fc)
#5 ncclGroupEndInternal (simInfo=0x0, simInfo@entry=0x0) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:546 (at 0x4002585a2da0)
#4 groupLaunch (job_=0x205c020, simInfo=0x0, simInfo@entry=0x0) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:420 (at 0x4002585a1c70)
#3 asyncJobLaunch (asyncJobsMain=0x205c010, asyncJobsMain@entry=0x205c010, groupAbortFlag=0x205c0d8, groupAbortFlag@entry=0x205c0d8) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:370 (at 0x4002585a187c)
#2 usleep () from /lib64/libc.so.6 (at 0x4000342d94b4)
#1 nanosleep () from /lib64/libc.so.6 (at 0x4000342aff94)
#0 clock_nanosleep@@GLIBC_2.17 () from /lib64/libc.so.6 (at 0x4000342aa178)

#18 main (argc=5, argv=0xffffffffbe48) at /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/mp_syevd.c:414 (at 0x403990)
#17 cusolverMpSyevd () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002daf4eb4)
#16 cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002dafa6bc)
#15 cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcusolvermp-linux-sbsa-0.6.0.712_cuda12-archive/lib/libcusolverMp.so.0 (at 0x40002dad4678)
#14 cal_recv () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcal-linux-sbsa-0.4.4.50_cuda12-archive/lib/libcal.so.0 (at 0x40000254c934)
#13 ucc::p2p::recv(void*, unsigned long, int, CUstream_st*, cal_memory_type_t) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcal-linux-sbsa-0.4.4.50_cuda12-archive/lib/libcal.so.0 (at 0x40000257cafc)
#12 ucc::p2p::p2p_init(void*, unsigned long, int, int, unsigned short, CUstream_st*, cal_memory_type_t, ucc::p2p::p2p_type) () from /capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/libcal-linux-sbsa-0.4.4.50_cuda12-archive/lib/libcal.so.0 (at 0x40000257c5e8)
#11 ucc_collective_init (coll_args=0xffffffff9588, request=0xffffffff9560, team=0x15c0b60) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/core/ucc_coll.c:234 (at 0x400034c80728)
#10 ucc_coll_init (map=<optimized out>, bargs=0xffffffff9350, bargs@entry=0xffffffff9350, task=0xffffffff9330, task@entry=0xffffffff9330) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/coll_score/ucc_coll_score_map.c:130 (at 0x400034c8825c)
#9 ucc_tl_nccl_coll_init (coll_args=0xffffffff9350, team=<optimized out>, task_h=0xffffffff9330) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_team.c:238 (at 0x400258525aac)
#8 ucc_tl_nccl_init_task (coll_args=0xffffffff9350, coll_args@entry=0xffffffff9350, team=0x90c030, coll_task=0xffffffff9298, coll_task@entry=0xffffffff9298) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_coll.c:148 (at 0x400258526830)
#7 ucc_tl_nccl_comm_init (team=0x90c030, team@entry=0x90c030) at /tmp/ialberto/spack-stage/spack-stage-ucc-1.3.0-3ck5jvglljrri6kdfjedmyjq7nmx6yfl/spack-src/src/components/tl/nccl/tl_nccl_team.c:152 (at 0x400258525f34)
#6 ncclCommInitRankConfig (newcomm=0x90c160, newcomm@entry=0x90c160, nranks=4, nranks@entry=4, commId={...}, myrank=2, myrank@entry=2, config=0xffffffff91f0, config@entry=0xffffffff91f0) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/init.cc:1798 (at 0x4002585aa2fc)
#5 ncclGroupEndInternal (simInfo=0x0, simInfo@entry=0x0) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:546 (at 0x4002585a2da0)
#4 groupLaunch (job_=0x2065020, simInfo=0x0, simInfo@entry=0x0) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:420 (at 0x4002585a1c70)
#3 asyncJobLaunch (asyncJobsMain=0x2065010, asyncJobsMain@entry=0x2065010, groupAbortFlag=0x20650d8, groupAbortFlag@entry=0x20650d8) at /tmp/ialberto/spack-stage/spack-stage-nccl-2.22.3-1-jyvybnnnoqhymd344wtvvd53jattmr54/spack-src/src/group.cc:370 (at 0x4002585a187c)
#2 usleep () from /lib64/libc.so.6 (at 0x4000342d94b4)
#1 nanosleep () from /lib64/libc.so.6 (at 0x4000342aff94)
#0 clock_nanosleep@@GLIBC_2.17 () from /lib64/libc.so.6 (at 0x4000342aa174)
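
For reference, an equivalent snapshot can also be collected without Forge by attaching gdb to the hung ranks; a minimal sketch, assuming gdb and pgrep are available on the compute node (reachable e.g. with srun --overlap --jobid=<jobid> --pty bash):

# Dump an all-threads backtrace of every mp_syevd rank on this node to a per-PID file.
for pid in $(pgrep -u "$USER" mp_syevd); do
  gdb -batch -p "$pid" -ex "thread apply all bt" > "bt-$(hostname)-${pid}.txt" 2>&1
done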

All-threads stack traces

The previous snippets are call stacks of the main thread only. What follows was collected from another, equivalent run; each snippet shows the status of all threads of one rank.

unset NCCL_NET_PLUGIN

NCCL_DEBUG=info srun -n4 ddt-client ./mp_syevd -p 2 -q 2
Threads,Function
4,bootstrapRoot (bootstrap.cc:129)
4,  ncclSocketAccept (socket.cc:668)
4,    socketProgressState (socket.cc:561)
4,      socketTryAccept (socket.cc:411)
4,        accept
1,main (mp_syevd.c:414)
1,  cusolverMpSyevd
1,    cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*)
1,      cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int)
1,        cal_allreduce
1,          UCCCollImpl::allreduce(void const*, void*, unsigned long, cudaDataType_t, cal_op_t, CUstream_st*, cal_memory_type_t)
1,            ucc_collective_init (ucc_coll.c:234)
1,              ucc_coll_init (ucc_coll_score_map.c:130)
1,                ucc_tl_nccl_coll_init (tl_nccl_team.c:238)
1,                  ucc_tl_nccl_init_task (tl_nccl_coll.c:148)
1,                    ucc_tl_nccl_comm_init (tl_nccl_team.c:152)
1,                      ncclCommInitRankConfig (init.cc:1798)
1,                        ncclGroupEndInternal (group.cc:546)
1,                          groupLaunch (group.cc:420)
1,                            asyncJobLaunch (group.cc:370)
1,                              usleep
1,                                nanosleep
1,                                  clock_nanosleep@@GLIBC_2.17
1,ncclAsyncJobMain (group.cc:68)
1,  ncclCommInitRankFunc (init.cc:1393)
1,    bootstrapInit (bootstrap.cc:292)
1,      ncclSocketAccept (socket.cc:668)
1,        socketProgressState (socket.cc:561)
1,          socketTryAccept (socket.cc:411)
1,            accept
1,ucs_async_thread_func (thread.c:131)
1,  ucs_event_set_wait (event_set.c:198)
1,    epoll_pwait
2,??
2,  ??
2,    ??
2,      poll
Threads,Function
1,bootstrapRoot (bootstrap.cc:129)
1,  ncclSocketAccept (socket.cc:668)
1,    socketProgressState (socket.cc:561)
1,      socketTryAccept (socket.cc:411)
1,        accept
1,main (mp_syevd.c:414)
1,  cusolverMpSyevd
1,    cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*)
1,      cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int)
1,        cal_recv
1,          ucc::p2p::recv(void*, unsigned long, int, CUstream_st*, cal_memory_type_t)
1,            ucc::p2p::p2p_init(void*, unsigned long, int, int, unsigned short, CUstream_st*, cal_memory_type_t, ucc::p2p::p2p_type)
1,              ucc_collective_init (ucc_coll.c:234)
1,                ucc_coll_init (ucc_coll_score_map.c:130)
1,                  ucc_tl_nccl_coll_init (tl_nccl_team.c:238)
1,                    ucc_tl_nccl_init_task (tl_nccl_coll.c:148)
1,                      ucc_tl_nccl_comm_init (tl_nccl_team.c:152)
1,                        ncclCommInitRankConfig (init.cc:1798)
1,                          ncclGroupEndInternal (group.cc:546)
1,                            groupLaunch (group.cc:420)
1,                              asyncJobLaunch (group.cc:370)
1,                                usleep
1,                                  nanosleep
1,                                    clock_nanosleep@@GLIBC_2.17
1,ncclAsyncJobMain (group.cc:68)
1,  ncclCommInitRankFunc (init.cc:1393)
1,    bootstrapInit (bootstrap.cc:292)
1,      ncclSocketAccept (socket.cc:668)
1,        socketProgressState (socket.cc:561)
1,          socketTryAccept (socket.cc:411)
1,            accept
1,ucs_async_thread_func (thread.c:131)
1,  ucs_event_set_wait (event_set.c:198)
1,    epoll_pwait
2,??
2,  ??
2,    ??
2,      poll
Threads,Function
1,main (mp_syevd.c:414)
1,  cusolverMpSyevd
1,    cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*)
1,      cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int)
1,        cal_bcast
1,          UCCCollImpl::bcast(void*, unsigned long, cudaDataType_t, int, CUstream_st*, cal_memory_type_t)
1,            ucc_collective_init (ucc_coll.c:234)
1,              ucc_coll_init (ucc_coll_score_map.c:130)
1,                ucc_tl_nccl_coll_init (tl_nccl_team.c:238)
1,                  ucc_tl_nccl_init_task (tl_nccl_coll.c:148)
1,                    ucc_tl_nccl_comm_init (tl_nccl_team.c:152)
1,                      ncclCommInitRankConfig (init.cc:1798)
1,                        ncclGroupEndInternal (group.cc:546)
1,                          groupLaunch (group.cc:420)
1,                            asyncJobLaunch (group.cc:370)
1,                              usleep
1,                                nanosleep
1,                                  clock_nanosleep@@GLIBC_2.17
1,ncclAsyncJobMain (group.cc:68)
1,  ncclCommInitRankFunc (init.cc:1393)
1,    bootstrapInit (bootstrap.cc:292)
1,      ncclSocketAccept (socket.cc:668)
1,        socketProgressState (socket.cc:561)
1,          socketTryAccept (socket.cc:411)
1,            accept
1,ucs_async_thread_func (thread.c:131)
1,  ucs_event_set_wait (event_set.c:198)
1,    epoll_pwait
2,??
2,  ??
2,    ??
2,      poll
Threads,Function
1,bootstrapRoot (bootstrap.cc:129)
1,  ncclSocketAccept (socket.cc:668)
1,    socketProgressState (socket.cc:561)
1,      socketTryAccept (socket.cc:411)
1,        accept
1,main (mp_syevd.c:414)
1,  cusolverMpSyevd
1,    cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*)
1,      cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int)
1,        cal_send
1,          ucc::p2p::send(void const*, unsigned long, int, CUstream_st*, cal_memory_type_t)
1,            ucc::p2p::p2p_init(void*, unsigned long, int, int, unsigned short, CUstream_st*, cal_memory_type_t, ucc::p2p::p2p_type)
1,              ucc_collective_init (ucc_coll.c:234)
1,                ucc_coll_init (ucc_coll_score_map.c:130)
1,                  ucc_tl_nccl_coll_init (tl_nccl_team.c:238)
1,                    ucc_tl_nccl_init_task (tl_nccl_coll.c:148)
1,                      ucc_tl_nccl_comm_init (tl_nccl_team.c:152)
1,                        ncclCommInitRankConfig (init.cc:1798)
1,                          ncclGroupEndInternal (group.cc:546)
1,                            groupLaunch (group.cc:420)
1,                              asyncJobLaunch (group.cc:370)
1,                                usleep
1,                                  nanosleep
1,                                    clock_nanosleep@@GLIBC_2.17
1,ncclAsyncJobMain (group.cc:68)
1,  ncclCommInitRankFunc (init.cc:1393)
1,    bootstrapInit (bootstrap.cc:292)
1,      ncclSocketAccept (socket.cc:668)
1,        socketProgressState (socket.cc:561)
1,          socketTryAccept (socket.cc:411)
1,            accept
1,ucs_async_thread_func (thread.c:131)
1,  ucs_event_set_wait (event_set.c:198)
1,    epoll_pwait
2,??
2,  ??
2,    ??
2,      poll

NCCL_NET_PLUGIN=none

NCCL_NET_PLUGIN=none NCCL_DEBUG=info srun -n4 ddt-client ./mp_syevd -p 2 -q 2
Threads,Function
4,bootstrapRoot (bootstrap.cc:129)
4,  ncclSocketAccept (socket.cc:668)
4,    socketProgressState (socket.cc:561)
4,      socketTryAccept (socket.cc:411)
4,        accept
1,main (mp_syevd.c:414)
1,  cusolverMpSyevd
1,    cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*)
1,      cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int)
1,        cal_allreduce
1,          UCCCollImpl::allreduce(void const*, void*, unsigned long, cudaDataType_t, cal_op_t, CUstream_st*, cal_memory_type_t)
1,            ucc_collective_init (ucc_coll.c:234)
1,              ucc_coll_init (ucc_coll_score_map.c:130)
1,                ucc_tl_nccl_coll_init (tl_nccl_team.c:238)
1,                  ucc_tl_nccl_init_task (tl_nccl_coll.c:148)
1,                    ucc_tl_nccl_comm_init (tl_nccl_team.c:152)
1,                      ncclCommInitRankConfig (init.cc:1798)
1,                        ncclGroupEndInternal (group.cc:546)
1,                          groupLaunch (group.cc:420)
1,                            asyncJobLaunch (group.cc:370)
1,                              usleep
1,                                nanosleep
1,                                  clock_nanosleep@@GLIBC_2.17
1,ncclAsyncJobMain (group.cc:68)
1,  ncclCommInitRankFunc (init.cc:1393)
1,    bootstrapInit (bootstrap.cc:292)
1,      ncclSocketAccept (socket.cc:668)
1,        socketProgressState (socket.cc:561)
1,          socketTryAccept (socket.cc:411)
1,            accept
1,ucs_async_thread_func (thread.c:131)
1,  ucs_event_set_wait (event_set.c:198)
1,    epoll_pwait
2,??
2,  ??
2,    ??
2,      poll
Threads,Function
1,bootstrapRoot (bootstrap.cc:129)
1,  ncclSocketAccept (socket.cc:668)
1,    socketProgressState (socket.cc:561)
1,      socketTryAccept (socket.cc:411)
1,        accept
1,main (mp_syevd.c:414)
1,  cusolverMpSyevd
1,    cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*)
1,      cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int)
1,        cal_send
1,          ucc::p2p::send(void const*, unsigned long, int, CUstream_st*, cal_memory_type_t)
1,            ucc::p2p::p2p_init(void*, unsigned long, int, int, unsigned short, CUstream_st*, cal_memory_type_t, ucc::p2p::p2p_type)
1,              ucc_collective_init (ucc_coll.c:234)
1,                ucc_coll_init (ucc_coll_score_map.c:130)
1,                  ucc_tl_nccl_coll_init (tl_nccl_team.c:238)
1,                    ucc_tl_nccl_init_task (tl_nccl_coll.c:148)
1,                      ucc_tl_nccl_comm_init (tl_nccl_team.c:152)
1,                        ncclCommInitRankConfig (init.cc:1798)
1,                          ncclGroupEndInternal (group.cc:546)
1,                            groupLaunch (group.cc:420)
1,                              asyncJobLaunch (group.cc:370)
1,                                usleep
1,                                  nanosleep
1,                                    clock_nanosleep@@GLIBC_2.17
1,ncclAsyncJobMain (group.cc:68)
1,  ncclCommInitRankFunc (init.cc:1393)
1,    bootstrapInit (bootstrap.cc:292)
1,      ncclSocketAccept (socket.cc:668)
1,        socketProgressState (socket.cc:561)
1,          socketTryAccept (socket.cc:411)
1,            accept
1,ucs_async_thread_func (thread.c:131)
1,  ucs_event_set_wait (event_set.c:198)
1,    epoll_pwait
2,??
2,  ??
2,    ??
2,      poll
Threads,Function
1,bootstrapRoot (bootstrap.cc:129)
1,  ncclSocketAccept (socket.cc:668)
1,    socketProgressState (socket.cc:561)
1,      socketTryAccept (socket.cc:411)
1,        accept
1,main (mp_syevd.c:414)
1,  cusolverMpSyevd
1,    cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*)
1,      cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int)
1,        cal_recv
1,          ucc::p2p::recv(void*, unsigned long, int, CUstream_st*, cal_memory_type_t)
1,            ucc::p2p::p2p_init(void*, unsigned long, int, int, unsigned short, CUstream_st*, cal_memory_type_t, ucc::p2p::p2p_type)
1,              ucc_collective_init (ucc_coll.c:234)
1,                ucc_coll_init (ucc_coll_score_map.c:130)
1,                  ucc_tl_nccl_coll_init (tl_nccl_team.c:238)
1,                    ucc_tl_nccl_init_task (tl_nccl_coll.c:148)
1,                      ucc_tl_nccl_comm_init (tl_nccl_team.c:152)
1,                        ncclCommInitRankConfig (init.cc:1798)
1,                          ncclGroupEndInternal (group.cc:546)
1,                            groupLaunch (group.cc:420)
1,                              asyncJobLaunch (group.cc:370)
1,                                usleep
1,                                  nanosleep
1,                                    clock_nanosleep@@GLIBC_2.17
1,ncclAsyncJobMain (group.cc:68)
1,  ncclCommInitRankFunc (init.cc:1393)
1,    bootstrapInit (bootstrap.cc:292)
1,      ncclSocketAccept (socket.cc:668)
1,        socketProgressState (socket.cc:561)
1,          socketTryAccept (socket.cc:411)
1,            accept
1,ucs_async_thread_func (thread.c:131)
1,  ucs_event_set_wait (event_set.c:198)
1,    epoll_pwait
2,??
2,  ??
2,    ??
2,      poll
Threads,Function
1,main (mp_syevd.c:414)
1,  cusolverMpSyevd
1,    cusolverStatus_t mp_syevd<double, double>(cusolverMpHandle*, char*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, long, long, cusolverMpMatrixDescriptor const*, cudaDataType_t, cudaDataType_t, void*, unsigned long, void*, unsigned long, int*)
1,      cusolverStatus_t mp_sytrd<double, double>(cusolverMpHandle*, cublasFillMode_t, long, void*, long, long, cusolverMpMatrixDescriptor const*, void*, void*, void*, cudaDataType_t, cudaDataType_t, void*, void*, int*, int)
1,        cal_bcast
1,          UCCCollImpl::bcast(void*, unsigned long, cudaDataType_t, int, CUstream_st*, cal_memory_type_t)
1,            ucc_collective_init (ucc_coll.c:234)
1,              ucc_coll_init (ucc_coll_score_map.c:130)
1,                ucc_tl_nccl_coll_init (tl_nccl_team.c:238)
1,                  ucc_tl_nccl_init_task (tl_nccl_coll.c:148)
1,                    ucc_tl_nccl_comm_init (tl_nccl_team.c:152)
1,                      ncclCommInitRankConfig (init.cc:1798)
1,                        ncclGroupEndInternal (group.cc:546)
1,                          groupLaunch (group.cc:420)
1,                            asyncJobLaunch (group.cc:370)
1,                              usleep
1,                                nanosleep
1,                                  clock_nanosleep@@GLIBC_2.17
1,ncclAsyncJobMain (group.cc:68)
1,  ncclCommInitRankFunc (init.cc:1393)
1,    bootstrapInit (bootstrap.cc:292)
1,      ncclSocketAccept (socket.cc:668)
1,        socketProgressState (socket.cc:561)
1,          socketTryAccept (socket.cc:411)
1,            accept
1,ucs_async_thread_func (thread.c:131)
1,  ucs_event_set_wait (event_set.c:198)
1,    epoll_pwait
2,??
2,  ??
2,    ??
2,      poll

If you set NCCL_SOCKET_IFNAME=hsn, does the non-plugin run still hang? This should restrict NCCL bootstrapping to the appropriate network interfaces on the system.

It still hangs, both with and without the plugin.

NCCL_NET_PLUGIN=none NCCL_SOCKET_IFNAME=hsn

NCCL_DEBUG=info NCCL_NET_PLUGIN=none NCCL_SOCKET_IFNAME=hsn srun -n 4 -u -o without-plugin-hsn-%t.out ./mp_syevd -p 2 -q 2

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/without-plugin-hsn-0.out

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005420:97541:97541 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97541:97541 [0] NCCL INFO Bootstrap : Using hsn0:172.28.16.176<0>
nid005420:97541:97541 [0] NCCL INFO cudaDriverVersion 12040
nid005420:97541:97541 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005420:97541:97541 [0] NCCL INFO Comm config Blocking set to 1
nid005420:97541:97669 [0] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005420:97541:97669 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97541:97669 [0] NCCL INFO NET/IB : No device found.
nid005420:97541:97669 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97541:97669 [0] NCCL INFO NET/Socket : Using [0]hsn0:172.28.16.176<0> [1]hsn2:172.28.16.178<0> [2]hsn3:172.28.16.179<0> [3]hsn1:172.28.16.177<0>
nid005420:97541:97669 [0] NCCL INFO Using network Socket
slurmstepd: error: *** STEP 1261977.3 ON nid005420 CANCELLED AT 2025-07-08T09:48:20 ***

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/without-plugin-hsn-1.out

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005420:97542:97542 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97542:97542 [1] NCCL INFO Bootstrap : Using hsn0:172.28.16.176<0>
nid005420:97542:97542 [1] NCCL INFO cudaDriverVersion 12040
nid005420:97542:97542 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005420:97542:97542 [1] NCCL INFO Comm config Blocking set to 1
nid005420:97542:97667 [1] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005420:97542:97667 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97542:97667 [1] NCCL INFO NET/IB : No device found.
nid005420:97542:97667 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97542:97667 [1] NCCL INFO NET/Socket : Using [0]hsn0:172.28.16.176<0> [1]hsn2:172.28.16.178<0> [2]hsn3:172.28.16.179<0> [3]hsn1:172.28.16.177<0>
nid005420:97542:97667 [1] NCCL INFO Using network Socket

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/without-plugin-hsn-2.out

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005420:97543:97543 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97543:97543 [2] NCCL INFO Bootstrap : Using hsn0:172.28.16.176<0>
nid005420:97543:97543 [2] NCCL INFO cudaDriverVersion 12040
nid005420:97543:97543 [2] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005420:97543:97543 [2] NCCL INFO Comm config Blocking set to 1
nid005420:97543:97668 [2] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005420:97543:97668 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97543:97668 [2] NCCL INFO NET/IB : No device found.
nid005420:97543:97668 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97543:97668 [2] NCCL INFO NET/Socket : Using [0]hsn0:172.28.16.176<0> [1]hsn2:172.28.16.178<0> [2]hsn3:172.28.16.179<0> [3]hsn1:172.28.16.177<0>
nid005420:97543:97668 [2] NCCL INFO Using network Socket

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/without-plugin-hsn-3.out

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005420:97544:97544 [3] NCCL INFO cudaDriverVersion 12040
nid005420:97544:97544 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97544:97544 [3] NCCL INFO Bootstrap : Using hsn0:172.28.16.176<0>
nid005420:97544:97544 [3] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005420:97544:97544 [3] NCCL INFO Comm config Blocking set to 1
nid005420:97544:97670 [3] NCCL INFO NET/Plugin: Could not find: none libnccl-net-none.so. Using internal network plugin.
nid005420:97544:97670 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97544:97670 [3] NCCL INFO NET/IB : No device found.
nid005420:97544:97670 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:97544:97670 [3] NCCL INFO NET/Socket : Using [0]hsn0:172.28.16.176<0> [1]hsn2:172.28.16.178<0> [2]hsn3:172.28.16.179<0> [3]hsn1:172.28.16.177<0>
nid005420:97544:97670 [3] NCCL INFO Using network Socket

unset NCCL_NET_PLUGIN, NCCL_SOCKET_IFNAME=hsn

NCCL_DEBUG=info NCCL_SOCKET_IFNAME=hsn srun -n 4 -u -o with-plugin-hsn-%t.out ./mp_syevd -p 2 -q 2

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/with-plugin-hsn-0.out

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005420:98436:98436 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:98436:98436 [0] NCCL INFO Bootstrap : Using hsn0:172.28.16.176<0>
nid005420:98436:98436 [0] NCCL INFO cudaDriverVersion 12040
nid005420:98436:98436 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005420:98436:98436 [0] NCCL INFO Comm config Blocking set to 1
nid005420:98436:98540 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005420:98436:98540 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005420:98436:98540 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005420:98436:98540 [0] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005420:98436:98540 [0] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005420:98436:98540 [0] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005420:98436:98540 [0] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005420:98436:98540 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005420:98436:98540 [0] NCCL INFO NET/OFI Creating one domain per process
nid005420:98436:98540 [0] NCCL INFO NET/OFI Support for global registrations: false
nid005420:98436:98540 [0] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005420:98436:98540 [0] NCCL INFO Using network Libfabric
slurmstepd: error: *** STEP 1261977.4 ON nid005420 CANCELLED AT 2025-07-08T09:49:49 ***

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/with-plugin-hsn-1.out

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005420:98437:98437 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:98437:98437 [1] NCCL INFO Bootstrap : Using hsn0:172.28.16.176<0>
nid005420:98437:98437 [1] NCCL INFO cudaDriverVersion 12040
nid005420:98437:98437 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005420:98437:98437 [1] NCCL INFO Comm config Blocking set to 1
nid005420:98437:98538 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005420:98437:98538 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005420:98437:98538 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005420:98437:98538 [1] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005420:98437:98538 [1] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005420:98437:98538 [1] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005420:98437:98538 [1] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005420:98437:98538 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005420:98437:98538 [1] NCCL INFO NET/OFI Creating one domain per process
nid005420:98437:98538 [1] NCCL INFO NET/OFI Support for global registrations: false
nid005420:98437:98538 [1] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005420:98437:98538 [1] NCCL INFO Using network Libfabric

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/with-plugin-hsn-2.out

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005420:98438:98438 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:98438:98438 [2] NCCL INFO Bootstrap : Using hsn0:172.28.16.176<0>
nid005420:98438:98438 [2] NCCL INFO cudaDriverVersion 12040
nid005420:98438:98438 [2] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005420:98438:98438 [2] NCCL INFO Comm config Blocking set to 1
nid005420:98438:98539 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005420:98438:98539 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005420:98438:98539 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005420:98438:98539 [2] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005420:98438:98539 [2] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005420:98438:98539 [2] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005420:98438:98539 [2] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005420:98438:98539 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005420:98438:98539 [2] NCCL INFO NET/OFI Creating one domain per process
nid005420:98438:98539 [2] NCCL INFO NET/OFI Support for global registrations: false
nid005420:98438:98539 [2] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005420:98438:98539 [2] NCCL INFO Using network Libfabric

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/with-plugin-hsn-3.out

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005420:98439:98439 [3] NCCL INFO cudaDriverVersion 12040
nid005420:98439:98439 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to hsn
nid005420:98439:98439 [3] NCCL INFO Bootstrap : Using hsn0:172.28.16.176<0>
nid005420:98439:98439 [3] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005420:98439:98439 [3] NCCL INFO Comm config Blocking set to 1
nid005420:98439:98541 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005420:98439:98541 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005420:98439:98541 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005420:98439:98541 [3] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005420:98439:98541 [3] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005420:98439:98541 [3] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005420:98439:98541 [3] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005420:98439:98541 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005420:98439:98541 [3] NCCL INFO NET/OFI Creating one domain per process
nid005420:98439:98541 [3] NCCL INFO NET/OFI Support for global registrations: false
nid005420:98439:98541 [3] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005420:98439:98541 [3] NCCL INFO Using network Libfabric

Could you please try other samples, e.g. "mp_potrf_potrs -p 2 -q 2", "mp_syevd" (without arguments, defaulting to two processes), or "mp_potrf_potrs" (without arguments, also defaulting to two processes)? SYEVD is special in that it uses some NCCL functionality that the other APIs don't.

I tried:

  • POTRF-POTRS (default)
  • POTRF-POTRS 4 ranks
  • SYEVD 2x1 (default)
  • SYEVD 1x2

and all of them work.

I ran SYEVD as 2x1 and 1x2 specifically to highlight that those two configurations work, unlike the 4-rank (2x2) one; the corresponding invocations are sketched below.
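
For reference, the SYEVD 2x1 and 1x2 runs correspond to invocations along these lines (the output-file names are illustrative, not copied from the session):

NCCL_DEBUG=info srun -n2 -u -o syevd-2x1-%t.txt ./mp_syevd -p 2 -q 1
NCCL_DEBUG=info srun -n2 -u -o syevd-1x2-%t.txt ./mp_syevd -p 1 -q 2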

POTRF-POTRS (default)

NCCL_DEBUG=info srun -n2 -u -o potrf-potrs-2ranks-%t.txt ./mp_potrf_potrs

./potrf-potrs-2ranks-0.txt

Parameters: m=1 n=10 nrhs=1 mbA=2 nbA=2 mbB=2 nbB=2 mbQ=2 nbQ=2 mbZ=0 nbZ=0ia=3 ja=1 ib=3 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=1 grid_layout= verbose=0
nid005424:254482:254482 [0] NCCL INFO Bootstrap : Using nmn0:10.100.48.75<0>
nid005424:254482:254482 [0] NCCL INFO cudaDriverVersion 12040
nid005424:254482:254482 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005424:254482:254482 [0] NCCL INFO Comm config Blocking set to 1
nid005424:254482:254533 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005424:254482:254533 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005424:254482:254533 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005424:254482:254533 [0] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005424:254482:254533 [0] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005424:254482:254533 [0] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005424:254482:254533 [0] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005424:254482:254533 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005424:254482:254533 [0] NCCL INFO NET/OFI Creating one domain per process
nid005424:254482:254533 [0] NCCL INFO NET/OFI Support for global registrations: false
nid005424:254482:254533 [0] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005424:254482:254533 [0] NCCL INFO Using network Libfabric
nid005424:254482:254533 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:254482:254533 [0] NCCL INFO ncclCommInitRank comm 0x2158cc50 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0xac9e9aa43c4c42a8 - Init START
nid005424:254482:254533 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:254482:254533 [0] NCCL INFO comm 0x2158cc50 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:254482:254533 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:254482:254533 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:254482:254533 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254482:254533 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254482:254533 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005424:254482:254533 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005424:254482:254533 [0] NCCL INFO ncclCommInitRank comm 0x2158cc50 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0xac9e9aa43c4c42a8 - Init COMPLETE
nid005424:254482:254533 [0] NCCL INFO Init timings: rank 1 nranks 2 total 2.18 (kernels 0.10, bootstrap 1.91, allgathers 0.00, topo 0.16, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254482:254590 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254590 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254590 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254590 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254590 [0] NCCL INFO Channel 04/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254590 [0] NCCL INFO Channel 05/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254590 [0] NCCL INFO Channel 06/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254590 [0] NCCL INFO Channel 07/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254590 [0] NCCL INFO Connected all rings
nid005424:254482:254482 [0] NCCL INFO Comm config Blocking set to 1
nid005424:254482:254595 [0] NCCL INFO Using network Libfabric
nid005424:254482:254595 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:254482:254595 [0] NCCL INFO ncclCommInitRank comm 0x215d38b0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x12768222fba868e - Init START
nid005424:254482:254595 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:254482:254595 [0] NCCL INFO comm 0x215d38b0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:254482:254595 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:254482:254595 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:254482:254595 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254482:254595 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254482:254595 [0] NCCL INFO ncclCommInitRank comm 0x215d38b0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x12768222fba868e - Init COMPLETE
nid005424:254482:254595 [0] NCCL INFO Init timings: rank 1 nranks 2 total 0.39 (kernels 0.00, bootstrap 0.10, allgathers 0.00, topo 0.28, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254482:254600 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254600 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254600 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254600 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254600 [0] NCCL INFO Channel 04/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254600 [0] NCCL INFO Channel 05/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254600 [0] NCCL INFO Channel 06/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254600 [0] NCCL INFO Channel 07/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:254482:254600 [0] NCCL INFO Connected all rings
nid005424:254482:254482 [0] NCCL INFO Comm config Blocking set to 1
nid005424:254482:254605 [0] NCCL INFO Using network Libfabric
nid005424:254482:254605 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:254482:254605 [0] NCCL INFO ncclCommInitRank comm 0x24a33c80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x1d0386747e21547c - Init START
nid005424:254482:254605 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:254482:254605 [0] NCCL INFO comm 0x24a33c80 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:254482:254605 [0] NCCL INFO Channel 00/08 :    0   1
nid005424:254482:254605 [0] NCCL INFO Channel 01/08 :    0   1
nid005424:254482:254605 [0] NCCL INFO Channel 02/08 :    0   1
nid005424:254482:254605 [0] NCCL INFO Channel 03/08 :    0   1
nid005424:254482:254605 [0] NCCL INFO Channel 04/08 :    0   1
nid005424:254482:254605 [0] NCCL INFO Channel 05/08 :    0   1
nid005424:254482:254605 [0] NCCL INFO Channel 06/08 :    0   1
nid005424:254482:254605 [0] NCCL INFO Channel 07/08 :    0   1
nid005424:254482:254605 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:254482:254605 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:254482:254605 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254482:254605 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254482:254605 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:254482:254605 [0] NCCL INFO ncclCommInitRank comm 0x24a33c80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x1d0386747e21547c - Init COMPLETE
nid005424:254482:254605 [0] NCCL INFO Init timings: rank 0 nranks 2 total 0.17 (kernels 0.00, bootstrap 0.01, allgathers 0.00, topo 0.15, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254482:254613 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254482:254613 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254482:254613 [0] NCCL INFO Channel 02/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254482:254613 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254482:254613 [0] NCCL INFO Channel 04/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254482:254613 [0] NCCL INFO Channel 05/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254482:254613 [0] NCCL INFO Channel 06/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254482:254613 [0] NCCL INFO Channel 07/1 : 0[0] -> 1[1] via P2P/CUMEM

|b - A*x|_inf = 4.440892E-16
|x|_inf = 6.847888E-01
|b|_inf = 1.000000E+01
|A|_inf = 1.843924E+01
|b - A*x|/(|A|*|x|+|b|) = 1.962653E-17
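
As a quick sanity check on the reported relative residual: 4.440892E-16 / (1.843924E+01 * 6.847888E-01 + 1.000000E+01) = 4.440892E-16 / 2.2627E+01 ≈ 1.96E-17, which matches the printed value, i.e. the solve is accurate to machine precision.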

nid005424:254482:254482 [0] NCCL INFO comm 0x2158cc50 rank 1 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005424:254482:254482 [0] NCCL INFO comm 0x215d38b0 rank 1 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005424:254482:254482 [0] NCCL INFO comm 0x24a33c80 rank 0 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE

./potrf-potrs-2ranks-1.txt

Parameters: m=1 n=10 nrhs=1 mbA=2 nbA=2 mbB=2 nbB=2 mbQ=2 nbQ=2 mbZ=0 nbZ=0ia=3 ja=1 ib=3 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=1 grid_layout= verbose=0
nid005424:254483:254483 [1] NCCL INFO Bootstrap : Using nmn0:10.100.48.75<0>
nid005424:254483:254483 [1] NCCL INFO cudaDriverVersion 12040
nid005424:254483:254483 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005424:254483:254483 [1] NCCL INFO Comm config Blocking set to 1
nid005424:254483:254534 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005424:254483:254534 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005424:254483:254534 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005424:254483:254534 [1] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005424:254483:254534 [1] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005424:254483:254534 [1] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005424:254483:254534 [1] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005424:254483:254534 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005424:254483:254534 [1] NCCL INFO NET/OFI Creating one domain per process
nid005424:254483:254534 [1] NCCL INFO NET/OFI Support for global registrations: false
nid005424:254483:254534 [1] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005424:254483:254534 [1] NCCL INFO Using network Libfabric
nid005424:254483:254534 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:254483:254534 [1] NCCL INFO ncclCommInitRank comm 0x3142dc30 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xac9e9aa43c4c42a8 - Init START
nid005424:254483:254534 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:254483:254534 [1] NCCL INFO comm 0x3142dc30 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:254483:254534 [1] NCCL INFO Channel 00/08 :    0   1
nid005424:254483:254534 [1] NCCL INFO Channel 01/08 :    0   1
nid005424:254483:254534 [1] NCCL INFO Channel 02/08 :    0   1
nid005424:254483:254534 [1] NCCL INFO Channel 03/08 :    0   1
nid005424:254483:254534 [1] NCCL INFO Channel 04/08 :    0   1
nid005424:254483:254534 [1] NCCL INFO Channel 05/08 :    0   1
nid005424:254483:254534 [1] NCCL INFO Channel 06/08 :    0   1
nid005424:254483:254534 [1] NCCL INFO Channel 07/08 :    0   1
nid005424:254483:254534 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:254483:254534 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:254483:254534 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254483:254534 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254483:254534 [1] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:254483:254534 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005424:254483:254534 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005424:254483:254534 [1] NCCL INFO ncclCommInitRank comm 0x3142dc30 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xac9e9aa43c4c42a8 - Init COMPLETE
nid005424:254483:254534 [1] NCCL INFO Init timings: rank 0 nranks 2 total 2.16 (kernels 0.08, bootstrap 1.91, allgathers 0.00, topo 0.16, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254483:254589 [1] NCCL INFO Channel 00/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254589 [1] NCCL INFO Channel 01/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254589 [1] NCCL INFO Channel 02/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254589 [1] NCCL INFO Channel 03/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254589 [1] NCCL INFO Channel 04/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254589 [1] NCCL INFO Channel 05/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254589 [1] NCCL INFO Channel 06/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254589 [1] NCCL INFO Channel 07/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254589 [1] NCCL INFO Connected all rings
nid005424:254483:254483 [1] NCCL INFO Comm config Blocking set to 1
nid005424:254483:254594 [1] NCCL INFO Using network Libfabric
nid005424:254483:254594 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:254483:254594 [1] NCCL INFO ncclCommInitRank comm 0x31474890 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x12768222fba868e - Init START
nid005424:254483:254594 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:254483:254594 [1] NCCL INFO comm 0x31474890 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:254483:254594 [1] NCCL INFO Channel 00/08 :    0   1
nid005424:254483:254594 [1] NCCL INFO Channel 01/08 :    0   1
nid005424:254483:254594 [1] NCCL INFO Channel 02/08 :    0   1
nid005424:254483:254594 [1] NCCL INFO Channel 03/08 :    0   1
nid005424:254483:254594 [1] NCCL INFO Channel 04/08 :    0   1
nid005424:254483:254594 [1] NCCL INFO Channel 05/08 :    0   1
nid005424:254483:254594 [1] NCCL INFO Channel 06/08 :    0   1
nid005424:254483:254594 [1] NCCL INFO Channel 07/08 :    0   1
nid005424:254483:254594 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:254483:254594 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:254483:254594 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254483:254594 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254483:254594 [1] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:254483:254594 [1] NCCL INFO ncclCommInitRank comm 0x31474890 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x12768222fba868e - Init COMPLETE
nid005424:254483:254594 [1] NCCL INFO Init timings: rank 0 nranks 2 total 0.39 (kernels 0.00, bootstrap 0.10, allgathers 0.00, topo 0.28, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254483:254601 [1] NCCL INFO Channel 00/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254601 [1] NCCL INFO Channel 01/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254601 [1] NCCL INFO Channel 02/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254601 [1] NCCL INFO Channel 03/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254601 [1] NCCL INFO Channel 04/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254601 [1] NCCL INFO Channel 05/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254601 [1] NCCL INFO Channel 06/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254601 [1] NCCL INFO Channel 07/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:254483:254601 [1] NCCL INFO Connected all rings
nid005424:254483:254483 [1] NCCL INFO Comm config Blocking set to 1
nid005424:254483:254606 [1] NCCL INFO Using network Libfabric
nid005424:254483:254606 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:254483:254606 [1] NCCL INFO ncclCommInitRank comm 0x33e24650 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x1d0386747e21547c - Init START
nid005424:254483:254606 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:254483:254606 [1] NCCL INFO comm 0x33e24650 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:254483:254606 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:254483:254606 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:254483:254606 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254483:254606 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254483:254606 [1] NCCL INFO ncclCommInitRank comm 0x33e24650 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x1d0386747e21547c - Init COMPLETE
nid005424:254483:254606 [1] NCCL INFO Init timings: rank 1 nranks 2 total 0.16 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.15, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254483:254611 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254483:254611 [1] NCCL INFO Channel 01/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254483:254611 [1] NCCL INFO Channel 02/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254483:254611 [1] NCCL INFO Channel 03/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254483:254611 [1] NCCL INFO Channel 04/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254483:254611 [1] NCCL INFO Channel 05/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254483:254611 [1] NCCL INFO Channel 06/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254483:254611 [1] NCCL INFO Channel 07/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254483:254483 [1] NCCL INFO comm 0x3142dc30 rank 0 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005424:254483:254483 [1] NCCL INFO comm 0x31474890 rank 0 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005424:254483:254483 [1] NCCL INFO comm 0x33e24650 rank 1 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE

POTRF-POTRS 4 ranks

NCCL_DEBUG=info srun -n4 -u -o potrf-potrs-4ranks-%t.txt ./mp_potrf_potrs -p 2 -q 2

./potrf-potrs-4ranks-0.txt

Parameters: m=1 n=10 nrhs=1 mbA=2 nbA=2 mbB=2 nbB=2 mbQ=2 nbQ=2 mbZ=0 nbZ=0ia=3 ja=1 ib=3 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005424:253796:253796 [0] NCCL INFO Bootstrap : Using nmn0:10.100.48.75<0>
nid005424:253796:253796 [0] NCCL INFO cudaDriverVersion 12040
nid005424:253796:253796 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005424:253796:253796 [0] NCCL INFO Comm config Blocking set to 1
nid005424:253796:253896 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005424:253796:253896 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005424:253796:253896 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005424:253796:253896 [0] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005424:253796:253896 [0] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005424:253796:253896 [0] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005424:253796:253896 [0] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005424:253796:253896 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005424:253796:253896 [0] NCCL INFO NET/OFI Creating one domain per process
nid005424:253796:253896 [0] NCCL INFO NET/OFI Support for global registrations: false
nid005424:253796:253896 [0] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005424:253796:253896 [0] NCCL INFO Using network Libfabric
nid005424:253796:253896 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:253796:253896 [0] NCCL INFO ncclCommInitRank comm 0x4d494a0 rank 1 nranks 4 cudaDev 0 nvmlDev 0 busId 901000 commId 0xdb9dc2f9efbd4b3f - Init START
nid005424:253796:253896 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:253796:253896 [0] NCCL INFO NVLS multicast support is not available on dev 0
nid005424:253796:253896 [0] NCCL INFO comm 0x4d494a0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
nid005424:253796:253896 [0] NCCL INFO Trees [0] 3/-1/-1->1->0 [1] 3/-1/-1->1->0 [2] 3/-1/-1->1->0 [3] 3/-1/-1->1->0 [4] 2/-1/-1->1->3 [5] 2/-1/-1->1->3 [6] 2/-1/-1->1->3 [7] 2/-1/-1->1->3 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] 3/-1/-1->1->0 [13] 3/-1/-1->1->0 [14] 3/-1/-1->1->0 [15] 3/-1/-1->1->0 [16] 2/-1/-1->1->3 [17] 2/-1/-1->1->3 [18] 2/-1/-1->1->3 [19] 2/-1/-1->1->3 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
nid005424:253796:253896 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:253796:253896 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005424:253796:253896 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005424:253796:253896 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005424:253796:253896 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005424:253796:253896 [0] NCCL INFO ncclCommInitRank comm 0x4d494a0 rank 1 nranks 4 cudaDev 0 nvmlDev 0 busId 901000 commId 0xdb9dc2f9efbd4b3f - Init COMPLETE
nid005424:253796:253896 [0] NCCL INFO Init timings: rank 1 nranks 4 total 3.31 (kernels 0.09, bootstrap 2.54, allgathers 0.12, topo 0.47, graphs 0.03, connections 0.04, rest 0.02)
nid005424:253796:253928 [0] NCCL INFO Channel 01/0 : 1[0] -> 2[3] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 03/0 : 1[0] -> 2[3] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 07/0 : 1[0] -> 2[3] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 09/0 : 1[0] -> 2[3] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 13/0 : 1[0] -> 2[3] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 15/0 : 1[0] -> 2[3] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 19/0 : 1[0] -> 2[3] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 21/0 : 1[0] -> 2[3] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 00/0 : 1[0] -> 3[2] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 04/0 : 1[0] -> 3[2] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 06/0 : 1[0] -> 3[2] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 10/0 : 1[0] -> 3[2] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 12/0 : 1[0] -> 3[2] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 16/0 : 1[0] -> 3[2] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 18/0 : 1[0] -> 3[2] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 22/0 : 1[0] -> 3[2] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 05/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 08/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 11/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 14/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 17/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 20/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Channel 23/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253928 [0] NCCL INFO Connected all rings
nid005424:253796:253796 [0] NCCL INFO Comm config Blocking set to 1
nid005424:253796:253942 [0] NCCL INFO Using network Libfabric
nid005424:253796:253942 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:253796:253942 [0] NCCL INFO ncclCommInitRank comm 0x4d90100 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x8dfe9821e42fb0c9 - Init START
nid005424:253796:253942 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:253796:253942 [0] NCCL INFO comm 0x4d90100 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:253796:253942 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:253796:253942 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:253796:253942 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:253796:253942 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:253796:253942 [0] NCCL INFO ncclCommInitRank comm 0x4d90100 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x8dfe9821e42fb0c9 - Init COMPLETE
nid005424:253796:253942 [0] NCCL INFO Init timings: rank 1 nranks 2 total 0.16 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.16, graphs 0.00, connections 0.01, rest 0.00)
nid005424:253796:253949 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253949 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253949 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253949 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253949 [0] NCCL INFO Channel 04/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253949 [0] NCCL INFO Channel 05/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253949 [0] NCCL INFO Channel 06/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253949 [0] NCCL INFO Channel 07/0 : 1[0] -> 0[1] via P2P/CUMEM
nid005424:253796:253949 [0] NCCL INFO Connected all rings
nid005424:253796:253796 [0] NCCL INFO Comm config Blocking set to 1
nid005424:253796:253954 [0] NCCL INFO Using network Libfabric
nid005424:253796:253954 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:253796:253954 [0] NCCL INFO ncclCommInitRank comm 0x5718310 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0xfb66864debd73870 - Init START
nid005424:253796:253954 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:253796:253954 [0] NCCL INFO comm 0x5718310 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:253796:253954 [0] NCCL INFO Channel 00/08 :    0   1
nid005424:253796:253954 [0] NCCL INFO Channel 01/08 :    0   1
nid005424:253796:253954 [0] NCCL INFO Channel 02/08 :    0   1
nid005424:253796:253954 [0] NCCL INFO Channel 03/08 :    0   1
nid005424:253796:253954 [0] NCCL INFO Channel 04/08 :    0   1
nid005424:253796:253954 [0] NCCL INFO Channel 05/08 :    0   1
nid005424:253796:253954 [0] NCCL INFO Channel 06/08 :    0   1
nid005424:253796:253954 [0] NCCL INFO Channel 07/08 :    0   1
nid005424:253796:253954 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:253796:253954 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:253796:253954 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:253796:253954 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:253796:253954 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:253796:253954 [0] NCCL INFO ncclCommInitRank comm 0x5718310 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0xfb66864debd73870 - Init COMPLETE
nid005424:253796:253954 [0] NCCL INFO Init timings: rank 0 nranks 2 total 0.33 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.31, graphs 0.00, connections 0.02, rest 0.00)
nid005424:253796:253964 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005424:253796:253964 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005424:253796:253964 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005424:253796:253964 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005424:253796:253964 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005424:253796:253964 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005424:253796:253964 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005424:253796:253964 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005424:253796:253964 [0] NCCL INFO Connected all rings
nid005424:253796:253796 [0] NCCL INFO Comm config Blocking set to 1
nid005424:253796:254003 [0] NCCL INFO Using network Libfabric
nid005424:253796:254003 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:253796:254003 [0] NCCL INFO ncclCommInitRank comm 0x7345f30 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 901000 commId 0x79212cb16b67eedd - Init START
nid005424:253796:254003 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:253796:254003 [0] NCCL INFO NVLS multicast support is not available on dev 0
nid005424:253796:254003 [0] NCCL INFO comm 0x7345f30 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
nid005424:253796:254003 [0] NCCL INFO Channel 00/24 :    0   1   2   3
nid005424:253796:254003 [0] NCCL INFO Channel 01/24 :    0   1   3   2
nid005424:253796:254003 [0] NCCL INFO Channel 02/24 :    0   2   3   1
nid005424:253796:254003 [0] NCCL INFO Channel 03/24 :    0   2   1   3
nid005424:253796:254003 [0] NCCL INFO Channel 04/24 :    0   3   1   2
nid005424:253796:254003 [0] NCCL INFO Channel 05/24 :    0   3   2   1
nid005424:253796:254003 [0] NCCL INFO Channel 06/24 :    0   1   2   3
nid005424:253796:254003 [0] NCCL INFO Channel 07/24 :    0   1   3   2
nid005424:253796:254003 [0] NCCL INFO Channel 08/24 :    0   2   3   1
nid005424:253796:254003 [0] NCCL INFO Channel 09/24 :    0   2   1   3
nid005424:253796:254003 [0] NCCL INFO Channel 10/24 :    0   3   1   2
nid005424:253796:254003 [0] NCCL INFO Channel 11/24 :    0   3   2   1
nid005424:253796:254003 [0] NCCL INFO Channel 12/24 :    0   1   2   3
nid005424:253796:254003 [0] NCCL INFO Channel 13/24 :    0   1   3   2
nid005424:253796:254003 [0] NCCL INFO Channel 14/24 :    0   2   3   1
nid005424:253796:254003 [0] NCCL INFO Channel 15/24 :    0   2   1   3
nid005424:253796:254003 [0] NCCL INFO Channel 16/24 :    0   3   1   2
nid005424:253796:254003 [0] NCCL INFO Channel 17/24 :    0   3   2   1
nid005424:253796:254003 [0] NCCL INFO Channel 18/24 :    0   1   2   3
nid005424:253796:254003 [0] NCCL INFO Channel 19/24 :    0   1   3   2
nid005424:253796:254003 [0] NCCL INFO Channel 20/24 :    0   2   3   1
nid005424:253796:254003 [0] NCCL INFO Channel 21/24 :    0   2   1   3
nid005424:253796:254003 [0] NCCL INFO Channel 22/24 :    0   3   1   2
nid005424:253796:254003 [0] NCCL INFO Channel 23/24 :    0   3   2   1
nid005424:253796:254003 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 2/-1/-1->0->-1 [5] 2/-1/-1->0->-1 [6] 2/-1/-1->0->-1 [7] 2/-1/-1->0->-1 [8] 3/-1/-1->0->1 [9] 3/-1/-1->0->1 [10] 3/-1/-1->0->1 [11] 3/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 2/-1/-1->0->-1 [17] 2/-1/-1->0->-1 [18] 2/-1/-1->0->-1 [19] 2/-1/-1->0->-1 [20] 3/-1/-1->0->1 [21] 3/-1/-1->0->1 [22] 3/-1/-1->0->1 [23] 3/-1/-1->0->1
nid005424:253796:254003 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:253796:254003 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005424:253796:254003 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005424:253796:254003 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:253796:254003 [0] NCCL INFO ncclCommInitRank comm 0x7345f30 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 901000 commId 0x79212cb16b67eedd - Init COMPLETE
nid005424:253796:254003 [0] NCCL INFO Init timings: rank 0 nranks 4 total 0.41 (kernels 0.00, bootstrap 0.01, allgathers 0.00, topo 0.32, graphs 0.03, connections 0.04, rest 0.01)
nid005424:253796:254021 [0] NCCL INFO Channel 01/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 03/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 05/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 07/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 09/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 11/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 13/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 15/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 17/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 19/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 21/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 23/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 25/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 27/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 29/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254021 [0] NCCL INFO Channel 31/1 : 0[0] -> 2[2] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 00/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 02/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 04/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 06/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 08/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 10/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 12/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 14/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 16/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 18/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 20/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 22/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 24/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 26/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 28/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254056 [0] NCCL INFO Channel 30/1 : 0[0] -> 3[3] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 05/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 07/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 09/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 11/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 13/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 15/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 17/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 19/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 21/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 23/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 25/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 27/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 29/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:253796:254068 [0] NCCL INFO Channel 31/1 : 0[0] -> 1[1] via P2P/CUMEM

|b - A*x|_inf = 4.440892E-16
|x|_inf = 6.847888E-01
|b|_inf = 1.000000E+01
|A|_inf = 1.843924E+01
|b - A*x|/(|A|*|x|+|b|) = 1.962653E-17

nid005424:253796:253796 [0] NCCL INFO comm 0x4d494a0 rank 1 nranks 4 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005424:253796:253796 [0] NCCL INFO comm 0x5718310 rank 0 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005424:253796:253796 [0] NCCL INFO comm 0x4d90100 rank 1 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005424:253796:253796 [0] NCCL INFO comm 0x7345f30 rank 0 nranks 4 cudaDev 0 busId 901000 - Destroy COMPLETE

./potrf-potrs-4ranks-1.txt

Parameters: m=1 n=10 nrhs=1 mbA=2 nbA=2 mbB=2 nbB=2 mbQ=2 nbQ=2 mbZ=0 nbZ=0ia=3 ja=1 ib=3 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005424:253797:253797 [1] NCCL INFO Bootstrap : Using nmn0:10.100.48.75<0>
nid005424:253797:253797 [1] NCCL INFO cudaDriverVersion 12040
nid005424:253797:253797 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005424:253797:253797 [1] NCCL INFO Comm config Blocking set to 1
nid005424:253797:253899 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005424:253797:253899 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005424:253797:253899 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005424:253797:253899 [1] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005424:253797:253899 [1] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005424:253797:253899 [1] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005424:253797:253899 [1] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005424:253797:253899 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005424:253797:253899 [1] NCCL INFO NET/OFI Creating one domain per process
nid005424:253797:253899 [1] NCCL INFO NET/OFI Support for global registrations: false
nid005424:253797:253899 [1] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005424:253797:253899 [1] NCCL INFO Using network Libfabric
nid005424:253797:253899 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:253797:253899 [1] NCCL INFO ncclCommInitRank comm 0x7d1af90 rank 0 nranks 4 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xdb9dc2f9efbd4b3f - Init START
nid005424:253797:253899 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:253797:253899 [1] NCCL INFO NVLS multicast support is not available on dev 1
nid005424:253797:253899 [1] NCCL INFO comm 0x7d1af90 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
nid005424:253797:253899 [1] NCCL INFO Channel 00/24 :    0   1   3   2
nid005424:253797:253899 [1] NCCL INFO Channel 01/24 :    0   1   2   3
nid005424:253797:253899 [1] NCCL INFO Channel 02/24 :    0   3   2   1
nid005424:253797:253899 [1] NCCL INFO Channel 03/24 :    0   3   1   2
nid005424:253797:253899 [1] NCCL INFO Channel 04/24 :    0   2   1   3
nid005424:253797:253899 [1] NCCL INFO Channel 05/24 :    0   2   3   1
nid005424:253797:253899 [1] NCCL INFO Channel 06/24 :    0   1   3   2
nid005424:253797:253899 [1] NCCL INFO Channel 07/24 :    0   1   2   3
nid005424:253797:253899 [1] NCCL INFO Channel 08/24 :    0   3   2   1
nid005424:253797:253899 [1] NCCL INFO Channel 09/24 :    0   3   1   2
nid005424:253797:253899 [1] NCCL INFO Channel 10/24 :    0   2   1   3
nid005424:253797:253899 [1] NCCL INFO Channel 11/24 :    0   2   3   1
nid005424:253797:253899 [1] NCCL INFO Channel 12/24 :    0   1   3   2
nid005424:253797:253899 [1] NCCL INFO Channel 13/24 :    0   1   2   3
nid005424:253797:253899 [1] NCCL INFO Channel 14/24 :    0   3   2   1
nid005424:253797:253899 [1] NCCL INFO Channel 15/24 :    0   3   1   2
nid005424:253797:253899 [1] NCCL INFO Channel 16/24 :    0   2   1   3
nid005424:253797:253899 [1] NCCL INFO Channel 17/24 :    0   2   3   1
nid005424:253797:253899 [1] NCCL INFO Channel 18/24 :    0   1   3   2
nid005424:253797:253899 [1] NCCL INFO Channel 19/24 :    0   1   2   3
nid005424:253797:253899 [1] NCCL INFO Channel 20/24 :    0   3   2   1
nid005424:253797:253899 [1] NCCL INFO Channel 21/24 :    0   3   1   2
nid005424:253797:253899 [1] NCCL INFO Channel 22/24 :    0   2   1   3
nid005424:253797:253899 [1] NCCL INFO Channel 23/24 :    0   2   3   1
nid005424:253797:253899 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 3/-1/-1->0->-1 [5] 3/-1/-1->0->-1 [6] 3/-1/-1->0->-1 [7] 3/-1/-1->0->-1 [8] 2/-1/-1->0->1 [9] 2/-1/-1->0->1 [10] 2/-1/-1->0->1 [11] 2/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 3/-1/-1->0->-1 [17] 3/-1/-1->0->-1 [18] 3/-1/-1->0->-1 [19] 3/-1/-1->0->-1 [20] 2/-1/-1->0->1 [21] 2/-1/-1->0->1 [22] 2/-1/-1->0->1 [23] 2/-1/-1->0->1
nid005424:253797:253899 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:253797:253899 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005424:253797:253899 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005424:253797:253899 [1] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:253797:253899 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005424:253797:253899 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005424:253797:253899 [1] NCCL INFO ncclCommInitRank comm 0x7d1af90 rank 0 nranks 4 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xdb9dc2f9efbd4b3f - Init COMPLETE
nid005424:253797:253899 [1] NCCL INFO Init timings: rank 0 nranks 4 total 3.29 (kernels 0.07, bootstrap 2.54, allgathers 0.12, topo 0.47, graphs 0.03, connections 0.04, rest 0.02)
nid005424:253797:253925 [1] NCCL INFO Channel 00/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 01/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 06/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 07/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 12/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 13/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 18/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 19/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 04/0 : 0[1] -> 2[3] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 05/0 : 0[1] -> 2[3] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 10/0 : 0[1] -> 2[3] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 11/0 : 0[1] -> 2[3] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 16/0 : 0[1] -> 2[3] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 17/0 : 0[1] -> 2[3] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 22/0 : 0[1] -> 2[3] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 23/0 : 0[1] -> 2[3] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 02/0 : 0[1] -> 3[2] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 03/0 : 0[1] -> 3[2] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 08/0 : 0[1] -> 3[2] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 09/0 : 0[1] -> 3[2] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 14/0 : 0[1] -> 3[2] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 15/0 : 0[1] -> 3[2] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 20/0 : 0[1] -> 3[2] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Channel 21/0 : 0[1] -> 3[2] via P2P/CUMEM
nid005424:253797:253925 [1] NCCL INFO Connected all rings
nid005424:253797:253797 [1] NCCL INFO Comm config Blocking set to 1
nid005424:253797:253941 [1] NCCL INFO Using network Libfabric
nid005424:253797:253941 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:253797:253941 [1] NCCL INFO ncclCommInitRank comm 0x7d61d60 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x8dfe9821e42fb0c9 - Init START
nid005424:253797:253941 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:253797:253941 [1] NCCL INFO comm 0x7d61d60 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:253797:253941 [1] NCCL INFO Channel 00/08 :    0   1
nid005424:253797:253941 [1] NCCL INFO Channel 01/08 :    0   1
nid005424:253797:253941 [1] NCCL INFO Channel 02/08 :    0   1
nid005424:253797:253941 [1] NCCL INFO Channel 03/08 :    0   1
nid005424:253797:253941 [1] NCCL INFO Channel 04/08 :    0   1
nid005424:253797:253941 [1] NCCL INFO Channel 05/08 :    0   1
nid005424:253797:253941 [1] NCCL INFO Channel 06/08 :    0   1
nid005424:253797:253941 [1] NCCL INFO Channel 07/08 :    0   1
nid005424:253797:253941 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:253797:253941 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:253797:253941 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:253797:253941 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:253797:253941 [1] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:253797:253941 [1] NCCL INFO ncclCommInitRank comm 0x7d61d60 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x8dfe9821e42fb0c9 - Init COMPLETE
nid005424:253797:253941 [1] NCCL INFO Init timings: rank 0 nranks 2 total 0.16 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.16, graphs 0.00, connections 0.01, rest 0.00)
nid005424:253797:253950 [1] NCCL INFO Channel 00/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253950 [1] NCCL INFO Channel 01/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253950 [1] NCCL INFO Channel 02/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253950 [1] NCCL INFO Channel 03/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253950 [1] NCCL INFO Channel 04/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253950 [1] NCCL INFO Channel 05/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253950 [1] NCCL INFO Channel 06/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253950 [1] NCCL INFO Channel 07/0 : 0[1] -> 1[0] via P2P/CUMEM
nid005424:253797:253950 [1] NCCL INFO Connected all rings
nid005424:253797:253797 [1] NCCL INFO Comm config Blocking set to 1
nid005424:253797:253955 [1] NCCL INFO Using network Libfabric
nid005424:253797:253955 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:253797:253955 [1] NCCL INFO ncclCommInitRank comm 0x86e9e20 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xecf35a522473b0af - Init START
nid005424:253797:253955 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:253797:253955 [1] NCCL INFO comm 0x86e9e20 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:253797:253955 [1] NCCL INFO Channel 00/08 :    0   1
nid005424:253797:253955 [1] NCCL INFO Channel 01/08 :    0   1
nid005424:253797:253955 [1] NCCL INFO Channel 02/08 :    0   1
nid005424:253797:253955 [1] NCCL INFO Channel 03/08 :    0   1
nid005424:253797:253955 [1] NCCL INFO Channel 04/08 :    0   1
nid005424:253797:253955 [1] NCCL INFO Channel 05/08 :    0   1
nid005424:253797:253955 [1] NCCL INFO Channel 06/08 :    0   1
nid005424:253797:253955 [1] NCCL INFO Channel 07/08 :    0   1
nid005424:253797:253955 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:253797:253955 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:253797:253955 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:253797:253955 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:253797:253955 [1] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:253797:253955 [1] NCCL INFO ncclCommInitRank comm 0x86e9e20 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xecf35a522473b0af - Init COMPLETE
nid005424:253797:253955 [1] NCCL INFO Init timings: rank 0 nranks 2 total 0.31 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.30, graphs 0.00, connections 0.01, rest 0.00)
nid005424:253797:253966 [1] NCCL INFO Channel 00/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005424:253797:253966 [1] NCCL INFO Channel 01/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005424:253797:253966 [1] NCCL INFO Channel 02/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005424:253797:253966 [1] NCCL INFO Channel 03/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005424:253797:253966 [1] NCCL INFO Channel 04/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005424:253797:253966 [1] NCCL INFO Channel 05/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005424:253797:253966 [1] NCCL INFO Channel 06/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005424:253797:253966 [1] NCCL INFO Channel 07/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005424:253797:253966 [1] NCCL INFO Connected all rings
nid005424:253797:253797 [1] NCCL INFO Comm config Blocking set to 1
nid005424:253797:254005 [1] NCCL INFO Using network Libfabric
nid005424:253797:254005 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:253797:254005 [1] NCCL INFO ncclCommInitRank comm 0xa760870 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x79212cb16b67eedd - Init START
nid005424:253797:254005 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:253797:254005 [1] NCCL INFO NVLS multicast support is not available on dev 1
nid005424:253797:254005 [1] NCCL INFO comm 0xa760870 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
nid005424:253797:254005 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 3/-1/-1->1->2 [5] 3/-1/-1->1->2 [6] 3/-1/-1->1->2 [7] 3/-1/-1->1->2 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 3/-1/-1->1->2 [17] 3/-1/-1->1->2 [18] 3/-1/-1->1->2 [19] 3/-1/-1->1->2 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
nid005424:253797:254005 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:253797:254005 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005424:253797:254005 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005424:253797:254005 [1] NCCL INFO ncclCommInitRank comm 0xa760870 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x79212cb16b67eedd - Init COMPLETE
nid005424:253797:254005 [1] NCCL INFO Init timings: rank 1 nranks 4 total 0.41 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.32, graphs 0.03, connections 0.04, rest 0.01)
nid005424:253797:254017 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 02/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 04/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 06/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 08/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 10/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 12/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 14/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 16/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 18/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 20/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 22/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 24/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 26/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 28/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254017 [1] NCCL INFO Channel 30/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 01/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 03/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 05/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 07/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 09/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 11/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 13/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 15/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 17/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 19/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 21/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 23/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 25/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 27/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 29/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:254058 [1] NCCL INFO Channel 31/1 : 1[1] -> 3[3] via P2P/CUMEM
nid005424:253797:253797 [1] NCCL INFO comm 0x7d1af90 rank 0 nranks 4 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005424:253797:253797 [1] NCCL INFO comm 0x86e9e20 rank 0 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005424:253797:253797 [1] NCCL INFO comm 0x7d61d60 rank 0 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005424:253797:253797 [1] NCCL INFO comm 0xa760870 rank 1 nranks 4 cudaDev 1 busId 1901000 - Destroy COMPLETE

./potrf-potrs-4ranks-2.txt

Parameters: m=1 n=10 nrhs=1 mbA=2 nbA=2 mbB=2 nbB=2 mbQ=2 nbQ=2 mbZ=0 nbZ=0ia=3 ja=1 ib=3 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005424:253798:253798 [2] NCCL INFO cudaDriverVersion 12040
nid005424:253798:253798 [2] NCCL INFO Bootstrap : Using nmn0:10.100.48.75<0>
nid005424:253798:253798 [2] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005424:253798:253798 [2] NCCL INFO Comm config Blocking set to 1
nid005424:253798:253898 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005424:253798:253898 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005424:253798:253898 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005424:253798:253898 [2] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005424:253798:253898 [2] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005424:253798:253898 [2] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005424:253798:253898 [2] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005424:253798:253898 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005424:253798:253898 [2] NCCL INFO NET/OFI Creating one domain per process
nid005424:253798:253898 [2] NCCL INFO NET/OFI Support for global registrations: false
nid005424:253798:253898 [2] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005424:253798:253898 [2] NCCL INFO Using network Libfabric
nid005424:253798:253898 [2] NCCL INFO DMA-BUF is available on GPU device 2
nid005424:253798:253898 [2] NCCL INFO ncclCommInitRank comm 0x34a924f0 rank 3 nranks 4 cudaDev 2 nvmlDev 2 busId 2901000 commId 0xdb9dc2f9efbd4b3f - Init START
nid005424:253798:253898 [2] NCCL INFO Setting affinity for GPU 2 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
nid005424:253798:253898 [2] NCCL INFO NVLS multicast support is not available on dev 2
nid005424:253798:253898 [2] NCCL INFO comm 0x34a924f0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
nid005424:253798:253898 [2] NCCL INFO Trees [0] 2/-1/-1->3->1 [1] 2/-1/-1->3->1 [2] 2/-1/-1->3->1 [3] 2/-1/-1->3->1 [4] 1/-1/-1->3->0 [5] 1/-1/-1->3->0 [6] 1/-1/-1->3->0 [7] 1/-1/-1->3->0 [8] -1/-1/-1->3->2 [9] -1/-1/-1->3->2 [10] -1/-1/-1->3->2 [11] -1/-1/-1->3->2 [12] 2/-1/-1->3->1 [13] 2/-1/-1->3->1 [14] 2/-1/-1->3->1 [15] 2/-1/-1->3->1 [16] 1/-1/-1->3->0 [17] 1/-1/-1->3->0 [18] 1/-1/-1->3->0 [19] 1/-1/-1->3->0 [20] -1/-1/-1->3->2 [21] -1/-1/-1->3->2 [22] -1/-1/-1->3->2 [23] -1/-1/-1->3->2
nid005424:253798:253898 [2] NCCL INFO P2P Chunksize set to 524288
nid005424:253798:253898 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005424:253798:253898 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005424:253798:253898 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005424:253798:253898 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005424:253798:253898 [2] NCCL INFO ncclCommInitRank comm 0x34a924f0 rank 3 nranks 4 cudaDev 2 nvmlDev 2 busId 2901000 commId 0xdb9dc2f9efbd4b3f - Init COMPLETE
nid005424:253798:253898 [2] NCCL INFO Init timings: rank 3 nranks 4 total 3.31 (kernels 0.09, bootstrap 2.54, allgathers 0.12, topo 0.47, graphs 0.03, connections 0.04, rest 0.02)
nid005424:253798:253926 [2] NCCL INFO Channel 01/0 : 3[2] -> 0[1] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 04/0 : 3[2] -> 0[1] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 07/0 : 3[2] -> 0[1] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 10/0 : 3[2] -> 0[1] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 13/0 : 3[2] -> 0[1] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 16/0 : 3[2] -> 0[1] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 19/0 : 3[2] -> 0[1] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 22/0 : 3[2] -> 0[1] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 03/0 : 3[2] -> 1[0] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 05/0 : 3[2] -> 1[0] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 09/0 : 3[2] -> 1[0] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 11/0 : 3[2] -> 1[0] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 15/0 : 3[2] -> 1[0] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 17/0 : 3[2] -> 1[0] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 21/0 : 3[2] -> 1[0] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 23/0 : 3[2] -> 1[0] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 00/0 : 3[2] -> 2[3] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 02/0 : 3[2] -> 2[3] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 06/0 : 3[2] -> 2[3] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 08/0 : 3[2] -> 2[3] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 12/0 : 3[2] -> 2[3] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 14/0 : 3[2] -> 2[3] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 18/0 : 3[2] -> 2[3] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Channel 20/0 : 3[2] -> 2[3] via P2P/CUMEM
nid005424:253798:253926 [2] NCCL INFO Connected all rings
nid005424:253798:253798 [2] NCCL INFO Comm config Blocking set to 1
nid005424:253798:253943 [2] NCCL INFO Using network Libfabric
nid005424:253798:253943 [2] NCCL INFO DMA-BUF is available on GPU device 2
nid005424:253798:253943 [2] NCCL INFO ncclCommInitRank comm 0x34ad9150 rank 1 nranks 2 cudaDev 2 nvmlDev 2 busId 2901000 commId 0xfb66864debd73870 - Init START
nid005424:253798:253943 [2] NCCL INFO Setting affinity for GPU 2 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
nid005424:253798:253943 [2] NCCL INFO comm 0x34ad9150 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:253798:253943 [2] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:253798:253943 [2] NCCL INFO P2P Chunksize set to 524288
nid005424:253798:253943 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:253798:253943 [2] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:253798:253943 [2] NCCL INFO ncclCommInitRank comm 0x34ad9150 rank 1 nranks 2 cudaDev 2 nvmlDev 2 busId 2901000 commId 0xfb66864debd73870 - Init COMPLETE
nid005424:253798:253943 [2] NCCL INFO Init timings: rank 1 nranks 2 total 1.13 (kernels 0.00, bootstrap 0.80, allgathers 0.00, topo 0.31, graphs 0.00, connections 0.02, rest 0.00)
nid005424:253798:253965 [2] NCCL INFO Channel 00/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005424:253798:253965 [2] NCCL INFO Channel 01/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005424:253798:253965 [2] NCCL INFO Channel 02/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005424:253798:253965 [2] NCCL INFO Channel 03/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005424:253798:253965 [2] NCCL INFO Channel 04/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005424:253798:253965 [2] NCCL INFO Channel 05/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005424:253798:253965 [2] NCCL INFO Channel 06/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005424:253798:253965 [2] NCCL INFO Channel 07/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005424:253798:253965 [2] NCCL INFO Connected all rings
nid005424:253798:253798 [2] NCCL INFO Comm config Blocking set to 1
nid005424:253798:253991 [2] NCCL INFO Using network Libfabric
nid005424:253798:253991 [2] NCCL INFO DMA-BUF is available on GPU device 2
nid005424:253798:253991 [2] NCCL INFO ncclCommInitRank comm 0x34b28f80 rank 1 nranks 2 cudaDev 2 nvmlDev 2 busId 2901000 commId 0x3d8a4024a17bfc8e - Init START
nid005424:253798:253991 [2] NCCL INFO Setting affinity for GPU 2 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
nid005424:253798:253991 [2] NCCL INFO comm 0x34b28f80 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:253798:253991 [2] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:253798:253991 [2] NCCL INFO P2P Chunksize set to 524288
nid005424:253798:253991 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:253798:253991 [2] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:253798:253991 [2] NCCL INFO ncclCommInitRank comm 0x34b28f80 rank 1 nranks 2 cudaDev 2 nvmlDev 2 busId 2901000 commId 0x3d8a4024a17bfc8e - Init COMPLETE
nid005424:253798:253991 [2] NCCL INFO Init timings: rank 1 nranks 2 total 0.31 (kernels 0.00, bootstrap 0.14, allgathers 0.00, topo 0.16, graphs 0.00, connections 0.01, rest 0.00)
nid005424:253798:254000 [2] NCCL INFO Channel 00/0 : 1[2] -> 0[3] via P2P/CUMEM
nid005424:253798:254000 [2] NCCL INFO Channel 01/0 : 1[2] -> 0[3] via P2P/CUMEM
nid005424:253798:254000 [2] NCCL INFO Channel 02/0 : 1[2] -> 0[3] via P2P/CUMEM
nid005424:253798:254000 [2] NCCL INFO Channel 03/0 : 1[2] -> 0[3] via P2P/CUMEM
nid005424:253798:254000 [2] NCCL INFO Channel 04/0 : 1[2] -> 0[3] via P2P/CUMEM
nid005424:253798:254000 [2] NCCL INFO Channel 05/0 : 1[2] -> 0[3] via P2P/CUMEM
nid005424:253798:254000 [2] NCCL INFO Channel 06/0 : 1[2] -> 0[3] via P2P/CUMEM
nid005424:253798:254000 [2] NCCL INFO Channel 07/0 : 1[2] -> 0[3] via P2P/CUMEM
nid005424:253798:254000 [2] NCCL INFO Connected all rings
nid005424:253798:253798 [2] NCCL INFO Comm config Blocking set to 1
nid005424:253798:254004 [2] NCCL INFO Using network Libfabric
nid005424:253798:254004 [2] NCCL INFO DMA-BUF is available on GPU device 2
nid005424:253798:254004 [2] NCCL INFO ncclCommInitRank comm 0x37f88490 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 2901000 commId 0x79212cb16b67eedd - Init START
nid005424:253798:254004 [2] NCCL INFO Setting affinity for GPU 2 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
nid005424:253798:254004 [2] NCCL INFO NVLS multicast support is not available on dev 2
nid005424:253798:254004 [2] NCCL INFO comm 0x37f88490 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
nid005424:253798:254004 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 1/-1/-1->2->0 [5] 1/-1/-1->2->0 [6] 1/-1/-1->2->0 [7] 1/-1/-1->2->0 [8] -1/-1/-1->2->3 [9] -1/-1/-1->2->3 [10] -1/-1/-1->2->3 [11] -1/-1/-1->2->3 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 1/-1/-1->2->0 [17] 1/-1/-1->2->0 [18] 1/-1/-1->2->0 [19] 1/-1/-1->2->0 [20] -1/-1/-1->2->3 [21] -1/-1/-1->2->3 [22] -1/-1/-1->2->3 [23] -1/-1/-1->2->3
nid005424:253798:254004 [2] NCCL INFO P2P Chunksize set to 524288
nid005424:253798:254004 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005424:253798:254004 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005424:253798:254004 [2] NCCL INFO ncclCommInitRank comm 0x37f88490 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 2901000 commId 0x79212cb16b67eedd - Init COMPLETE
nid005424:253798:254004 [2] NCCL INFO Init timings: rank 2 nranks 4 total 0.41 (kernels 0.00, bootstrap 0.01, allgathers 0.00, topo 0.32, graphs 0.03, connections 0.04, rest 0.01)
nid005424:253798:254027 [2] NCCL INFO Channel 01/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 03/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 05/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 07/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 09/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 11/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 13/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 15/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 17/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 19/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 21/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 23/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 25/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 27/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 29/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254027 [2] NCCL INFO Channel 31/1 : 2[2] -> 3[3] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 01/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 03/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 05/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 07/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 09/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 11/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 13/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 15/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 17/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 19/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 21/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 23/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 25/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 27/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 29/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:254031 [2] NCCL INFO Channel 31/1 : 2[2] -> 0[0] via P2P/CUMEM
nid005424:253798:253798 [2] NCCL INFO comm 0x34a924f0 rank 3 nranks 4 cudaDev 2 busId 2901000 - Destroy COMPLETE
nid005424:253798:253798 [2] NCCL INFO comm 0x34ad9150 rank 1 nranks 2 cudaDev 2 busId 2901000 - Destroy COMPLETE
nid005424:253798:253798 [2] NCCL INFO comm 0x34b28f80 rank 1 nranks 2 cudaDev 2 busId 2901000 - Destroy COMPLETE
nid005424:253798:253798 [2] NCCL INFO comm 0x37f88490 rank 2 nranks 4 cudaDev 2 busId 2901000 - Destroy COMPLETE

./potrf-potrs-4ranks-3.txt

Parameters: m=1 n=10 nrhs=1 mbA=2 nbA=2 mbB=2 nbB=2 mbQ=2 nbQ=2 mbZ=0 nbZ=0ia=3 ja=1 ib=3 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005424:253799:253799 [3] NCCL INFO Bootstrap : Using nmn0:10.100.48.75<0>
nid005424:253799:253799 [3] NCCL INFO cudaDriverVersion 12040
nid005424:253799:253799 [3] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005424:253799:253799 [3] NCCL INFO Comm config Blocking set to 1
nid005424:253799:253897 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005424:253799:253897 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005424:253799:253897 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005424:253799:253897 [3] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005424:253799:253897 [3] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005424:253799:253897 [3] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005424:253799:253897 [3] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005424:253799:253897 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005424:253799:253897 [3] NCCL INFO NET/OFI Creating one domain per process
nid005424:253799:253897 [3] NCCL INFO NET/OFI Support for global registrations: false
nid005424:253799:253897 [3] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005424:253799:253897 [3] NCCL INFO Using network Libfabric
nid005424:253799:253897 [3] NCCL INFO DMA-BUF is available on GPU device 3
nid005424:253799:253897 [3] NCCL INFO ncclCommInitRank comm 0x83c25b0 rank 2 nranks 4 cudaDev 3 nvmlDev 3 busId 3901000 commId 0xdb9dc2f9efbd4b3f - Init START
nid005424:253799:253897 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ff000000,00000000,00000000,00000000,00000000,00000000,00000000
nid005424:253799:253897 [3] NCCL INFO NVLS multicast support is not available on dev 3
nid005424:253799:253897 [3] NCCL INFO comm 0x83c25b0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
nid005424:253799:253897 [3] NCCL INFO Trees [0] -1/-1/-1->2->3 [1] -1/-1/-1->2->3 [2] -1/-1/-1->2->3 [3] -1/-1/-1->2->3 [4] -1/-1/-1->2->1 [5] -1/-1/-1->2->1 [6] -1/-1/-1->2->1 [7] -1/-1/-1->2->1 [8] 3/-1/-1->2->0 [9] 3/-1/-1->2->0 [10] 3/-1/-1->2->0 [11] 3/-1/-1->2->0 [12] -1/-1/-1->2->3 [13] -1/-1/-1->2->3 [14] -1/-1/-1->2->3 [15] -1/-1/-1->2->3 [16] -1/-1/-1->2->1 [17] -1/-1/-1->2->1 [18] -1/-1/-1->2->1 [19] -1/-1/-1->2->1 [20] 3/-1/-1->2->0 [21] 3/-1/-1->2->0 [22] 3/-1/-1->2->0 [23] 3/-1/-1->2->0
nid005424:253799:253897 [3] NCCL INFO P2P Chunksize set to 524288
nid005424:253799:253897 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005424:253799:253897 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005424:253799:253897 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005424:253799:253897 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005424:253799:253897 [3] NCCL INFO ncclCommInitRank comm 0x83c25b0 rank 2 nranks 4 cudaDev 3 nvmlDev 3 busId 3901000 commId 0xdb9dc2f9efbd4b3f - Init COMPLETE
nid005424:253799:253897 [3] NCCL INFO Init timings: rank 2 nranks 4 total 3.31 (kernels 0.09, bootstrap 2.54, allgathers 0.12, topo 0.47, graphs 0.03, connections 0.04, rest 0.02)
nid005424:253799:253927 [3] NCCL INFO Channel 01/0 : 2[3] -> 3[2] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 05/0 : 2[3] -> 3[2] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 07/0 : 2[3] -> 3[2] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 11/0 : 2[3] -> 3[2] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 13/0 : 2[3] -> 3[2] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 17/0 : 2[3] -> 3[2] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 19/0 : 2[3] -> 3[2] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 23/0 : 2[3] -> 3[2] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 00/0 : 2[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 03/0 : 2[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 06/0 : 2[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 09/0 : 2[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 12/0 : 2[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 15/0 : 2[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 18/0 : 2[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 21/0 : 2[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 02/0 : 2[3] -> 1[0] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 04/0 : 2[3] -> 1[0] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 08/0 : 2[3] -> 1[0] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 10/0 : 2[3] -> 1[0] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 14/0 : 2[3] -> 1[0] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 16/0 : 2[3] -> 1[0] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 20/0 : 2[3] -> 1[0] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Channel 22/0 : 2[3] -> 1[0] via P2P/CUMEM
nid005424:253799:253927 [3] NCCL INFO Connected all rings
nid005424:253799:253799 [3] NCCL INFO Comm config Blocking set to 1
nid005424:253799:253944 [3] NCCL INFO Using network Libfabric
nid005424:253799:253944 [3] NCCL INFO DMA-BUF is available on GPU device 3
nid005424:253799:253944 [3] NCCL INFO ncclCommInitRank comm 0x8409210 rank 1 nranks 2 cudaDev 3 nvmlDev 3 busId 3901000 commId 0xecf35a522473b0af - Init START
nid005424:253799:253944 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ff000000,00000000,00000000,00000000,00000000,00000000,00000000
nid005424:253799:253944 [3] NCCL INFO comm 0x8409210 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:253799:253944 [3] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:253799:253944 [3] NCCL INFO P2P Chunksize set to 524288
nid005424:253799:253944 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:253799:253944 [3] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:253799:253944 [3] NCCL INFO ncclCommInitRank comm 0x8409210 rank 1 nranks 2 cudaDev 3 nvmlDev 3 busId 3901000 commId 0xecf35a522473b0af - Init COMPLETE
nid005424:253799:253944 [3] NCCL INFO Init timings: rank 1 nranks 2 total 1.14 (kernels 0.00, bootstrap 0.83, allgathers 0.00, topo 0.30, graphs 0.00, connections 0.01, rest 0.00)
nid005424:253799:253967 [3] NCCL INFO Channel 00/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253967 [3] NCCL INFO Channel 01/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253967 [3] NCCL INFO Channel 02/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253967 [3] NCCL INFO Channel 03/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253967 [3] NCCL INFO Channel 04/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253967 [3] NCCL INFO Channel 05/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253967 [3] NCCL INFO Channel 06/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253967 [3] NCCL INFO Channel 07/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005424:253799:253967 [3] NCCL INFO Connected all rings
nid005424:253799:253799 [3] NCCL INFO Comm config Blocking set to 1
nid005424:253799:253995 [3] NCCL INFO Using network Libfabric
nid005424:253799:253995 [3] NCCL INFO DMA-BUF is available on GPU device 3
nid005424:253799:253995 [3] NCCL INFO ncclCommInitRank comm 0x8459120 rank 0 nranks 2 cudaDev 3 nvmlDev 3 busId 3901000 commId 0x3d8a4024a17bfc8e - Init START
nid005424:253799:253995 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ff000000,00000000,00000000,00000000,00000000,00000000,00000000
nid005424:253799:253995 [3] NCCL INFO comm 0x8459120 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:253799:253995 [3] NCCL INFO Channel 00/08 :    0   1
nid005424:253799:253995 [3] NCCL INFO Channel 01/08 :    0   1
nid005424:253799:253995 [3] NCCL INFO Channel 02/08 :    0   1
nid005424:253799:253995 [3] NCCL INFO Channel 03/08 :    0   1
nid005424:253799:253995 [3] NCCL INFO Channel 04/08 :    0   1
nid005424:253799:253995 [3] NCCL INFO Channel 05/08 :    0   1
nid005424:253799:253995 [3] NCCL INFO Channel 06/08 :    0   1
nid005424:253799:253995 [3] NCCL INFO Channel 07/08 :    0   1
nid005424:253799:253995 [3] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:253799:253995 [3] NCCL INFO P2P Chunksize set to 524288
nid005424:253799:253995 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:253799:253995 [3] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:253799:253995 [3] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:253799:253995 [3] NCCL INFO ncclCommInitRank comm 0x8459120 rank 0 nranks 2 cudaDev 3 nvmlDev 3 busId 3901000 commId 0x3d8a4024a17bfc8e - Init COMPLETE
nid005424:253799:253995 [3] NCCL INFO Init timings: rank 0 nranks 2 total 0.17 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.16, graphs 0.00, connections 0.01, rest 0.00)
nid005424:253799:254001 [3] NCCL INFO Channel 00/0 : 0[3] -> 1[2] via P2P/CUMEM
nid005424:253799:254001 [3] NCCL INFO Channel 01/0 : 0[3] -> 1[2] via P2P/CUMEM
nid005424:253799:254001 [3] NCCL INFO Channel 02/0 : 0[3] -> 1[2] via P2P/CUMEM
nid005424:253799:254001 [3] NCCL INFO Channel 03/0 : 0[3] -> 1[2] via P2P/CUMEM
nid005424:253799:254001 [3] NCCL INFO Channel 04/0 : 0[3] -> 1[2] via P2P/CUMEM
nid005424:253799:254001 [3] NCCL INFO Channel 05/0 : 0[3] -> 1[2] via P2P/CUMEM
nid005424:253799:254001 [3] NCCL INFO Channel 06/0 : 0[3] -> 1[2] via P2P/CUMEM
nid005424:253799:254001 [3] NCCL INFO Channel 07/0 : 0[3] -> 1[2] via P2P/CUMEM
nid005424:253799:254001 [3] NCCL INFO Connected all rings
nid005424:253799:253799 [3] NCCL INFO Comm config Blocking set to 1
nid005424:253799:254002 [3] NCCL INFO Using network Libfabric
nid005424:253799:254002 [3] NCCL INFO DMA-BUF is available on GPU device 3
nid005424:253799:254002 [3] NCCL INFO ncclCommInitRank comm 0xa9beff0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 3901000 commId 0x79212cb16b67eedd - Init START
nid005424:253799:254002 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ff000000,00000000,00000000,00000000,00000000,00000000,00000000
nid005424:253799:254002 [3] NCCL INFO NVLS multicast support is not available on dev 3
nid005424:253799:254002 [3] NCCL INFO comm 0xa9beff0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
nid005424:253799:254002 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] -1/-1/-1->3->1 [5] -1/-1/-1->3->1 [6] -1/-1/-1->3->1 [7] -1/-1/-1->3->1 [8] 2/-1/-1->3->0 [9] 2/-1/-1->3->0 [10] 2/-1/-1->3->0 [11] 2/-1/-1->3->0 [12] -1/-1/-1->3->2 [13] -1/-1/-1->3->2 [14] -1/-1/-1->3->2 [15] -1/-1/-1->3->2 [16] -1/-1/-1->3->1 [17] -1/-1/-1->3->1 [18] -1/-1/-1->3->1 [19] -1/-1/-1->3->1 [20] 2/-1/-1->3->0 [21] 2/-1/-1->3->0 [22] 2/-1/-1->3->0 [23] 2/-1/-1->3->0
nid005424:253799:254002 [3] NCCL INFO P2P Chunksize set to 524288
nid005424:253799:254002 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005424:253799:254002 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005424:253799:254002 [3] NCCL INFO ncclCommInitRank comm 0xa9beff0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 3901000 commId 0x79212cb16b67eedd - Init COMPLETE
nid005424:253799:254002 [3] NCCL INFO Init timings: rank 3 nranks 4 total 0.41 (kernels 0.00, bootstrap 0.01, allgathers 0.00, topo 0.32, graphs 0.03, connections 0.04, rest 0.01)
nid005424:253799:254032 [3] NCCL INFO Channel 01/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 03/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 05/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 07/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 09/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 11/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 13/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 15/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 17/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 19/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 21/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 23/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 25/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 27/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 29/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254032 [3] NCCL INFO Channel 31/1 : 3[3] -> 1[1] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 00/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 02/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 04/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 06/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 08/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 10/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 12/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 14/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 16/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 18/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 20/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 22/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 24/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 26/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 28/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:254064 [3] NCCL INFO Channel 30/1 : 3[3] -> 2[2] via P2P/CUMEM
nid005424:253799:253799 [3] NCCL INFO comm 0x83c25b0 rank 2 nranks 4 cudaDev 3 busId 3901000 - Destroy COMPLETE
nid005424:253799:253799 [3] NCCL INFO comm 0x8409210 rank 1 nranks 2 cudaDev 3 busId 3901000 - Destroy COMPLETE
nid005424:253799:253799 [3] NCCL INFO comm 0x8459120 rank 0 nranks 2 cudaDev 3 busId 3901000 - Destroy COMPLETE
nid005424:253799:253799 [3] NCCL INFO comm 0xa9beff0 rank 3 nranks 4 cudaDev 3 busId 3901000 - Destroy COMPLETE

SYEVD 2x1 (default)

NCCL_DEBUG=info srun -n2 -u -o syevd-2ranks-%t.txt ./mp_syevd
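
With 2 ranks the grid can only be 2x1 (the default used here) or 1x2 (tried in the next section), since p * q must equal the number of ranks. For reference, a minimal sketch of how a linear rank would land on a p x q grid under a column-major mapping (an assumption for illustration only, not taken from the cuSolverMP sources; the sample's grid_layout parameter, left empty in the Parameters lines above, selects the actual convention):

# hedged illustration only: column-major rank -> (row, col) on a p x q grid
p=2; q=1
for rank in $(seq 0 $((p * q - 1))); do
  echo "rank $rank -> (row $((rank % p)), col $((rank / p)))"
done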

./syevd-2ranks-0.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=1 grid_layout= verbose=0
nid005424:254823:254823 [0] NCCL INFO Bootstrap : Using nmn0:10.100.48.75<0>
nid005424:254823:254823 [0] NCCL INFO cudaDriverVersion 12040
nid005424:254823:254823 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005424:254823:254823 [0] NCCL INFO Comm config Blocking set to 1
nid005424:254823:254877 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005424:254823:254877 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005424:254823:254877 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005424:254823:254877 [0] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005424:254823:254877 [0] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005424:254823:254877 [0] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005424:254823:254877 [0] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005424:254823:254877 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005424:254823:254877 [0] NCCL INFO NET/OFI Creating one domain per process
nid005424:254823:254877 [0] NCCL INFO NET/OFI Support for global registrations: false
nid005424:254823:254877 [0] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005424:254823:254877 [0] NCCL INFO Using network Libfabric
nid005424:254823:254877 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:254823:254877 [0] NCCL INFO ncclCommInitRank comm 0x329a2250 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x65403952c23cf4e0 - Init START
nid005424:254823:254877 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:254823:254877 [0] NCCL INFO comm 0x329a2250 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:254823:254877 [0] NCCL INFO Channel 00/08 :    0   1
nid005424:254823:254877 [0] NCCL INFO Channel 01/08 :    0   1
nid005424:254823:254877 [0] NCCL INFO Channel 02/08 :    0   1
nid005424:254823:254877 [0] NCCL INFO Channel 03/08 :    0   1
nid005424:254823:254877 [0] NCCL INFO Channel 04/08 :    0   1
nid005424:254823:254877 [0] NCCL INFO Channel 05/08 :    0   1
nid005424:254823:254877 [0] NCCL INFO Channel 06/08 :    0   1
nid005424:254823:254877 [0] NCCL INFO Channel 07/08 :    0   1
nid005424:254823:254877 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:254823:254877 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:254823:254877 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254823:254877 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254823:254877 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:254823:254877 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005424:254823:254877 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005424:254823:254877 [0] NCCL INFO ncclCommInitRank comm 0x329a2250 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x65403952c23cf4e0 - Init COMPLETE
nid005424:254823:254877 [0] NCCL INFO Init timings: rank 0 nranks 2 total 1.95 (kernels 0.81, bootstrap 0.97, allgathers 0.00, topo 0.16, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254823:254892 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254892 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254892 [0] NCCL INFO Channel 02/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254892 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254892 [0] NCCL INFO Channel 04/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254892 [0] NCCL INFO Channel 05/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254892 [0] NCCL INFO Channel 06/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254892 [0] NCCL INFO Channel 07/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254823 [0] NCCL INFO Comm config Blocking set to 1
nid005424:254823:254898 [0] NCCL INFO Using network Libfabric
nid005424:254823:254898 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:254823:254898 [0] NCCL INFO ncclCommInitRank comm 0x329eb8c0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0xce799d9015b65350 - Init START
nid005424:254823:254898 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:254823:254898 [0] NCCL INFO comm 0x329eb8c0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:254823:254898 [0] NCCL INFO Channel 00/08 :    0   1
nid005424:254823:254898 [0] NCCL INFO Channel 01/08 :    0   1
nid005424:254823:254898 [0] NCCL INFO Channel 02/08 :    0   1
nid005424:254823:254898 [0] NCCL INFO Channel 03/08 :    0   1
nid005424:254823:254898 [0] NCCL INFO Channel 04/08 :    0   1
nid005424:254823:254898 [0] NCCL INFO Channel 05/08 :    0   1
nid005424:254823:254898 [0] NCCL INFO Channel 06/08 :    0   1
nid005424:254823:254898 [0] NCCL INFO Channel 07/08 :    0   1
nid005424:254823:254898 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:254823:254898 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:254823:254898 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254823:254898 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254823:254898 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:254823:254898 [0] NCCL INFO ncclCommInitRank comm 0x329eb8c0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0xce799d9015b65350 - Init COMPLETE
nid005424:254823:254898 [0] NCCL INFO Init timings: rank 0 nranks 2 total 0.27 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254823:254904 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254904 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254904 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254904 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254904 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254904 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254904 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254904 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254904 [0] NCCL INFO Connected all rings
nid005424:254823:254823 [0] NCCL INFO Comm config Blocking set to 1
nid005424:254823:254908 [0] NCCL INFO Using network Libfabric
nid005424:254823:254908 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:254823:254908 [0] NCCL INFO ncclCommInitRank comm 0x39afc2c0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0xac5db06efa00736 - Init START
nid005424:254823:254908 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:254823:254908 [0] NCCL INFO comm 0x39afc2c0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:254823:254908 [0] NCCL INFO Channel 00/08 :    0   1
nid005424:254823:254908 [0] NCCL INFO Channel 01/08 :    0   1
nid005424:254823:254908 [0] NCCL INFO Channel 02/08 :    0   1
nid005424:254823:254908 [0] NCCL INFO Channel 03/08 :    0   1
nid005424:254823:254908 [0] NCCL INFO Channel 04/08 :    0   1
nid005424:254823:254908 [0] NCCL INFO Channel 05/08 :    0   1
nid005424:254823:254908 [0] NCCL INFO Channel 06/08 :    0   1
nid005424:254823:254908 [0] NCCL INFO Channel 07/08 :    0   1
nid005424:254823:254908 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:254823:254908 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:254823:254908 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254823:254908 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254823:254908 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:254823:254908 [0] NCCL INFO ncclCommInitRank comm 0x39afc2c0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0xac5db06efa00736 - Init COMPLETE
nid005424:254823:254908 [0] NCCL INFO Init timings: rank 0 nranks 2 total 0.16 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.16, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254823:254914 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254914 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254914 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254914 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254914 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254914 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254914 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254914 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:254823:254914 [0] NCCL INFO Connected all rings
nid005424:254823:254823 [0] NCCL INFO comm 0x39afc2c0 rank 0 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005424:254823:254823 [0] NCCL INFO comm 0x329eb8c0 rank 0 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005424:254823:254823 [0] NCCL INFO comm 0x329a2250 rank 0 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE

./syevd-2ranks-1.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=1 grid_layout= verbose=0
nid005424:254824:254824 [1] NCCL INFO cudaDriverVersion 12040
nid005424:254824:254824 [1] NCCL INFO Bootstrap : Using nmn0:10.100.48.75<0>
nid005424:254824:254824 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005424:254824:254824 [1] NCCL INFO Comm config Blocking set to 1
nid005424:254824:254874 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005424:254824:254874 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005424:254824:254874 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005424:254824:254874 [1] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005424:254824:254874 [1] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005424:254824:254874 [1] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005424:254824:254874 [1] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005424:254824:254874 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005424:254824:254874 [1] NCCL INFO NET/OFI Creating one domain per process
nid005424:254824:254874 [1] NCCL INFO NET/OFI Support for global registrations: false
nid005424:254824:254874 [1] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005424:254824:254874 [1] NCCL INFO Using network Libfabric
nid005424:254824:254874 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:254824:254874 [1] NCCL INFO ncclCommInitRank comm 0x279856d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x65403952c23cf4e0 - Init START
nid005424:254824:254874 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:254824:254874 [1] NCCL INFO comm 0x279856d0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:254824:254874 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:254824:254874 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:254824:254874 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254824:254874 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254824:254874 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005424:254824:254874 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005424:254824:254874 [1] NCCL INFO ncclCommInitRank comm 0x279856d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x65403952c23cf4e0 - Init COMPLETE
nid005424:254824:254874 [1] NCCL INFO Init timings: rank 1 nranks 2 total 2.25 (kernels 0.10, bootstrap 1.98, allgathers 0.00, topo 0.16, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254824:254891 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254891 [1] NCCL INFO Channel 01/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254891 [1] NCCL INFO Channel 02/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254891 [1] NCCL INFO Channel 03/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254891 [1] NCCL INFO Channel 04/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254891 [1] NCCL INFO Channel 05/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254891 [1] NCCL INFO Channel 06/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254891 [1] NCCL INFO Channel 07/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254824 [1] NCCL INFO Comm config Blocking set to 1
nid005424:254824:254897 [1] NCCL INFO Using network Libfabric
nid005424:254824:254897 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:254824:254897 [1] NCCL INFO ncclCommInitRank comm 0x27ce8630 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xce799d9015b65350 - Init START
nid005424:254824:254897 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:254824:254897 [1] NCCL INFO comm 0x27ce8630 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:254824:254897 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:254824:254897 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:254824:254897 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254824:254897 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254824:254897 [1] NCCL INFO ncclCommInitRank comm 0x27ce8630 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xce799d9015b65350 - Init COMPLETE
nid005424:254824:254897 [1] NCCL INFO Init timings: rank 1 nranks 2 total 0.27 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254824:254903 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254903 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254903 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254903 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254903 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254903 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254903 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254903 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254903 [1] NCCL INFO Connected all rings
nid005424:254824:254824 [1] NCCL INFO Comm config Blocking set to 1
nid005424:254824:254909 [1] NCCL INFO Using network Libfabric
nid005424:254824:254909 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:254824:254909 [1] NCCL INFO ncclCommInitRank comm 0x2f0b02d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xac5db06efa00736 - Init START
nid005424:254824:254909 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:254824:254909 [1] NCCL INFO comm 0x2f0b02d0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:254824:254909 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:254824:254909 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:254824:254909 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:254824:254909 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:254824:254909 [1] NCCL INFO ncclCommInitRank comm 0x2f0b02d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xac5db06efa00736 - Init COMPLETE
nid005424:254824:254909 [1] NCCL INFO Init timings: rank 1 nranks 2 total 0.16 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.16, graphs 0.00, connections 0.01, rest 0.00)
nid005424:254824:254915 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254915 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254915 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254915 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254915 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254915 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254915 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254915 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:254824:254915 [1] NCCL INFO Connected all rings
nid005424:254824:254824 [1] NCCL INFO comm 0x2f0b02d0 rank 1 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005424:254824:254824 [1] NCCL INFO comm 0x27ce8630 rank 1 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005424:254824:254824 [1] NCCL INFO comm 0x279856d0 rank 1 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE

SYEVD 1x2

NCCL_DEBUG=info srun -n2 -u -o syevd-1x2ranks-%t.txt ./mp_syevd -p 1 -q 2
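
Same binary, with the grid transposed to one row by two columns. To compare how the grid shape affects NCCL setup time across these runs, the per-rank init timings can be pulled out of the log files afterwards (a convenience grep over the output files named above, not part of the original test):

grep 'Init timings' syevd-*ranks-*.txt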

./syevd-1x2ranks-0.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=1 q=2 grid_layout= verbose=0
nid005424:255323:255323 [0] NCCL INFO Bootstrap : Using nmn0:10.100.48.75<0>
nid005424:255323:255323 [0] NCCL INFO cudaDriverVersion 12040
nid005424:255323:255323 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005424:255323:255323 [0] NCCL INFO Comm config Blocking set to 1
nid005424:255323:255417 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005424:255323:255417 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005424:255323:255417 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005424:255323:255417 [0] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005424:255323:255417 [0] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005424:255323:255417 [0] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005424:255323:255417 [0] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005424:255323:255417 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005424:255323:255417 [0] NCCL INFO NET/OFI Creating one domain per process
nid005424:255323:255417 [0] NCCL INFO NET/OFI Support for global registrations: false
nid005424:255323:255417 [0] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005424:255323:255417 [0] NCCL INFO Using network Libfabric
nid005424:255323:255417 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:255323:255417 [0] NCCL INFO ncclCommInitRank comm 0x3f981d80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x836a94f099eab5c9 - Init START
nid005424:255323:255417 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:255323:255417 [0] NCCL INFO comm 0x3f981d80 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:255323:255417 [0] NCCL INFO Channel 00/08 :    0   1
nid005424:255323:255417 [0] NCCL INFO Channel 01/08 :    0   1
nid005424:255323:255417 [0] NCCL INFO Channel 02/08 :    0   1
nid005424:255323:255417 [0] NCCL INFO Channel 03/08 :    0   1
nid005424:255323:255417 [0] NCCL INFO Channel 04/08 :    0   1
nid005424:255323:255417 [0] NCCL INFO Channel 05/08 :    0   1
nid005424:255323:255417 [0] NCCL INFO Channel 06/08 :    0   1
nid005424:255323:255417 [0] NCCL INFO Channel 07/08 :    0   1
nid005424:255323:255417 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:255323:255417 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:255323:255417 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:255323:255417 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:255323:255417 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:255323:255417 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005424:255323:255417 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005424:255323:255417 [0] NCCL INFO ncclCommInitRank comm 0x3f981d80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x836a94f099eab5c9 - Init COMPLETE
nid005424:255323:255417 [0] NCCL INFO Init timings: rank 0 nranks 2 total 2.16 (kernels 0.09, bootstrap 1.91, allgathers 0.00, topo 0.15, graphs 0.00, connections 0.01, rest 0.00)
nid005424:255323:255433 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255433 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255433 [0] NCCL INFO Channel 02/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255433 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255433 [0] NCCL INFO Channel 04/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255433 [0] NCCL INFO Channel 05/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255433 [0] NCCL INFO Channel 06/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255433 [0] NCCL INFO Channel 07/1 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255323 [0] NCCL INFO Comm config Blocking set to 1
nid005424:255323:255441 [0] NCCL INFO Using network Libfabric
nid005424:255323:255441 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:255323:255441 [0] NCCL INFO ncclCommInitRank comm 0x41da66a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x59fd6880ba6ab252 - Init START
nid005424:255323:255441 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:255323:255441 [0] NCCL INFO comm 0x41da66a0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:255323:255441 [0] NCCL INFO Channel 00/08 :    0   1
nid005424:255323:255441 [0] NCCL INFO Channel 01/08 :    0   1
nid005424:255323:255441 [0] NCCL INFO Channel 02/08 :    0   1
nid005424:255323:255441 [0] NCCL INFO Channel 03/08 :    0   1
nid005424:255323:255441 [0] NCCL INFO Channel 04/08 :    0   1
nid005424:255323:255441 [0] NCCL INFO Channel 05/08 :    0   1
nid005424:255323:255441 [0] NCCL INFO Channel 06/08 :    0   1
nid005424:255323:255441 [0] NCCL INFO Channel 07/08 :    0   1
nid005424:255323:255441 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:255323:255441 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:255323:255441 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:255323:255441 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:255323:255441 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:255323:255441 [0] NCCL INFO ncclCommInitRank comm 0x41da66a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x59fd6880ba6ab252 - Init COMPLETE
nid005424:255323:255441 [0] NCCL INFO Init timings: rank 0 nranks 2 total 0.26 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.01, rest 0.00)
nid005424:255323:255449 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255449 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255449 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255449 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255449 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255449 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255449 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255449 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255449 [0] NCCL INFO Connected all rings
nid005424:255323:255323 [0] NCCL INFO Comm config Blocking set to 1
nid005424:255323:255451 [0] NCCL INFO Using network Libfabric
nid005424:255323:255451 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005424:255323:255451 [0] NCCL INFO ncclCommInitRank comm 0x46a84140 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x999b0ad20548e6c2 - Init START
nid005424:255323:255451 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005424:255323:255451 [0] NCCL INFO comm 0x46a84140 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005424:255323:255451 [0] NCCL INFO Channel 00/08 :    0   1
nid005424:255323:255451 [0] NCCL INFO Channel 01/08 :    0   1
nid005424:255323:255451 [0] NCCL INFO Channel 02/08 :    0   1
nid005424:255323:255451 [0] NCCL INFO Channel 03/08 :    0   1
nid005424:255323:255451 [0] NCCL INFO Channel 04/08 :    0   1
nid005424:255323:255451 [0] NCCL INFO Channel 05/08 :    0   1
nid005424:255323:255451 [0] NCCL INFO Channel 06/08 :    0   1
nid005424:255323:255451 [0] NCCL INFO Channel 07/08 :    0   1
nid005424:255323:255451 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005424:255323:255451 [0] NCCL INFO P2P Chunksize set to 524288
nid005424:255323:255451 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:255323:255451 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:255323:255451 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005424:255323:255451 [0] NCCL INFO ncclCommInitRank comm 0x46a84140 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x999b0ad20548e6c2 - Init COMPLETE
nid005424:255323:255451 [0] NCCL INFO Init timings: rank 0 nranks 2 total 0.16 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.15, graphs 0.00, connections 0.01, rest 0.00)
nid005424:255323:255461 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255461 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255461 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255461 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255461 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255461 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255461 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255461 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005424:255323:255461 [0] NCCL INFO Connected all rings
nid005424:255323:255323 [0] NCCL INFO comm 0x46a84140 rank 0 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005424:255323:255323 [0] NCCL INFO comm 0x41da66a0 rank 0 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005424:255323:255323 [0] NCCL INFO comm 0x3f981d80 rank 0 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE

./syevd-1x2ranks-1.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=1 q=2 grid_layout= verbose=0
nid005424:255324:255324 [1] NCCL INFO cudaDriverVersion 12040
nid005424:255324:255324 [1] NCCL INFO Bootstrap : Using nmn0:10.100.48.75<0>
nid005424:255324:255324 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005424:255324:255324 [1] NCCL INFO Comm config Blocking set to 1
nid005424:255324:255418 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005424:255324:255418 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005424:255324:255418 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005424:255324:255418 [1] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005424:255324:255418 [1] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005424:255324:255418 [1] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005424:255324:255418 [1] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005424:255324:255418 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005424:255324:255418 [1] NCCL INFO NET/OFI Creating one domain per process
nid005424:255324:255418 [1] NCCL INFO NET/OFI Support for global registrations: false
nid005424:255324:255418 [1] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005424:255324:255418 [1] NCCL INFO Using network Libfabric
nid005424:255324:255418 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:255324:255418 [1] NCCL INFO ncclCommInitRank comm 0xb0b5230 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x836a94f099eab5c9 - Init START
nid005424:255324:255418 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:255324:255418 [1] NCCL INFO comm 0xb0b5230 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:255324:255418 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:255324:255418 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:255324:255418 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:255324:255418 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:255324:255418 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005424:255324:255418 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005424:255324:255418 [1] NCCL INFO ncclCommInitRank comm 0xb0b5230 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x836a94f099eab5c9 - Init COMPLETE
nid005424:255324:255418 [1] NCCL INFO Init timings: rank 1 nranks 2 total 2.15 (kernels 0.08, bootstrap 1.91, allgathers 0.00, topo 0.15, graphs 0.00, connections 0.01, rest 0.00)
nid005424:255324:255439 [1] NCCL INFO Channel 00/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255439 [1] NCCL INFO Channel 01/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255439 [1] NCCL INFO Channel 02/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255439 [1] NCCL INFO Channel 03/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255439 [1] NCCL INFO Channel 04/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255439 [1] NCCL INFO Channel 05/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255439 [1] NCCL INFO Channel 06/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255439 [1] NCCL INFO Channel 07/1 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255324 [1] NCCL INFO Comm config Blocking set to 1
nid005424:255324:255440 [1] NCCL INFO Using network Libfabric
nid005424:255324:255440 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:255324:255440 [1] NCCL INFO ncclCommInitRank comm 0xd7eaf90 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x59fd6880ba6ab252 - Init START
nid005424:255324:255440 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:255324:255440 [1] NCCL INFO comm 0xd7eaf90 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:255324:255440 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:255324:255440 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:255324:255440 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:255324:255440 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:255324:255440 [1] NCCL INFO ncclCommInitRank comm 0xd7eaf90 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x59fd6880ba6ab252 - Init COMPLETE
nid005424:255324:255440 [1] NCCL INFO Init timings: rank 1 nranks 2 total 0.26 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.01, rest 0.00)
nid005424:255324:255450 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255450 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255450 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255450 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255450 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255450 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255450 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255450 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255450 [1] NCCL INFO Connected all rings
nid005424:255324:255324 [1] NCCL INFO Comm config Blocking set to 1
nid005424:255324:255452 [1] NCCL INFO Using network Libfabric
nid005424:255324:255452 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005424:255324:255452 [1] NCCL INFO ncclCommInitRank comm 0x123d6df0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x999b0ad20548e6c2 - Init START
nid005424:255324:255452 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005424:255324:255452 [1] NCCL INFO comm 0x123d6df0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005424:255324:255452 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005424:255324:255452 [1] NCCL INFO P2P Chunksize set to 524288
nid005424:255324:255452 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005424:255324:255452 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005424:255324:255452 [1] NCCL INFO ncclCommInitRank comm 0x123d6df0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x999b0ad20548e6c2 - Init COMPLETE
nid005424:255324:255452 [1] NCCL INFO Init timings: rank 1 nranks 2 total 0.15 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.15, graphs 0.00, connections 0.01, rest 0.00)
nid005424:255324:255460 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255460 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255460 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255460 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255460 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255460 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255460 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255460 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005424:255324:255460 [1] NCCL INFO Connected all rings
nid005424:255324:255324 [1] NCCL INFO comm 0x123d6df0 rank 1 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005424:255324:255324 [1] NCCL INFO comm 0xd7eaf90 rank 1 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005424:255324:255324 [1] NCCL INFO comm 0xb0b5230 rank 1 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE

Another environment variable that comes to mind is "UCC_TL_NCCL_LAZY_INIT=no". Could you please give it a try and let us know if anything changes?

Actually, with UCC_TL_NCCL_LAZY_INIT=no set, it does not hang anymore.

UCC_TL_NCCL_LAZY_INIT=no

UCC_TL_NCCL_LAZY_INIT=no NCCL_DEBUG=info srun -n4 -u -o with-ucc-nccl-lazy-%t.txt ./mp_syevd -p 2 -q 2
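For batch jobs, the same workaround can be exported once before the srun call instead of being prefixed on the command line. A minimal sketch of such a job script, assuming the mp_syevd binary built above (the SBATCH resource values are illustrative placeholders, not taken from the runs in this thread):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1

# Workaround from this thread: have UCC's NCCL TL create its NCCL
# communicators eagerly rather than lazily on first use, which avoids
# the initialization hang observed above.
export UCC_TL_NCCL_LAZY_INIT=no
# Optional: keep NCCL init logging on while verifying the workaround.
export NCCL_DEBUG=info

srun -n 4 -u ./mp_syevd -p 2 -q 2

With the variable exported this way, the per-rank logs below (with-ucc-nccl-lazy-*.txt) show all communicators reaching "Init COMPLETE" and being destroyed cleanly.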

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/with-ucc-nccl-lazy-0.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout=verbose=0
nid005428:29176:29176 [0] NCCL INFO Bootstrap : Using nmn0:10.100.48.116<0>
nid005428:29176:29176 [0] NCCL INFO cudaDriverVersion 12040
nid005428:29176:29176 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005428:29176:29176 [0] NCCL INFO Comm config Blocking set to 1
nid005428:29176:29249 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005428:29176:29249 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005428:29176:29249 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005428:29176:29249 [0] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005428:29176:29249 [0] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005428:29176:29249 [0] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005428:29176:29249 [0] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005428:29176:29249 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005428:29176:29249 [0] NCCL INFO NET/OFI Creating one domain per process
nid005428:29176:29249 [0] NCCL INFO NET/OFI Support for global registrations: false
nid005428:29176:29249 [0] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005428:29176:29249 [0] NCCL INFO Using network Libfabric
nid005428:29176:29249 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005428:29176:29249 [0] NCCL INFO ncclCommInitRank comm 0x2ebd0af0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 901000 commId 0xf3f6dd76363c981e - Init START
nid005428:29176:29249 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005428:29176:29249 [0] NCCL INFO NVLS multicast support is not available on dev 0
nid005428:29176:29249 [0] NCCL INFO comm 0x2ebd0af0 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
nid005428:29176:29249 [0] NCCL INFO Channel 00/24 :    0   1   2   3
nid005428:29176:29249 [0] NCCL INFO Channel 01/24 :    0   1   3   2
nid005428:29176:29249 [0] NCCL INFO Channel 02/24 :    0   2   3   1
nid005428:29176:29249 [0] NCCL INFO Channel 03/24 :    0   2   1   3
nid005428:29176:29249 [0] NCCL INFO Channel 04/24 :    0   3   1   2
nid005428:29176:29249 [0] NCCL INFO Channel 05/24 :    0   3   2   1
nid005428:29176:29249 [0] NCCL INFO Channel 06/24 :    0   1   2   3
nid005428:29176:29249 [0] NCCL INFO Channel 07/24 :    0   1   3   2
nid005428:29176:29249 [0] NCCL INFO Channel 08/24 :    0   2   3   1
nid005428:29176:29249 [0] NCCL INFO Channel 09/24 :    0   2   1   3
nid005428:29176:29249 [0] NCCL INFO Channel 10/24 :    0   3   1   2
nid005428:29176:29249 [0] NCCL INFO Channel 11/24 :    0   3   2   1
nid005428:29176:29249 [0] NCCL INFO Channel 12/24 :    0   1   2   3
nid005428:29176:29249 [0] NCCL INFO Channel 13/24 :    0   1   3   2
nid005428:29176:29249 [0] NCCL INFO Channel 14/24 :    0   2   3   1
nid005428:29176:29249 [0] NCCL INFO Channel 15/24 :    0   2   1   3
nid005428:29176:29249 [0] NCCL INFO Channel 16/24 :    0   3   1   2
nid005428:29176:29249 [0] NCCL INFO Channel 17/24 :    0   3   2   1
nid005428:29176:29249 [0] NCCL INFO Channel 18/24 :    0   1   2   3
nid005428:29176:29249 [0] NCCL INFO Channel 19/24 :    0   1   3   2
nid005428:29176:29249 [0] NCCL INFO Channel 20/24 :    0   2   3   1
nid005428:29176:29249 [0] NCCL INFO Channel 21/24 :    0   2   1   3
nid005428:29176:29249 [0] NCCL INFO Channel 22/24 :    0   3   1   2
nid005428:29176:29249 [0] NCCL INFO Channel 23/24 :    0   3   2   1
nid005428:29176:29249 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 2/-1/-1->0->-1 [5] 2/-1/-1->0->-1 [6] 2/-1/-1->0->-1 [7] 2/-1/-1->0->-1 [8] 3/-1/-1->0->1 [9] 3/-1/-1->0->1 [10] 3/-1/-1->0->1 [11] 3/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 2/-1/-1->0->-1 [17] 2/-1/-1->0->-1 [18] 2/-1/-1->0->-1 [19] 2/-1/-1->0->-1 [20] 3/-1/-1->0->1 [21] 3/-1/-1->0->1 [22] 3/-1/-1->0->1 [23] 3/-1/-1->0->1
nid005428:29176:29249 [0] NCCL INFO P2P Chunksize set to 524288
nid005428:29176:29249 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005428:29176:29249 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005428:29176:29249 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005428:29176:29249 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005428:29176:29249 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005428:29176:29249 [0] NCCL INFO ncclCommInitRank comm 0x2ebd0af0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 901000 commId 0xf3f6dd76363c981e - Init COMPLETE
nid005428:29176:29249 [0] NCCL INFO Init timings: rank 0 nranks 4 total 3.04 (kernels 0.10, bootstrap 2.53, allgathers 0.00, topo 0.33, graphs 0.03, connections 0.04, rest 0.02)
nid005428:29176:29176 [0] NCCL INFO Comm config Blocking set to 1
nid005428:29176:29302 [0] NCCL INFO Using network Libfabric
nid005428:29176:29302 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005428:29176:29302 [0] NCCL INFO ncclCommInitRank comm 0x2f4f1b50 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 901000 commId 0x552d7e5ae73a2563 - Init START
nid005428:29176:29302 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005428:29176:29302 [0] NCCL INFO NVLS multicast support is not available on dev 0
nid005428:29176:29302 [0] NCCL INFO comm 0x2f4f1b50 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
nid005428:29176:29302 [0] NCCL INFO Channel 00/24 :    0   1   2   3
nid005428:29176:29302 [0] NCCL INFO Channel 01/24 :    0   1   3   2
nid005428:29176:29302 [0] NCCL INFO Channel 02/24 :    0   2   3   1
nid005428:29176:29302 [0] NCCL INFO Channel 03/24 :    0   2   1   3
nid005428:29176:29302 [0] NCCL INFO Channel 04/24 :    0   3   1   2
nid005428:29176:29302 [0] NCCL INFO Channel 05/24 :    0   3   2   1
nid005428:29176:29302 [0] NCCL INFO Channel 06/24 :    0   1   2   3
nid005428:29176:29302 [0] NCCL INFO Channel 07/24 :    0   1   3   2
nid005428:29176:29302 [0] NCCL INFO Channel 08/24 :    0   2   3   1
nid005428:29176:29302 [0] NCCL INFO Channel 09/24 :    0   2   1   3
nid005428:29176:29302 [0] NCCL INFO Channel 10/24 :    0   3   1   2
nid005428:29176:29302 [0] NCCL INFO Channel 11/24 :    0   3   2   1
nid005428:29176:29302 [0] NCCL INFO Channel 12/24 :    0   1   2   3
nid005428:29176:29302 [0] NCCL INFO Channel 13/24 :    0   1   3   2
nid005428:29176:29302 [0] NCCL INFO Channel 14/24 :    0   2   3   1
nid005428:29176:29302 [0] NCCL INFO Channel 15/24 :    0   2   1   3
nid005428:29176:29302 [0] NCCL INFO Channel 16/24 :    0   3   1   2
nid005428:29176:29302 [0] NCCL INFO Channel 17/24 :    0   3   2   1
nid005428:29176:29302 [0] NCCL INFO Channel 18/24 :    0   1   2   3
nid005428:29176:29302 [0] NCCL INFO Channel 19/24 :    0   1   3   2
nid005428:29176:29302 [0] NCCL INFO Channel 20/24 :    0   2   3   1
nid005428:29176:29302 [0] NCCL INFO Channel 21/24 :    0   2   1   3
nid005428:29176:29302 [0] NCCL INFO Channel 22/24 :    0   3   1   2
nid005428:29176:29302 [0] NCCL INFO Channel 23/24 :    0   3   2   1
nid005428:29176:29302 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 2/-1/-1->0->-1 [5] 2/-1/-1->0->-1 [6] 2/-1/-1->0->-1 [7] 2/-1/-1->0->-1 [8] 3/-1/-1->0->1 [9] 3/-1/-1->0->1 [10] 3/-1/-1->0->1 [11] 3/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 2/-1/-1->0->-1 [17] 2/-1/-1->0->-1 [18] 2/-1/-1->0->-1 [19] 2/-1/-1->0->-1 [20] 3/-1/-1->0->1 [21] 3/-1/-1->0->1 [22] 3/-1/-1->0->1 [23] 3/-1/-1->0->1
nid005428:29176:29302 [0] NCCL INFO P2P Chunksize set to 524288
nid005428:29176:29302 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005428:29176:29302 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005428:29176:29302 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005428:29176:29302 [0] NCCL INFO ncclCommInitRank comm 0x2f4f1b50 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 901000 commId 0x552d7e5ae73a2563 - Init COMPLETE
nid005428:29176:29302 [0] NCCL INFO Init timings: rank 0 nranks 4 total 0.60 (kernels 0.00, bootstrap 0.07, allgathers 0.00, topo 0.45, graphs 0.03, connections 0.04, rest 0.01)
nid005428:29176:29176 [0] NCCL INFO Comm config Blocking set to 1
nid005428:29176:29317 [0] NCCL INFO Using network Libfabric
nid005428:29176:29317 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005428:29176:29317 [0] NCCL INFO ncclCommInitRank comm 0x2f5274e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x428ecd542a5cea99 - Init START
nid005428:29176:29317 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005428:29176:29317 [0] NCCL INFO comm 0x2f5274e0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005428:29176:29317 [0] NCCL INFO Channel 00/08 :    0   1
nid005428:29176:29317 [0] NCCL INFO Channel 01/08 :    0   1
nid005428:29176:29317 [0] NCCL INFO Channel 02/08 :    0   1
nid005428:29176:29317 [0] NCCL INFO Channel 03/08 :    0   1
nid005428:29176:29317 [0] NCCL INFO Channel 04/08 :    0   1
nid005428:29176:29317 [0] NCCL INFO Channel 05/08 :    0   1
nid005428:29176:29317 [0] NCCL INFO Channel 06/08 :    0   1
nid005428:29176:29317 [0] NCCL INFO Channel 07/08 :    0   1
nid005428:29176:29317 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005428:29176:29317 [0] NCCL INFO P2P Chunksize set to 524288
nid005428:29176:29317 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005428:29176:29317 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005428:29176:29317 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005428:29176:29317 [0] NCCL INFO ncclCommInitRank comm 0x2f5274e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x428ecd542a5cea99 - Init COMPLETE
nid005428:29176:29317 [0] NCCL INFO Init timings: rank 0 nranks 2 total 0.35 (kernels 0.00, bootstrap 0.04, allgathers 0.00, topo 0.30, graphs 0.00, connections 0.01, rest 0.00)
nid005428:29176:29176 [0] NCCL INFO Comm config Blocking set to 1
nid005428:29176:29337 [0] NCCL INFO Using network Libfabric
nid005428:29176:29337 [0] NCCL INFO DMA-BUF is available on GPU device 0
nid005428:29176:29337 [0] NCCL INFO ncclCommInitRank comm 0x2f560920 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x780986b8e9d83c8d - Init START
nid005428:29176:29337 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
nid005428:29176:29337 [0] NCCL INFO comm 0x2f560920 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005428:29176:29337 [0] NCCL INFO Channel 00/08 :    0   1
nid005428:29176:29337 [0] NCCL INFO Channel 01/08 :    0   1
nid005428:29176:29337 [0] NCCL INFO Channel 02/08 :    0   1
nid005428:29176:29337 [0] NCCL INFO Channel 03/08 :    0   1
nid005428:29176:29337 [0] NCCL INFO Channel 04/08 :    0   1
nid005428:29176:29337 [0] NCCL INFO Channel 05/08 :    0   1
nid005428:29176:29337 [0] NCCL INFO Channel 06/08 :    0   1
nid005428:29176:29337 [0] NCCL INFO Channel 07/08 :    0   1
nid005428:29176:29337 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005428:29176:29337 [0] NCCL INFO P2P Chunksize set to 524288
nid005428:29176:29337 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005428:29176:29337 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005428:29176:29337 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005428:29176:29337 [0] NCCL INFO ncclCommInitRank comm 0x2f560920 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x780986b8e9d83c8d - Init COMPLETE
nid005428:29176:29337 [0] NCCL INFO Init timings: rank 0 nranks 2 total 0.32 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.31, graphs 0.00, connections 0.01, rest 0.00)
nid005428:29176:29349 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29349 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29349 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29349 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29349 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29349 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29349 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29349 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29349 [0] NCCL INFO Connected all rings
nid005428:29176:29359 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005428:29176:29359 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005428:29176:29359 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005428:29176:29359 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005428:29176:29359 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005428:29176:29359 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005428:29176:29359 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005428:29176:29359 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[2] via P2P/CUMEM
nid005428:29176:29359 [0] NCCL INFO Connected all rings
nid005428:29176:29371 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 02/0 : 0[0] -> 2[2] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 03/0 : 0[0] -> 2[2] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 08/0 : 0[0] -> 2[2] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 09/0 : 0[0] -> 2[2] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 14/0 : 0[0] -> 2[2] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 15/0 : 0[0] -> 2[2] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 20/0 : 0[0] -> 2[2] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 21/0 : 0[0] -> 2[2] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 04/0 : 0[0] -> 3[3] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 05/0 : 0[0] -> 3[3] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 10/0 : 0[0] -> 3[3] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 11/0 : 0[0] -> 3[3] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 16/0 : 0[0] -> 3[3] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 17/0 : 0[0] -> 3[3] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 22/0 : 0[0] -> 3[3] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Channel 23/0 : 0[0] -> 3[3] via P2P/CUMEM
nid005428:29176:29371 [0] NCCL INFO Connected all rings
nid005428:29176:29176 [0] NCCL INFO comm 0x2f4f1b50 rank 0 nranks 4 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005428:29176:29176 [0] NCCL INFO comm 0x2f5274e0 rank 0 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005428:29176:29176 [0] NCCL INFO comm 0x2f560920 rank 0 nranks 2 cudaDev 0 busId 901000 - Destroy COMPLETE
nid005428:29176:29176 [0] NCCL INFO comm 0x2ebd0af0 rank 0 nranks 4 cudaDev 0 busId 901000 - Destroy COMPLETE

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/with-ucc-nccl-lazy-1.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout=verbose=0
nid005428:29177:29177 [1] NCCL INFO cudaDriverVersion 12040
nid005428:29177:29177 [1] NCCL INFO Bootstrap : Using nmn0:10.100.48.116<0>
nid005428:29177:29177 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005428:29177:29177 [1] NCCL INFO Comm config Blocking set to 1
nid005428:29177:29251 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005428:29177:29251 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005428:29177:29251 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005428:29177:29251 [1] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005428:29177:29251 [1] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005428:29177:29251 [1] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005428:29177:29251 [1] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005428:29177:29251 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005428:29177:29251 [1] NCCL INFO NET/OFI Creating one domain per process
nid005428:29177:29251 [1] NCCL INFO NET/OFI Support for global registrations: false
nid005428:29177:29251 [1] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005428:29177:29251 [1] NCCL INFO Using network Libfabric
nid005428:29177:29251 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005428:29177:29251 [1] NCCL INFO ncclCommInitRank comm 0x3e2dec40 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xf3f6dd76363c981e - Init START
nid005428:29177:29251 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005428:29177:29251 [1] NCCL INFO NVLS multicast support is not available on dev 1
nid005428:29177:29251 [1] NCCL INFO comm 0x3e2dec40 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
nid005428:29177:29251 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 3/-1/-1->1->2 [5] 3/-1/-1->1->2 [6] 3/-1/-1->1->2 [7] 3/-1/-1->1->2 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 3/-1/-1->1->2 [17] 3/-1/-1->1->2 [18] 3/-1/-1->1->2 [19] 3/-1/-1->1->2 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
nid005428:29177:29251 [1] NCCL INFO P2P Chunksize set to 524288
nid005428:29177:29251 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005428:29177:29251 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005428:29177:29251 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005428:29177:29251 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005428:29177:29251 [1] NCCL INFO ncclCommInitRank comm 0x3e2dec40 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xf3f6dd76363c981e - Init COMPLETE
nid005428:29177:29251 [1] NCCL INFO Init timings: rank 1 nranks 4 total 3.02 (kernels 0.07, bootstrap 2.53, allgathers 0.00, topo 0.33, graphs 0.03, connections 0.04, rest 0.02)
nid005428:29177:29177 [1] NCCL INFO Comm config Blocking set to 1
nid005428:29177:29303 [1] NCCL INFO Using network Libfabric
nid005428:29177:29303 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005428:29177:29303 [1] NCCL INFO ncclCommInitRank comm 0x3ec29800 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x552d7e5ae73a2563 - Init START
nid005428:29177:29303 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005428:29177:29303 [1] NCCL INFO NVLS multicast support is not available on dev 1
nid005428:29177:29303 [1] NCCL INFO comm 0x3ec29800 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
nid005428:29177:29303 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 3/-1/-1->1->2 [5] 3/-1/-1->1->2 [6] 3/-1/-1->1->2 [7] 3/-1/-1->1->2 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 3/-1/-1->1->2 [17] 3/-1/-1->1->2 [18] 3/-1/-1->1->2 [19] 3/-1/-1->1->2 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
nid005428:29177:29303 [1] NCCL INFO P2P Chunksize set to 524288
nid005428:29177:29303 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005428:29177:29303 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005428:29177:29303 [1] NCCL INFO ncclCommInitRank comm 0x3ec29800 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x552d7e5ae73a2563 - Init COMPLETE
nid005428:29177:29303 [1] NCCL INFO Init timings: rank 1 nranks 4 total 0.60 (kernels 0.00, bootstrap 0.07, allgathers 0.00, topo 0.45, graphs 0.03, connections 0.04, rest 0.01)
nid005428:29177:29177 [1] NCCL INFO Comm config Blocking set to 1
nid005428:29177:29320 [1] NCCL INFO Using network Libfabric
nid005428:29177:29320 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005428:29177:29320 [1] NCCL INFO ncclCommInitRank comm 0x3ec66170 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xed8fa64eec944412 - Init START
nid005428:29177:29320 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005428:29177:29320 [1] NCCL INFO comm 0x3ec66170 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005428:29177:29320 [1] NCCL INFO Channel 00/08 :    0   1
nid005428:29177:29320 [1] NCCL INFO Channel 01/08 :    0   1
nid005428:29177:29320 [1] NCCL INFO Channel 02/08 :    0   1
nid005428:29177:29320 [1] NCCL INFO Channel 03/08 :    0   1
nid005428:29177:29320 [1] NCCL INFO Channel 04/08 :    0   1
nid005428:29177:29320 [1] NCCL INFO Channel 05/08 :    0   1
nid005428:29177:29320 [1] NCCL INFO Channel 06/08 :    0   1
nid005428:29177:29320 [1] NCCL INFO Channel 07/08 :    0   1
nid005428:29177:29320 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005428:29177:29320 [1] NCCL INFO P2P Chunksize set to 524288
nid005428:29177:29320 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005428:29177:29320 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005428:29177:29320 [1] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005428:29177:29320 [1] NCCL INFO ncclCommInitRank comm 0x3ec66170 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0xed8fa64eec944412 - Init COMPLETE
nid005428:29177:29320 [1] NCCL INFO Init timings: rank 0 nranks 2 total 0.32 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.30, graphs 0.00, connections 0.01, rest 0.00)
nid005428:29177:29177 [1] NCCL INFO Comm config Blocking set to 1
nid005428:29177:29336 [1] NCCL INFO Using network Libfabric
nid005428:29177:29336 [1] NCCL INFO DMA-BUF is available on GPU device 1
nid005428:29177:29336 [1] NCCL INFO ncclCommInitRank comm 0x3eca6110 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x780986b8e9d83c8d - Init START
nid005428:29177:29336 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,ffffff00,00000000,00000000
nid005428:29177:29336 [1] NCCL INFO comm 0x3eca6110 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005428:29177:29336 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005428:29177:29336 [1] NCCL INFO P2P Chunksize set to 524288
nid005428:29177:29336 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005428:29177:29336 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005428:29177:29336 [1] NCCL INFO ncclCommInitRank comm 0x3eca6110 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1901000 commId 0x780986b8e9d83c8d - Init COMPLETE
nid005428:29177:29336 [1] NCCL INFO Init timings: rank 1 nranks 2 total 0.32 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.31, graphs 0.00, connections 0.01, rest 0.00)
nid005428:29177:29347 [1] NCCL INFO Channel 01/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 03/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 05/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 07/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 09/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 11/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 13/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 15/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 17/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 19/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 21/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 23/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 25/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 27/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 29/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29347 [1] NCCL INFO Channel 31/1 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29355 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29355 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29355 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29355 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29355 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29355 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29355 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29355 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29355 [1] NCCL INFO Connected all rings
nid005428:29177:29361 [1] NCCL INFO Channel 00/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005428:29177:29361 [1] NCCL INFO Channel 01/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005428:29177:29361 [1] NCCL INFO Channel 02/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005428:29177:29361 [1] NCCL INFO Channel 03/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005428:29177:29361 [1] NCCL INFO Channel 04/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005428:29177:29361 [1] NCCL INFO Channel 05/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005428:29177:29361 [1] NCCL INFO Channel 06/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005428:29177:29361 [1] NCCL INFO Channel 07/0 : 0[1] -> 1[3] via P2P/CUMEM
nid005428:29177:29361 [1] NCCL INFO Connected all rings
nid005428:29177:29370 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 01/0 : 1[1] -> 3[3] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 03/0 : 1[1] -> 3[3] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 07/0 : 1[1] -> 3[3] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 09/0 : 1[1] -> 3[3] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 13/0 : 1[1] -> 3[3] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 15/0 : 1[1] -> 3[3] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 19/0 : 1[1] -> 3[3] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 21/0 : 1[1] -> 3[3] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM
nid005428:29177:29370 [1] NCCL INFO Connected all rings
nid005428:29177:29177 [1] NCCL INFO comm 0x3ec29800 rank 1 nranks 4 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005428:29177:29177 [1] NCCL INFO comm 0x3ec66170 rank 0 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005428:29177:29177 [1] NCCL INFO comm 0x3eca6110 rank 1 nranks 2 cudaDev 1 busId 1901000 - Destroy COMPLETE
nid005428:29177:29177 [1] NCCL INFO comm 0x3e2dec40 rank 1 nranks 4 cudaDev 1 busId 1901000 - Destroy COMPLETE

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/with-ucc-nccl-lazy-2.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout=verbose=0
nid005428:29178:29178 [2] NCCL INFO cudaDriverVersion 12040
nid005428:29178:29178 [2] NCCL INFO Bootstrap : Using nmn0:10.100.48.116<0>
nid005428:29178:29178 [2] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005428:29178:29178 [2] NCCL INFO Comm config Blocking set to 1
nid005428:29178:29252 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005428:29178:29252 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005428:29178:29252 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005428:29178:29252 [2] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005428:29178:29252 [2] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005428:29178:29252 [2] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005428:29178:29252 [2] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005428:29178:29252 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005428:29178:29252 [2] NCCL INFO NET/OFI Creating one domain per process
nid005428:29178:29252 [2] NCCL INFO NET/OFI Support for global registrations: false
nid005428:29178:29252 [2] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005428:29178:29252 [2] NCCL INFO Using network Libfabric
nid005428:29178:29252 [2] NCCL INFO DMA-BUF is available on GPU device 2
nid005428:29178:29252 [2] NCCL INFO ncclCommInitRank comm 0x1958ecf0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 2901000 commId 0xf3f6dd76363c981e - Init START
nid005428:29178:29252 [2] NCCL INFO Setting affinity for GPU 2 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
nid005428:29178:29252 [2] NCCL INFO NVLS multicast support is not available on dev 2
nid005428:29178:29252 [2] NCCL INFO comm 0x1958ecf0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
nid005428:29178:29252 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 1/-1/-1->2->0 [5] 1/-1/-1->2->0 [6] 1/-1/-1->2->0 [7] 1/-1/-1->2->0 [8] -1/-1/-1->2->3 [9] -1/-1/-1->2->3 [10] -1/-1/-1->2->3 [11] -1/-1/-1->2->3 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 1/-1/-1->2->0 [17] 1/-1/-1->2->0 [18] 1/-1/-1->2->0 [19] 1/-1/-1->2->0 [20] -1/-1/-1->2->3 [21] -1/-1/-1->2->3 [22] -1/-1/-1->2->3 [23] -1/-1/-1->2->3
nid005428:29178:29252 [2] NCCL INFO P2P Chunksize set to 524288
nid005428:29178:29252 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005428:29178:29252 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005428:29178:29252 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005428:29178:29252 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005428:29178:29252 [2] NCCL INFO ncclCommInitRank comm 0x1958ecf0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 2901000 commId 0xf3f6dd76363c981e - Init COMPLETE
nid005428:29178:29252 [2] NCCL INFO Init timings: rank 2 nranks 4 total 3.02 (kernels 0.07, bootstrap 2.53, allgathers 0.00, topo 0.33, graphs 0.03, connections 0.04, rest 0.02)
nid005428:29178:29178 [2] NCCL INFO Comm config Blocking set to 1
nid005428:29178:29300 [2] NCCL INFO Using network Libfabric
nid005428:29178:29300 [2] NCCL INFO DMA-BUF is available on GPU device 2
nid005428:29178:29300 [2] NCCL INFO ncclCommInitRank comm 0x19ed9530 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 2901000 commId 0x552d7e5ae73a2563 - Init START
nid005428:29178:29300 [2] NCCL INFO Setting affinity for GPU 2 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
nid005428:29178:29300 [2] NCCL INFO NVLS multicast support is not available on dev 2
nid005428:29178:29300 [2] NCCL INFO comm 0x19ed9530 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
nid005428:29178:29300 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 1/-1/-1->2->0 [5] 1/-1/-1->2->0 [6] 1/-1/-1->2->0 [7] 1/-1/-1->2->0 [8] -1/-1/-1->2->3 [9] -1/-1/-1->2->3 [10] -1/-1/-1->2->3 [11] -1/-1/-1->2->3 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 1/-1/-1->2->0 [17] 1/-1/-1->2->0 [18] 1/-1/-1->2->0 [19] 1/-1/-1->2->0 [20] -1/-1/-1->2->3 [21] -1/-1/-1->2->3 [22] -1/-1/-1->2->3 [23] -1/-1/-1->2->3
nid005428:29178:29300 [2] NCCL INFO P2P Chunksize set to 524288
nid005428:29178:29300 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005428:29178:29300 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005428:29178:29300 [2] NCCL INFO ncclCommInitRank comm 0x19ed9530 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 2901000 commId 0x552d7e5ae73a2563 - Init COMPLETE
nid005428:29178:29300 [2] NCCL INFO Init timings: rank 2 nranks 4 total 0.60 (kernels 0.00, bootstrap 0.07, allgathers 0.00, topo 0.45, graphs 0.03, connections 0.04, rest 0.01)
nid005428:29178:29178 [2] NCCL INFO Comm config Blocking set to 1
nid005428:29178:29316 [2] NCCL INFO Using network Libfabric
nid005428:29178:29316 [2] NCCL INFO DMA-BUF is available on GPU device 2
nid005428:29178:29316 [2] NCCL INFO ncclCommInitRank comm 0x19f15e20 rank 1 nranks 2 cudaDev 2 nvmlDev 2 busId 2901000 commId 0x428ecd542a5cea99 - Init START
nid005428:29178:29316 [2] NCCL INFO Setting affinity for GPU 2 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
nid005428:29178:29316 [2] NCCL INFO comm 0x19f15e20 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005428:29178:29316 [2] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005428:29178:29316 [2] NCCL INFO P2P Chunksize set to 524288
nid005428:29178:29316 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005428:29178:29316 [2] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005428:29178:29316 [2] NCCL INFO ncclCommInitRank comm 0x19f15e20 rank 1 nranks 2 cudaDev 2 nvmlDev 2 busId 2901000 commId 0x428ecd542a5cea99 - Init COMPLETE
nid005428:29178:29316 [2] NCCL INFO Init timings: rank 1 nranks 2 total 0.35 (kernels 0.00, bootstrap 0.04, allgathers 0.00, topo 0.30, graphs 0.00, connections 0.01, rest 0.00)
nid005428:29178:29178 [2] NCCL INFO Comm config Blocking set to 1
nid005428:29178:29331 [2] NCCL INFO Using network Libfabric
nid005428:29178:29331 [2] NCCL INFO DMA-BUF is available on GPU device 2
nid005428:29178:29331 [2] NCCL INFO ncclCommInitRank comm 0x19f55d60 rank 0 nranks 2 cudaDev 2 nvmlDev 2 busId 2901000 commId 0x787e4febd298f8f0 - Init START
nid005428:29178:29331 [2] NCCL INFO Setting affinity for GPU 2 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
nid005428:29178:29331 [2] NCCL INFO comm 0x19f55d60 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
nid005428:29178:29331 [2] NCCL INFO Channel 00/08 :    0   1
nid005428:29178:29331 [2] NCCL INFO Channel 01/08 :    0   1
nid005428:29178:29331 [2] NCCL INFO Channel 02/08 :    0   1
nid005428:29178:29331 [2] NCCL INFO Channel 03/08 :    0   1
nid005428:29178:29331 [2] NCCL INFO Channel 04/08 :    0   1
nid005428:29178:29331 [2] NCCL INFO Channel 05/08 :    0   1
nid005428:29178:29331 [2] NCCL INFO Channel 06/08 :    0   1
nid005428:29178:29331 [2] NCCL INFO Channel 07/08 :    0   1
nid005428:29178:29331 [2] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
nid005428:29178:29331 [2] NCCL INFO P2P Chunksize set to 524288
nid005428:29178:29331 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005428:29178:29331 [2] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005428:29178:29331 [2] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
nid005428:29178:29331 [2] NCCL INFO ncclCommInitRank comm 0x19f55d60 rank 0 nranks 2 cudaDev 2 nvmlDev 2 busId 2901000 commId 0x787e4febd298f8f0 - Init COMPLETE
nid005428:29178:29331 [2] NCCL INFO Init timings: rank 0 nranks 2 total 0.36 (kernels 0.00, bootstrap 0.02, allgathers 0.00, topo 0.32, graphs 0.00, connections 0.02, rest 0.00)
nid005428:29178:29354 [2] NCCL INFO Channel 00/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 02/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 04/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 06/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 08/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 10/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 12/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 14/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 16/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 18/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 20/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 22/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 24/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 26/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 28/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29354 [2] NCCL INFO Channel 30/1 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29360 [2] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/CUMEM
nid005428:29178:29360 [2] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/CUMEM
nid005428:29178:29360 [2] NCCL INFO Channel 02/0 : 0[2] -> 1[3] via P2P/CUMEM
nid005428:29178:29360 [2] NCCL INFO Channel 03/0 : 0[2] -> 1[3] via P2P/CUMEM
nid005428:29178:29360 [2] NCCL INFO Channel 04/0 : 0[2] -> 1[3] via P2P/CUMEM
nid005428:29178:29360 [2] NCCL INFO Channel 05/0 : 0[2] -> 1[3] via P2P/CUMEM
nid005428:29178:29360 [2] NCCL INFO Channel 06/0 : 0[2] -> 1[3] via P2P/CUMEM
nid005428:29178:29360 [2] NCCL INFO Channel 07/0 : 0[2] -> 1[3] via P2P/CUMEM
nid005428:29178:29360 [2] NCCL INFO Connected all rings
nid005428:29178:29362 [2] NCCL INFO Channel 00/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29362 [2] NCCL INFO Channel 01/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29362 [2] NCCL INFO Channel 02/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29362 [2] NCCL INFO Channel 03/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29362 [2] NCCL INFO Channel 04/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29362 [2] NCCL INFO Channel 05/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29362 [2] NCCL INFO Channel 06/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29362 [2] NCCL INFO Channel 07/0 : 1[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29362 [2] NCCL INFO Connected all rings
nid005428:29178:29372 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 01/0 : 2[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 04/0 : 2[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 07/0 : 2[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 10/0 : 2[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 13/0 : 2[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 16/0 : 2[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 19/0 : 2[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 22/0 : 2[2] -> 0[0] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM
nid005428:29178:29372 [2] NCCL INFO Connected all rings
nid005428:29178:29178 [2] NCCL INFO comm 0x19ed9530 rank 2 nranks 4 cudaDev 2 busId 2901000 - Destroy COMPLETE
nid005428:29178:29178 [2] NCCL INFO comm 0x19f15e20 rank 1 nranks 2 cudaDev 2 busId 2901000 - Destroy COMPLETE
nid005428:29178:29178 [2] NCCL INFO comm 0x19f55d60 rank 0 nranks 2 cudaDev 2 busId 2901000 - Destroy COMPLETE
nid005428:29178:29178 [2] NCCL INFO comm 0x1958ecf0 rank 2 nranks 4 cudaDev 2 busId 2901000 - Destroy COMPLETE

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/with-ucc-nccl-lazy-3.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005428:29179:29179 [3] NCCL INFO cudaDriverVersion 12040
nid005428:29179:29179 [3] NCCL INFO Bootstrap : Using nmn0:10.100.48.116<0>
nid005428:29179:29179 [3] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005428:29179:29179 [3] NCCL INFO Comm config Blocking set to 1
nid005428:29179:29250 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005428:29179:29250 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005428:29179:29250 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005428:29179:29250 [3] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005428:29179:29250 [3] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005428:29179:29250 [3] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005428:29179:29250 [3] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005428:29179:29250 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005428:29179:29250 [3] NCCL INFO NET/OFI Creating one domain per process
nid005428:29179:29250 [3] NCCL INFO NET/OFI Support for global registrations: false
nid005428:29179:29250 [3] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005428:29179:29250 [3] NCCL INFO Using network Libfabric
nid005428:29179:29250 [3] NCCL INFO DMA-BUF is available on GPU device 3
nid005428:29179:29250 [3] NCCL INFO ncclCommInitRank comm 0x40d7e6e0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 3901000 commId 0xf3f6dd76363c981e - Init START
nid005428:29179:29250 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ff000000,00000000,00000000,00000000,00000000,00000000,00000000
nid005428:29179:29250 [3] NCCL INFO NVLS multicast support is not available on dev 3
nid005428:29179:29250 [3] NCCL INFO comm 0x40d7e6e0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
nid005428:29179:29250 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] -1/-1/-1->3->1 [5] -1/-1/-1->3->1 [6] -1/-1/-1->3->1 [7] -1/-1/-1->3->1 [8] 2/-1/-1->3->0 [9] 2/-1/-1->3->0 [10] 2/-1/-1->3->0 [11] 2/-1/-1->3->0 [12] -1/-1/-1->3->2 [13] -1/-1/-1->3->2 [14] -1/-1/-1->3->2 [15] -1/-1/-1->3->2 [16] -1/-1/-1->3->1 [17] -1/-1/-1->3->1 [18] -1/-1/-1->3->1 [19] -1/-1/-1->3->1 [20] 2/-1/-1->3->0 [21] 2/-1/-1->3->0 [22] 2/-1/-1->3->0 [23] 2/-1/-1->3->0
nid005428:29179:29250 [3] NCCL INFO P2P Chunksize set to 524288
nid005428:29179:29250 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005428:29179:29250 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005428:29179:29250 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
nid005428:29179:29250 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
nid005428:29179:29250 [3] NCCL INFO ncclCommInitRank comm 0x40d7e6e0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 3901000 commId 0xf3f6dd76363c981e - Init COMPLETE
nid005428:29179:29250 [3] NCCL INFO Init timings: rank 3 nranks 4 total 3.03 (kernels 0.08, bootstrap 2.53, allgathers 0.00, topo 0.33, graphs 0.03, connections 0.04, rest 0.02)
nid005428:29179:29179 [3] NCCL INFO Comm config Blocking set to 1
nid005428:29179:29301 [3] NCCL INFO Using network Libfabric
nid005428:29179:29301 [3] NCCL INFO DMA-BUF is available on GPU device 3
nid005428:29179:29301 [3] NCCL INFO ncclCommInitRank comm 0x416c8f40 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 3901000 commId 0x552d7e5ae73a2563 - Init START
nid005428:29179:29301 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ff000000,00000000,00000000,00000000,00000000,00000000,00000000
nid005428:29179:29301 [3] NCCL INFO NVLS multicast support is not available on dev 3
nid005428:29179:29301 [3] NCCL INFO comm 0x416c8f40 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
nid005428:29179:29301 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] -1/-1/-1->3->1 [5] -1/-1/-1->3->1 [6] -1/-1/-1->3->1 [7] -1/-1/-1->3->1 [8] 2/-1/-1->3->0 [9] 2/-1/-1->3->0 [10] 2/-1/-1->3->0 [11] 2/-1/-1->3->0 [12] -1/-1/-1->3->2 [13] -1/-1/-1->3->2 [14] -1/-1/-1->3->2 [15] -1/-1/-1->3->2 [16] -1/-1/-1->3->1 [17] -1/-1/-1->3->1 [18] -1/-1/-1->3->1 [19] -1/-1/-1->3->1 [20] 2/-1/-1->3->0 [21] 2/-1/-1->3->0 [22] 2/-1/-1->3->0 [23] 2/-1/-1->3->0
nid005428:29179:29301 [3] NCCL INFO P2P Chunksize set to 524288
nid005428:29179:29301 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
nid005428:29179:29301 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 16 p2p channels per peer
nid005428:29179:29301 [3] NCCL INFO ncclCommInitRank comm 0x416c8f40 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 3901000 commId 0x552d7e5ae73a2563 - Init COMPLETE
nid005428:29179:29301 [3] NCCL INFO Init timings: rank 3 nranks 4 total 0.60 (kernels 0.00, bootstrap 0.07, allgathers 0.00, topo 0.45, graphs 0.03, connections 0.04, rest 0.01)
nid005428:29179:29179 [3] NCCL INFO Comm config Blocking set to 1
nid005428:29179:29319 [3] NCCL INFO Using network Libfabric
nid005428:29179:29319 [3] NCCL INFO DMA-BUF is available on GPU device 3
nid005428:29179:29319 [3] NCCL INFO ncclCommInitRank comm 0x41705830 rank 1 nranks 2 cudaDev 3 nvmlDev 3 busId 3901000 commId 0xed8fa64eec944412 - Init START
nid005428:29179:29319 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ff000000,00000000,00000000,00000000,00000000,00000000,00000000
nid005428:29179:29319 [3] NCCL INFO comm 0x41705830 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005428:29179:29319 [3] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005428:29179:29319 [3] NCCL INFO P2P Chunksize set to 524288
nid005428:29179:29319 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005428:29179:29319 [3] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005428:29179:29319 [3] NCCL INFO ncclCommInitRank comm 0x41705830 rank 1 nranks 2 cudaDev 3 nvmlDev 3 busId 3901000 commId 0xed8fa64eec944412 - Init COMPLETE
nid005428:29179:29319 [3] NCCL INFO Init timings: rank 1 nranks 2 total 0.32 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.30, graphs 0.00, connections 0.01, rest 0.00)
nid005428:29179:29179 [3] NCCL INFO Comm config Blocking set to 1
nid005428:29179:29330 [3] NCCL INFO Using network Libfabric
nid005428:29179:29330 [3] NCCL INFO DMA-BUF is available on GPU device 3
nid005428:29179:29330 [3] NCCL INFO ncclCommInitRank comm 0x417456f0 rank 1 nranks 2 cudaDev 3 nvmlDev 3 busId 3901000 commId 0x787e4febd298f8f0 - Init START
nid005428:29179:29330 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ff000000,00000000,00000000,00000000,00000000,00000000,00000000
nid005428:29179:29330 [3] NCCL INFO comm 0x417456f0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
nid005428:29179:29330 [3] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
nid005428:29179:29330 [3] NCCL INFO P2P Chunksize set to 524288
nid005428:29179:29330 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nid005428:29179:29330 [3] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
nid005428:29179:29330 [3] NCCL INFO ncclCommInitRank comm 0x417456f0 rank 1 nranks 2 cudaDev 3 nvmlDev 3 busId 3901000 commId 0x787e4febd298f8f0 - Init COMPLETE
nid005428:29179:29330 [3] NCCL INFO Init timings: rank 1 nranks 2 total 0.36 (kernels 0.00, bootstrap 0.02, allgathers 0.00, topo 0.32, graphs 0.00, connections 0.02, rest 0.00)
nid005428:29179:29348 [3] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/CUMEM
nid005428:29179:29348 [3] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/CUMEM
nid005428:29179:29348 [3] NCCL INFO Channel 02/0 : 1[3] -> 0[2] via P2P/CUMEM
nid005428:29179:29348 [3] NCCL INFO Channel 03/0 : 1[3] -> 0[2] via P2P/CUMEM
nid005428:29179:29348 [3] NCCL INFO Channel 04/0 : 1[3] -> 0[2] via P2P/CUMEM
nid005428:29179:29348 [3] NCCL INFO Channel 05/0 : 1[3] -> 0[2] via P2P/CUMEM
nid005428:29179:29348 [3] NCCL INFO Channel 06/0 : 1[3] -> 0[2] via P2P/CUMEM
nid005428:29179:29348 [3] NCCL INFO Channel 07/0 : 1[3] -> 0[2] via P2P/CUMEM
nid005428:29179:29348 [3] NCCL INFO Connected all rings
nid005428:29179:29363 [3] NCCL INFO Channel 00/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005428:29179:29363 [3] NCCL INFO Channel 01/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005428:29179:29363 [3] NCCL INFO Channel 02/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005428:29179:29363 [3] NCCL INFO Channel 03/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005428:29179:29363 [3] NCCL INFO Channel 04/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005428:29179:29363 [3] NCCL INFO Channel 05/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005428:29179:29363 [3] NCCL INFO Channel 06/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005428:29179:29363 [3] NCCL INFO Channel 07/0 : 1[3] -> 0[1] via P2P/CUMEM
nid005428:29179:29363 [3] NCCL INFO Connected all rings
nid005428:29179:29373 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 06/0 : 3[3] -> 0[0] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 09/0 : 3[3] -> 0[0] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 12/0 : 3[3] -> 0[0] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 15/0 : 3[3] -> 0[0] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 18/0 : 3[3] -> 0[0] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 21/0 : 3[3] -> 0[0] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 02/0 : 3[3] -> 1[1] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 04/0 : 3[3] -> 1[1] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 08/0 : 3[3] -> 1[1] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 10/0 : 3[3] -> 1[1] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 14/0 : 3[3] -> 1[1] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 16/0 : 3[3] -> 1[1] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 20/0 : 3[3] -> 1[1] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 22/0 : 3[3] -> 1[1] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/CUMEM
nid005428:29179:29373 [3] NCCL INFO Connected all rings
nid005428:29179:29179 [3] NCCL INFO comm 0x416c8f40 rank 3 nranks 4 cudaDev 3 busId 3901000 - Destroy COMPLETE
nid005428:29179:29179 [3] NCCL INFO comm 0x41705830 rank 1 nranks 2 cudaDev 3 busId 3901000 - Destroy COMPLETE
nid005428:29179:29179 [3] NCCL INFO comm 0x417456f0 rank 1 nranks 2 cudaDev 3 busId 3901000 - Destroy COMPLETE
nid005428:29179:29179 [3] NCCL INFO comm 0x40d7e6e0 rank 3 nranks 4 cudaDev 3 busId 3901000 - Destroy COMPLETE

unset UCC_TL_NCCL_LAZY_INIT

NCCL_DEBUG=info srun -n4 -u -o without-ucc-nccl-lazy-%t.txt ./mp_syevd -p 2 -q 2
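Sanity check (a minimal sketch): confirm the variable is really gone from the environment the job step sees, since srun propagates the submission environment by default.

# print the value each rank sees; "<unset>" means the unset took effect
srun -n1 sh -c 'echo "UCC_TL_NCCL_LAZY_INIT=${UCC_TL_NCCL_LAZY_INIT:-<unset>}"'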

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/without-ucc-nccl-lazy-0.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005428:29522:29522 [0] NCCL INFO Bootstrap : Using nmn0:10.100.48.116<0>
nid005428:29522:29522 [0] NCCL INFO cudaDriverVersion 12040
nid005428:29522:29522 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005428:29522:29522 [0] NCCL INFO Comm config Blocking set to 1
nid005428:29522:29602 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005428:29522:29602 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005428:29522:29602 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005428:29522:29602 [0] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005428:29522:29602 [0] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005428:29522:29602 [0] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005428:29522:29602 [0] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005428:29522:29602 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005428:29522:29602 [0] NCCL INFO NET/OFI Creating one domain per process
nid005428:29522:29602 [0] NCCL INFO NET/OFI Support for global registrations: false
nid005428:29522:29602 [0] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005428:29522:29602 [0] NCCL INFO Using network Libfabric
slurmstepd: error: *** STEP 1264141.3 ON nid005428 CANCELLED AT 2025-07-08T16:49:56 ***

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/without-ucc-nccl-lazy-1.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005428:29523:29523 [1] NCCL INFO Bootstrap : Using nmn0:10.100.48.116<0>
nid005428:29523:29523 [1] NCCL INFO cudaDriverVersion 12040
nid005428:29523:29523 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005428:29523:29523 [1] NCCL INFO Comm config Blocking set to 1
nid005428:29523:29601 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005428:29523:29601 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005428:29523:29601 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005428:29523:29601 [1] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005428:29523:29601 [1] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005428:29523:29601 [1] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005428:29523:29601 [1] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005428:29523:29601 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005428:29523:29601 [1] NCCL INFO NET/OFI Creating one domain per process
nid005428:29523:29601 [1] NCCL INFO NET/OFI Support for global registrations: false
nid005428:29523:29601 [1] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005428:29523:29601 [1] NCCL INFO Using network Libfabric

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/without-ucc-nccl-lazy-2.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005428:29524:29524 [2] NCCL INFO Bootstrap : Using nmn0:10.100.48.116<0>
nid005428:29524:29524 [2] NCCL INFO cudaDriverVersion 12040
nid005428:29524:29524 [2] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005428:29524:29524 [2] NCCL INFO Comm config Blocking set to 1
nid005428:29524:29600 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005428:29524:29600 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005428:29524:29600 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005428:29524:29600 [2] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005428:29524:29600 [2] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005428:29524:29600 [2] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005428:29524:29600 [2] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005428:29524:29600 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005428:29524:29600 [2] NCCL INFO NET/OFI Creating one domain per process
nid005428:29524:29600 [2] NCCL INFO NET/OFI Support for global registrations: false
nid005428:29524:29600 [2] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005428:29524:29600 [2] NCCL INFO Using network Libfabric

/capstor/scratch/cscs/ialberto/2025-04-01-cusolvermp-spack/CUDALibrarySamples/cuSOLVERMp/build/without-ucc-nccl-lazy-3.txt

Parameters: m=1 n=128 nrhs=1 mbA=32 nbA=32 mbB=32 nbB=32 mbQ=32 nbQ=32 mbZ=0 nbZ=0ia=1 ja=1 ib=1 jb=1 iq=1 jq=1 iz=0 jz=0 p=2 q=2 grid_layout= verbose=0
nid005428:29525:29525 [3] NCCL INFO cudaDriverVersion 12040
nid005428:29525:29525 [3] NCCL INFO Bootstrap : Using nmn0:10.100.48.116<0>
nid005428:29525:29525 [3] NCCL INFO NCCL version 2.22.3+cuda12.6
nid005428:29525:29525 [3] NCCL INFO Comm config Blocking set to 1
nid005428:29525:29603 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nid005428:29525:29603 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nid005428:29525:29603 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
nid005428:29525:29603 [3] NCCL INFO NET/OFI Using Libfabric version 1.15
nid005428:29525:29603 [3] NCCL INFO NET/OFI Using CUDA driver version 12040 with runtime 12060

nid005428:29525:29603 [3] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: Function not implemented
nid005428:29525:29603 [3] NCCL INFO NET/OFI Selected provider is cxi, fabric is cxi (found 4 nics)
nid005428:29525:29603 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
nid005428:29525:29603 [3] NCCL INFO NET/OFI Creating one domain per process
nid005428:29525:29603 [3] NCCL INFO NET/OFI Support for global registrations: false
nid005428:29525:29603 [3] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
nid005428:29525:29603 [3] NCCL INFO Using network Libfabric

We'd like to use cuSOLVERMp on Alps GH200 with Slingshot.

The cuSOLVERMp documentation indicates that it depends on UCC/UCX. In particular, the requirements are:

  • HPC-X (OpenUCC and OpenUCX). Alternatively, "instead of HPC-X, you can install OpenUCX and OpenUCC manually."
  • CAL (Communication Abstraction Layer). AFAIK it is not open source and is provided only as a pre-built library. I looked into it, and it is in fact dynamically linked against UCC:
readelf -d lib/libcal.so | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libcuda.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libucc.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
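Since libcal.so only records a NEEDED entry for libucc.so.1, which UCC build it actually loads is decided by the dynamic linker at runtime. A quick way to see what gets resolved (a sketch; the path to libcal.so and the LD_LIBRARY_PATH in effect are install-specific assumptions):

# show which libucc.so.1 (and, transitively, which UCX libraries)
# the dynamic linker resolves for libcal.so in the current environment
ldd lib/libcal.so | grep -E 'ucc|ucx'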

In particular, UCC and UCX, in my limited (= no) experience with them, cannot be used on the Alps Slingshot network.
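One way to verify that on a compute node (a sketch; assumes ucx_info from the UCX install is on PATH, and that native Slingshot support would surface as something other than plain TCP):

# list the transports/devices UCX detects; if only self/shm/tcp show up,
# UCX has no native path onto the Slingshot (CXI) NICs
srun -n1 ucx_info -d | grep -E 'Transport|Device'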
