Why does NCCL not support one GPU multiple NICs in all2all collective communication? #804
Comments
What version are you using? We had that issue on old versions but it was supposed to be fixed now.
We are using NCCL 2.14.3. In which version was this problem supposed to be fixed? @sjeaugey
Hi, Sylvain, could you please answer this question for us? @sjeaugey
Sorry for the delay. Turns out I was wrong and it wasn't fixed. We're working on a fix right now, and we'll post updates here when ready.
Really great news! We are looking forward to your new NCCL version. In NCCL 2.14.3, we have hacked the function
Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788).
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and remote ID (#779).
Add build option to allow compilation against RDMA includes instead of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
I was testing NCCL between two nodes. Each node has one GPU and two NICs on the same NUMA node, and the NICs are connected directly to a single switch. Each NIC is assigned a different IP address.
I found that for all_reduce, NCCL already supports one GPU with two NICs, and the bandwidth of both NICs can be fully utilized.
However, for all2all, only one NIC is used and the other NIC is left unused.
Therefore, my question is: why does NCCL not support one GPU with two NICs in all2all? Is there some difficulty that prevents you from supporting this feature?
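To make the cost of the idle NIC concrete, here is a toy model in Python. It is only a sketch under stated assumptions and does not reflect NCCL's actual scheduling: the functions `nic_load` and `transfer_time`, the round-robin channel-to-NIC mapping, the channel count, the message size, and the 100 Gb/s link speed are all hypothetical values chosen for illustration.

```python
def nic_load(num_channels, num_nics, bytes_per_channel, spread):
    """Bytes carried by each NIC under a toy channel-to-NIC mapping.

    spread=False pins every channel to NIC 0 (the behavior observed in all2all).
    spread=True assigns channels round-robin (a hypothetical mapping, not NCCL's).
    """
    load = [0] * num_nics
    for channel in range(num_channels):
        nic = channel % num_nics if spread else 0
        load[nic] += bytes_per_channel
    return load


def transfer_time(load, nic_bandwidth):
    """NICs transfer in parallel, so the busiest NIC determines the total time."""
    return max(load) / nic_bandwidth


# Assumed setup: 4 channels, 2 NICs, 256 MiB per channel, 100 Gb/s per NIC.
NIC_BANDWIDTH = 12.5e9  # bytes per second, assumed
one_nic = nic_load(4, 2, 256 * 2**20, spread=False)
two_nics = nic_load(4, 2, 256 * 2**20, spread=True)

print("one NIC :", one_nic, transfer_time(one_nic, NIC_BANDWIDTH), "s")
print("two NICs:", two_nics, transfer_time(two_nics, NIC_BANDWIDTH), "s")
```

In this model the transfer takes twice as long when all channels share one NIC, which matches the bandwidth gap described above between all_reduce and all2all.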