different nChannels # due to different PCIe topology #439
In the bad case, the topology shows both NICs under the same PCI link, so the topology search does not see how using 2 NICs would give better bandwidth, since both would have to go over the single link between 0000:52:00.0 (PCI/52000) and 0000:41:00.0 (PCI/41000). So we end up with only 1 ring instead of 2, hence 24 GB/s instead of 48 GB/s. Edit: added the relevant topology graph.
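For illustration, the hierarchy described above would appear in an NCCL topology dump roughly as in the sketch below. Only the 0000:41:00.0 and 0000:52:00.0 bus IDs come from this comment; every other ID, the GPU placement, and the attributes are invented placeholders, with most attributes omitted.

```xml
<!-- Rough sketch only, not a real dump: both NICs sit behind one sub-switch, -->
<!-- so all NIC traffic shares the single 52:00.0 <-> 41:00.0 link.           -->
<pci busid="0000:41:00.0" class="0x060400">          <!-- upstream switch port -->
  <pci busid="0000:52:00.0" class="0x060400">        <!-- single sub-switch: the bottleneck -->
    <pci busid="0000:53:00.0" class="0x020700"><nic><net name="mlx5_0"/></nic></pci>
    <pci busid="0000:54:00.0" class="0x020700"><nic><net name="mlx5_1"/></nic></pci>
  </pci>
  <pci busid="0000:55:00.0" class="0x030200"><gpu dev="2"/></pci>
  <pci busid="0000:56:00.0" class="0x030200"><gpu dev="3"/></pci>
</pci>
```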
But actually the 2 GPUs and the 2 HDR NICs are connected to a non-blocking PCIe switch, so the bandwidth is there. It's different from a scenario like a single card with a dual controller. Where in the code can I tweak a bit and verify this? Likely by skipping that check, keeping the channel, getting sameChannels=0, and using both rings. We'd need to run with all the system resources.
BTW, these are actually two 200G HDR NICs, not a single device with dual ports. The full bandwidth can be utilized.
The PCI switch is advertising a two-level hierarchy, where both NICs (HDR, Gen4 x16) are on a single sub-switch. So, at least from what the switch advertises as the PCI topology, there is a bottleneck and we cannot use the full bandwidth. Now maybe what's advertised is not true, in which case I'm not sure why the switch would show that hierarchy; it should just show a flat switch. As a workaround, you could dump the topology, edit the XML to fuse the two switch levels into one, and load the modified file back. But the clean solution is to have the PCI switch advertise something that's real, or to rebalance the NICs onto different PCI slots so that they're not on the same sub-switch.
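For reference, the dump/override path mentioned above is exposed through NCCL's NCCL_TOPO_DUMP_FILE and NCCL_TOPO_FILE environment variables. Below is a minimal sketch of that workflow using an nccl-tests binary as a stand-in workload; the file path, message sizes, and GPU count are placeholders.

```sh
# 1. Dump the topology NCCL detects into an XML file.
NCCL_TOPO_DUMP_FILE=system.xml ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8

# 2. Hand-edit system.xml so the two NIC bridges sit under a single fused switch.

# 3. Point NCCL at the edited file instead of the detected topology and re-run.
NCCL_TOPO_FILE=system.xml ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8
```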
Tweaked the system.xml and got the expected performance. Thanks, Sylvain! Much appreciated.
From the NCCL release notes:
- Add support for CUDA graphs.
- Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
- Fix bootstrap issue caused by connection reordering.
- Fix CPU locking block.
- Improve CollNet algorithm.
- Improve performance on DGX A100 for communicators with only one GPU per node.
Running on an AMD system with an A100 HGX. We have been seeing different bandwidth results, and it seems to be due to the reported topologies.
Why did the mlx5_6/mlx5_7 pair get better bandwidth than the mlx5_0/mlx5_1 pair?
The mlx5_0/mlx5_1 run only used 1 HDR HCA.
20201212_debug_mlx5_6-7_gpu_4-5_detail.txt
20201212_debug_mlx5_0-1_gpu_2-3_detail.txt
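For context, per-pair comparisons like the two attached logs are typically produced by pinning the GPU pair and the HCA pair for each run. A hypothetical two-node nccl-tests launch is sketched below; the host names, message sizes, and MPI layout are assumptions and are not taken from the attached logs.

```sh
# "Good" pair: GPUs 4-5 with mlx5_6/mlx5_7
mpirun -np 4 -H node1:2,node2:2 \
  -x CUDA_VISIBLE_DEVICES=4,5 -x NCCL_IB_HCA=mlx5_6,mlx5_7 -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1

# "Bad" pair: GPUs 2-3 with mlx5_0/mlx5_1 (only one HCA ends up carrying traffic)
mpirun -np 4 -H node1:2,node2:2 \
  -x CUDA_VISIBLE_DEVICES=2,3 -x NCCL_IB_HCA=mlx5_0,mlx5_1 -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
```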