
different nChannels # due to different PCIe topology #439

Open
ctsengwen opened this issue Dec 14, 2020 · 5 comments

@ctsengwen

Running on an AMD system with an A100 HGX baseboard. I have been seeing different bandwidth results, which seem to be caused by differences in the PCIe topology.

Why does the mlx5_6/mlx5_7 pair get better bandwidth than the mlx5_0/mlx5_1 pair?

The mlx5_0/mlx5_1 pair only used 1 HDR HCA.

20201212_debug_mlx5_6-7_gpu_4-5_detail.txt

20201212_debug_mlx5_0-1_gpu_2-3_detail.txt

@sjeaugey
Member

sjeaugey commented Dec 14, 2020

In the bad case, the topology shows both NICs under the same PCI link, so the topology search does not see how using 2 NICs would give better bandwidth, since both would have to go over the single link between 0000:52:00.0 (PCI/52000) and 0000:41:00.0 (PCI/41000). So we end up with only 1 ring instead of 2, hence 24 GB/s instead of 48 GB/s.

Edit: adding the relevant topology graph:

CPU/1 (1/2/-1)
 + PCI[24.0] - PCI/41000
               + PCI[24.0] - PCI/43000
                             + PCI[24.0] - PCI/45000
                                           + PCI[24.0] - GPU/47000 (0)
                                                         + NVL[252.0] - NVS/0
               + PCI[24.0] - PCI/49000
                             + PCI[24.0] - PCI/4B000
                                           + PCI[24.0] - GPU/4D000 (1)
                                                         + NVL[252.0] - NVS/0
               + PCI[24.0] - PCI/52000
                             + PCI[24.0] - NIC/54000
                                           + NET[25.0] - NET/0 (e39650003a1420c/1/25.000000)
                             + PCI[24.0] - NIC/55000
                                           + NET[25.0] - NET/1 (45a600003a1420c/1/25.000000)
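(The [24.0] / [25.0] figures in the graph are presumably per-link bandwidths in GB/s, consistent with the numbers above: one ring squeezed through the shared 41000–52000 link tops out around 24 GB/s, whereas two rings over independent links would reach roughly 2 × 24 ≈ 48 GB/s.)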

@ctsengwen
Author

But actually the 2 GPUs and 2 HDR NICs are connected to a non-blocking PCIe switch, so the bandwidth is there. This is different from the scenario of a single card with a dual controller. Where in the code can I tweak things to verify this, e.g. skip that check, keep the channel, get sameChannels=0, and use both rings?

We'd like to run with all of the system's resources.

@ctsengwen
Author

BTW, these are actually two 200G HDR NICs, not a single device with dual ports, so the full bandwidth can be utilized.

@sjeaugey
Member

sjeaugey commented Dec 15, 2020

The PCI switch is advertising a two-level hierarchy, where both NICs (HDR, Gen4 x16) are on a single sub-switch. So, at least from what the switch advertises as the PCI topology, there is a bottleneck and we cannot use full bandwidth.

Now, maybe what's advertised is not true, in which case I'm not sure why the switch would show that hierarchy; it should just show a flat switch.

As a workaround, you could dump the topology (NCCL_TOPO_DUMP_FILE=system.xml), remove all <gpu> and <nic> tags, remove that intermediate switch level so that PCI devices are all directly attached to PCI/41000, then re-inject it into NCCL (NCCL_TOPO_FILE=system.xml).

But the clean solution is to have the PCI switch advertise something that's real, or to rebalance the NICs onto different PCI slots so that they're not on the same sub-switch.
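For concreteness, a rough sketch of that workaround, assuming the runs are driven by the nccl-tests all_reduce_perf binary (the binary path, test flags, and file name below are illustrative, not taken from this thread):

# 1) Dump the topology NCCL detects into an XML file
NCCL_TOPO_DUMP_FILE=system.xml ./build/all_reduce_perf -b 8 -e 2G -f 2 -g 2

# 2) Hand-edit system.xml: remove the <gpu> and <nic> tags, and move the <pci>
#    entries sitting under the 0000:52:00.0 sub-switch up one level so that
#    they hang directly off the 0000:41:00.0 switch.

# 3) Re-run with the edited topology injected back into NCCL
NCCL_TOPO_FILE=system.xml ./build/all_reduce_perf -b 8 -e 2G -f 2 -g 2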

@ctsengwen
Author

ctsengwen commented Dec 15, 2020

Tweaked the system.xml and got the expected performance. Thanks Sylvain, much appreciated!
Will work with the NVIDIA team and our internal team to see how we can work out the right solution for this.

#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
delta01:8074:8074 [0] NCCL INFO Launch mode Parallel
           8             2   float     sum     9.75    0.00    0.00  0e+00     9.73    0.00    0.00  0e+00
          16             4   float     sum     9.74    0.00    0.00  0e+00     9.64    0.00    0.00  0e+00
          32             8   float     sum     9.75    0.00    0.00  0e+00     9.70    0.00    0.00  0e+00
          64            16   float     sum     9.95    0.01    0.01  0e+00     9.71    0.01    0.01  0e+00
         128            32   float     sum     9.99    0.01    0.01  0e+00     9.82    0.01    0.01  0e+00
         256            64   float     sum    10.01    0.03    0.03  0e+00     9.77    0.03    0.03  0e+00
         512           128   float     sum    10.05    0.05    0.05  0e+00     9.87    0.05    0.05  0e+00
        1024           256   float     sum    10.40    0.10    0.10  0e+00    10.06    0.10    0.10  0e+00
        2048           512   float     sum    10.62    0.19    0.19  0e+00    10.24    0.20    0.20  0e+00
        4096          1024   float     sum    11.04    0.37    0.37  0e+00    10.81    0.38    0.38  0e+00
        8192          2048   float     sum    13.09    0.63    0.63  0e+00    12.65    0.65    0.65  0e+00
       16384          4096   float     sum    18.50    0.89    0.89  0e+00    18.09    0.91    0.91  0e+00
       32768          8192   float     sum    21.16    1.55    1.55  0e+00    20.80    1.58    1.58  0e+00
       65536         16384   float     sum    26.90    2.44    2.44  0e+00    26.55    2.47    2.47  0e+00
      131072         32768   float     sum    39.67    3.30    3.30  0e+00    38.75    3.38    3.38  0e+00
      262144         65536   float     sum    51.38    5.10    5.10  0e+00    50.81    5.16    5.16  0e+00
      524288        131072   float     sum    56.66    9.25    9.25  0e+00    55.95    9.37    9.37  0e+00
     1048576        262144   float     sum    67.84   15.46   15.46  0e+00    68.81   15.24   15.24  0e+00
     2097152        524288   float     sum    89.75   23.37   23.37  0e+00    85.87   24.42   24.42  0e+00
     4194304       1048576   float     sum    92.30   45.44   45.44  0e+00    92.92   45.14   45.14  0e+00
     8388608       2097152   float     sum    125.8   66.70   66.70  0e+00    130.0   64.51   64.51  0e+00
    16777216       4194304   float     sum    169.2   99.14   99.14  0e+00    171.3   97.95   97.95  0e+00
    33554432       8388608   float     sum    278.9  120.29  120.29  0e+00    279.1  120.24  120.24  0e+00
    67108864      16777216   float     sum    465.9  144.03  144.03  0e+00    410.5  163.49  163.49  0e+00
   134217728      33554432   float     sum    782.0  171.64  171.64  0e+00    780.5  171.97  171.97  0e+00
   268435456      67108864   float     sum   1475.4  181.94  181.94  0e+00   1471.3  182.45  182.45  0e+00
   536870912     134217728   float     sum   2832.3  189.56  189.56  0e+00   2853.1  188.17  188.17  0e+00
  1073741824     268435456   float     sum   5431.1  197.70  197.70  0e+00   5425.9  197.89  197.89  0e+00
  2147483648     536870912   float     sum    10502  204.49  204.49  0e+00    10487  204.78  204.78  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 51.4517

sjeaugey added a commit that referenced this issue Apr 16, 2021
Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.