Commit 00b5c62

Fix the work-index low-value issue in the CUDA-graph-enabled scenario. (#20)
* Enable msccl capability (#1)
* Moved to pbuilder
* Preparing for pbuild
* Added compute 5.3
* Added files via upload
* Delete libnccl-dev_1.1.1+cuda75_amd64.deb
* Delete libnccl1_1.1.1+cuda75_amd64.deb
* Use arch=5.3 as well
* Version with . 7.5
* fixed version format
* Removing Tegra
* Enable compilation with old g++ when the default g++ is not supported (+5.0)
* Add --no-as-needed to make sure that the cudart library gets linked
* Fix MPI test usage
Only display usage from rank 0 and exit instead of continuing (and segfaulting).
* Fix random deadlock during ncclCommInitRank.
* Fix readme to reflect the new test paths
* Moved no-as-needed flag to link rule.
Avoids link errors for tests linked with nvcc.
* Fixed install location, new .deb version
* Fixed version in ChangeLog
* Makefile improvements
- Use standard CXX env var
- Permit redefinition of more env
- Separate lib from tests
* Removing unneeded includes
* Better name for GENCODE
* Bump to 1.2.2
* Gencodes changed to NV recommended
* Changed CURAND generator to work on a wider set of platforms.
* Make NCCL collectives work on communicators with only one rank
* Only call the CUDA runtime. That may fix #27.
* Updating for .deb rebuild
* Include link to blog post in README.md
* Rework debian packaging
* Fix make install to use BUILDDIR
* Move deb to build directory
* Packaging : Generate shlibs.local
* Increased version to 1.2.3
* Add a debug level to NCCL and CUDA versions at init
* Fix version number
* Improved Deb generation
* Fixed redundant contexts in multi-process apps
Change-Id: If787014450fd281304f0c7baf01d25963e40905d
* Remove unneeded deb build script
* link library with -lrt; otherwise there is undefined reference to shm_open
* pass devlist as const int* rather than int* in ncclCommInitAll
* Updated LICENCE.txt
* Update LICENSE.txt
* Fix MPI test path
* Add profiling API
* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).
Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.
* Make tests check for deltas and report bandwidth
* Add scan tests
* Improved allreduce segmentation for small sizes
* NVML (libwrap) : import the needed definitions
* Fix primitives function prototype
* Bump to 1.3.1
* Add Fortran bindings
* Add Copyright header to Fortran bindings source files
* Remove irrelevant output from ncclReduce Fortran tests
* Add a static library target "staticlib" to the Makefile.
Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.
* Replace min BW by average BW in tests
* 1.3.2 release
Broadcast tuning
Better checking of inputs
Copy/reduce code simplification
* Adding missing file
* Fix 1.3.2 compilation
* Qualify nullptr_t with std::.
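For context (a minimal illustration, not code from this repository): `nullptr_t` is declared in namespace `std` in `<cstddef>`, and only some standard libraries leak the unqualified name into the global namespace, so unqualified uses break under clang/libc++.
```cpp
#include <cstddef>

// Portable spelling: the type is std::nullptr_t.
void reset(std::nullptr_t) {}

// 'void reset(nullptr_t) {}' compiles only when the standard library
// happens to inject nullptr_t globally, hence the qualification fix.
```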
* Fix crash in Reduce when non-root ranks have invalid recvbuff
* Fix copy/paste typo in error message
* Only enable peer access for ring neighbors.
This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.
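The mechanism behind this limit, sketched with the plain CUDA runtime API (a generic illustration, not this repository's code): peer access is enabled per device pair and each enabled pair consumes mapping resources, so enabling it only for ring neighbors avoids exhausting the per-device peer-access limit on large PCIe trees.
```cpp
#include <cuda_runtime.h>

// Enable P2P access from device 'self' to device 'peer' when supported.
static void enablePeerAccess(int self, int peer) {
  int canAccess = 0;
  cudaDeviceCanAccessPeer(&canAccess, self, peer);
  if (canAccess) {
    cudaSetDevice(self);
    cudaDeviceEnablePeerAccess(peer, 0);  // flags must currently be 0
  }
}
```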
* Bumping version to 1.3.3
* Fix compilation error when compiling with 'clang -x cuda'.
Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.
* Added Pascal nvcc flags, bumped version
* Add support for CUDA9 half semantics
* Update README to link to NCCL2
* Update README to link to NCCL2 #2
* Update README to link to NCCL2 part 3
* Update README to link to NCCL2
* fix tests on maxwell
* 2.3.5-5
Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.
* Fix nccl-tests all_reduce_perf path
It's `all_reduce_perf` not `allreduce_perf`
* 2.3.7-1
Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.
* Add install target
Fix issue #145
* Add instructions to install packaging toolchain
Address #143 and #150 : debuild not installed.
* Add official builds download link
* Generate nccl.h in build instead of src
Generating nccl.h in src makes source directories dirty after builds.
* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)
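A sketch of the idea (the hash function and combination are illustrative; the actual implementation may differ): fold the targets of both namespace links into the host hash so that two containers hash alike only when they truly share UTS and mount namespaces.
```cpp
#include <unistd.h>
#include <stdint.h>

// FNV-1a over the UTS and mount namespace link targets.
static uint64_t hostHashSketch(void) {
  const char* links[] = {"/proc/self/ns/uts", "/proc/self/ns/mnt"};
  uint64_t h = 0xcbf29ce484222325ull;  // FNV-1a offset basis
  char buf[128];
  for (int i = 0; i < 2; i++) {
    ssize_t n = readlink(links[i], buf, sizeof(buf) - 1);
    for (ssize_t j = 0; j < n; j++) {
      h = (h ^ (unsigned char)buf[j]) * 0x100000001b3ull;  // FNV prime
    }
  }
  return h;
}
```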
* Add support for external network.
Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network
* Make network isend/irecv non blocking
* Improve net API description
* Rework SYSCHECK macros to better handle retries.
SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.
Also rework the socket connection code and improve error reporting.
* Rework shared memory code to use SYSCHECK macros.
This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.
Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
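A minimal sketch of the retry-and-wrap pattern the two entries above describe (the macro and wrapper names are illustrative, not NCCL's actual SYSCHECK definitions):
```cpp
#include <errno.h>
#include <fcntl.h>

// Retry calls that report failure as -1 with errno while the failure is
// a transient EINTR/EAGAIN; intended for use inside int-returning code.
#define SYSCHECK_SKETCH(call) do {                              \
  int ret_;                                                     \
  do { ret_ = (call); }                                         \
  while (ret_ == -1 && (errno == EINTR || errno == EAGAIN));    \
  if (ret_ == -1) return -1;                                    \
} while (0)

// posix_fallocate does not set errno; it returns the error code
// directly, so a wrapper translates it to the classic convention.
static int wrapPosixFallocate(int fd, off_t offset, off_t len) {
  int err = posix_fallocate(fd, offset, len);
  if (err != 0) { errno = err; return -1; }
  return 0;
}
```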
* Fixed some compilation errors when TRACE=1 set
* Improve INFO message when external network is not found.
Fix #162
* Add NCCL_NET flag to many debug lines.
* Fix GPU Direct RDMA detection.
Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.
* Remove error logging from a normal path
When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)
* Fix dummy plugin
* Fix #163 : remove warnings
* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.
* Two temporary workarounds for cuda-clang issues.
* Qualify nullptr_t with std::
* Replace CUDA_VERSION by CUDART_VERSION
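The net effect of these two changes, sketched: version checks use `CUDART_VERSION` from the CUDA runtime headers, which both nvcc and clang see, instead of the nvcc-only `__CUDACC_VER_*__` macros.
```cpp
#include <cuda_runtime.h>  // defines CUDART_VERSION under nvcc and clang alike

#if CUDART_VERSION >= 9000
// CUDA 9+ only code path (e.g. half-precision semantics)
#endif
```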
* Fix memory leak in bootstrapRoot()
* 2.4.2-1
Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.
* Fix crash during shared memory creation (#185)
The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.
Co-authored-by: Rong Ou <[email protected]>
* Fix shared memory collision in the multi-communicator case.
The SHM object name used only pidHash and ranks for identification,
which collides when a program runs with multiple communicators. Adding
commId info into pidHash makes the hashes of different communicators
within the same process distinct from each other.
* NCCL 2.4.6-1
Added detection of IBM/Power NVLink bridge device.
Add NUMA support to PCI distance calculations.
Added NCCL_IGNORE_CPU_AFFINITY env var.
Fix memory leaks; GithubIssue#180
Compiler warning fix; GithubIssue#178
Replace non-standard variable length arrays. GithubIssue#171
Fix Tree+Shared Memory crash. GithubPR#185
Fix LL cleanup hang during long running DL jobs.
Fix NCCL_RINGS environment variable handling.
Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
Improve bootstrap socket connection reliability at scale.
Fix hostname hashing issue. GithubIssue#187
Code cleanup to rename all non device files from *.cu to *.cc
* Add pkgconfig file (#190)
* Allow CUDA runtime library selection (#220)
Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.
* NCCL 2.4.7-1
Performance tweaks for PowerPC builds only;
Set default NCCL_MIN_NRINGS to 4
Disable PCI-E NUMA distance detection
* Update debian dependencies in README (#228)
'fakeroot' is needed for building deb packages
* Fix out-of-bounds read in ncclStrToCpuset (#233)
The affinityStr string was not null-terminated but was passed to strlen(3).
Signed-off-by: Felix Abecassis <[email protected]>
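The underlying pattern, as a generic sketch (function and buffer names are illustrative): a bounded copy must be explicitly NUL-terminated before it is handed to `strlen(3)`.
```cpp
#include <string.h>

static size_t safeAffinityLen(const char* src) {
  char affinityStr[256];
  strncpy(affinityStr, src, sizeof(affinityStr) - 1);
  affinityStr[sizeof(affinityStr) - 1] = '\0';  // strncpy may not terminate
  return strlen(affinityStr);                   // now guaranteed in-bounds
}
```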
* 2.4.8-1
Fix #209: improve socket transport performance
Split transfers over multiple sockets
Launch multiple threads to drive sockets
Detect AWS NICs and set nsockets/nthreads accordingly
* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)
Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.
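To make the semantics concrete, an illustrative matcher (not NCCL's actual parser): without `=`, the spec `mlx5_1` also matches `mlx5_10`, `mlx5_11`, and so on; with `=mlx5_1`, only `mlx5_1` matches.
```cpp
#include <string.h>

// '=' prefix requests exact matching; otherwise match by prefix.
static bool hcaMatches(const char* spec, const char* dev) {
  if (spec[0] == '=') return strcmp(spec + 1, dev) == 0;  // exact
  return strncmp(spec, dev, strlen(spec)) == 0;           // prefix
}
```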
* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization
* Fix #224: prevent number of IB devices from going out of bound
* Fix NIC distances for 11+ NICs
* Refine RPM package building spec file.
Add /sbin/ldconfig into RPM package install operations.
* Make use of SO_REUSEPORT conditional
Fixes: #244
SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.
The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
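A sketch of the compile-time guard this entry describes (the helper name is illustrative):
```cpp
#include <sys/socket.h>

static int trySetReusePort(int fd) {
#ifdef SO_REUSEPORT
  int opt = 1;
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#else
  // Headers older than Linux 3.9: skip; the option is only needed when
  // the user pins the bootstrap address via an environment variable.
  return 0;
#endif
}
```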
* Updated PR#196 to use a common hash function
* 2.5.6-1 (#255)
Add LL128 Protocol.
Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.
Rework P2P/SHM detection in containers (#155, #248).
Detect duplicated devices and return an error (#231).
Add tuning for GCP
* Fix clang build (#271)
Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.
* Fix clang compilation
* 2.5.6-2
Fix PPC64 Debian packaging
* Fix clang build (#274)
The attribute is called `optnone`, not `noopt`.
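Putting the two clang fixes together, the wrapped attribute looks roughly like this (the macro name is hypothetical):
```cpp
#if defined(__clang__)
#define NOOPT_ATTR __attribute__((optnone))          // clang's spelling
#else
#define NOOPT_ATTR __attribute__((optimize("O0")))   // GCC's spelling
#endif

NOOPT_ATTR static void debugSpin(volatile int* flag) {
  while (*flag == 0) { /* busy-wait that must not be optimized away */ }
}
```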
* [build] Allow setting CXXFLAGS on the command line
* [topology] remove NET links when trimming system
This fixes a memory leak.
* 2.5.7-1
* Fix Allgather operations above 4G with multiple GPUs per process.
Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.
* Check return code for Flush operation
Current NCCL code does not abort on Flush operations that the
underlying network reports as failed. This may compromise data integrity.
Signed-off-by: Rashika Kheria <[email protected]>
* 2.6.4-1
Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
capability into a single structure and add other properties.
* Fix bug #307 : wrong NIC selection on the reduction tree.
The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.
* Rename the misnamed variable "slice" to "chunk"
https://github.com/NVIDIA/nccl/issues/287
* Improve robustness of PCI detection
Fallback to default values when class/speed is unknown.
* Fix crash when only a subset of GPUs are visible within a container.
Fixes #326.
* 2.7.3-1
Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).
* 2.7.5-1
Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.
* 2.7.6-1
Fix crash when NVswitch is not visible inside a VM.
* Fix build action order
Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.
* 2.7.8-1
Fix collective mismatch error when using ncclSend/ncclRecv
* Don't require NIC devices to have specific PCI class
If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.
For consistency, treat GPUs similarly.
* Setting type when gpu sub node is discovered
* Make sure proxy threads inherit the CPU affinity.
* Fix affinity move
* fix proxyArgs for trace log
* 2.8.3-1
Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.
* x86: Add CPU detection for Zhaoxin processors
Signed-off-by: Jonas Zhou <[email protected]>
* 2.8.4-1
Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.
* 2.9.6-1
Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.
* 2.9.8-1
Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.
* 2.9.9-1
Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)
* 2.10.3-1
Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.
* Fix to https://github.com/NVIDIA/nccl/issues/560
ncclGroups containing operations of mixed datatype, element count, or
collective type would induce a crash.
* 2.11.4-1
Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
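Usage of the new reduction-op API as exposed by the public NCCL headers from 2.11 onward (a sketch; buffers, comm, and stream are assumed to be initialized, and error handling is elided):
```cpp
#include <nccl.h>

// Average across ranks: pre-multiply each rank's input by 1/nRanks, then sum.
void allreduceAverage(const float* sendbuf, float* recvbuf, size_t count,
                      int nRanks, ncclComm_t comm, cudaStream_t stream) {
  float scalar = 1.0f / nRanks;
  ncclRedOp_t premulSum;
  ncclRedOpCreatePreMulSum(&premulSum, &scalar, ncclFloat32,
                           ncclScalarHostImmediate, comm);
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat32, premulSum, comm, stream);
  ncclRedOpDestroy(premulSum, comm);
}
```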
* Fix Collnet when GDR is disabled
* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.
* Perform `busIdToInt64` on the stack.
I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
* Improve warning message about truncated messages
Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.
* Add env NCCL_NET_DISABLE_INTRA
Setting the env to 1 disables the NET transport for intra-node communication.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.
* Build fastsocket plugin from ext-net
* remove unused basePath
* Revert "remove unused basePath"
This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.
* Fix ext-net/google-fastsocket build
* Split IB parameter sanity check into two parts
First part on collective mismatch, second part on internal errors
* 2.12.7-1
Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.
* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.
Fixes https://github.com/NVIDIA/nccl/issues/649
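The pattern, sketched (the helper name is illustrative): detach fire-and-forget threads so their resources are reclaimed without a join, which also quiets ThreadSanitizer.
```cpp
#include <pthread.h>

static void launchDetached(void* (*fn)(void*), void* arg) {
  pthread_t t;
  if (pthread_create(&t, nullptr, fn, arg) == 0)
    pthread_detach(t);  // never joined; resources reclaimed at thread exit
}
```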
* Remove unnecessary newline in plugin logging
Signed-off-by: Felix Abecassis <[email protected]>
* Fix typo in net_ib.cc
* Display host name instead of numeric IP when referring to a peer
For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"
* Fix merging error
* 2.12.10-1
Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.
* Update Makefile to install static library.
Make sure make install also installs the static library.
Fixes #662
* 2.12.12-1
Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.
* 2.13.4-1
Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
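How the two error-reporting additions fit together, per the public API (a sketch with minimal handling):
```cpp
#include <stdio.h>
#include <nccl.h>

void reportNcclFailure(ncclResult_t res, ncclComm_t comm) {
  if (res == ncclSuccess) return;
  if (res == ncclRemoteError)  // new code for failures on a remote peer/network
    fprintf(stderr, "remote NCCL error\n");
  fprintf(stderr, "NCCL: %s\n", ncclGetLastError(comm));  // last error string
}
```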
* fix NCCL_DEBUG_FILE
Summary: NCCL_DEBUG_FILE does not work properly since the v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682), because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).
Differential Revision: D38415208
fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
* 2.14.3-1
Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
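A sketch of the non-blocking init path introduced here, using the public config API (`id`, `rank`, and `nranks` are assumed to come from the usual bootstrap; the polling loop is simplified). Teardown pairs `ncclCommFinalize()` with `ncclCommDestroy()`.
```cpp
#include <nccl.h>

ncclComm_t initNonBlocking(ncclUniqueId id, int rank, int nranks) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;                      // return immediately, poll below
  ncclComm_t comm;
  ncclCommInitRankConfig(&comm, nranks, id, rank, &config);
  ncclResult_t state = ncclInProgress;
  while (state == ncclInProgress)
    ncclCommGetAsyncError(comm, &state);    // poll until init completes
  return comm;
}
```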
* Fix intermittent 11.6 builds: generate unique .cu file for each object file
* address review comments
* Fix potential deadlock during init in multi-thread mode.
Make sure all calls that invoke cudaMalloc (including devCommSetup) are
made before the last bootstrapBarrier. That way, we avoid cudaMalloc
calls being blocked by an NCCL kernel launched on another GPU by
another thread that completed init faster.
Resolve #623.
* Use compatibility shim only with static cudart
Closes issue 658
* 2.15.1-1
Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.
* Fixes a double-free in the error path of ncclCommInitAll.
Fixes https://github.com/NVIDIA/nccl/issues/726
* 2.15.5-1
Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.
* Add documentation for NCCL NET plugins
Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.
* Fix google-fastsocket plugin build
* 2.16.2-1
Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.
* Fix maximum handle size for NCCL Net v4 API
NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.
Signed-off-by: Rashika Kheria <[email protected]>
* 2.16.5-1
Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit
* 2.17.1-1
Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
* Port "Add MSCCL Support" from RCCL (#658)
* Continue porting for #658; add misc/msccl
* Port "Add MSCCL Support" from RCCL (#694)
* Remove unnecessary change-history comments
* Resolve the build-related issue
* Enable MSCCL when generating the topology info during initialization
* Fix the NCCL BF check issue
* Fix the memory alignment issue for the LL protocol
* Add FP8 support
* Add alltoall interface
* Add multi-memory support for FP8
* Fix the test script bug that failed to generate the algo file under certain conditions
* Fix the Simple protocol thread count issue and improve the test tool with operator support
* Fix test script issue
* Fix the memory conflict issue in the Simple protocol
* Fix the LL128 shared memory issue and the one-process-with-multiple-GPUs issue
* Remove the unnecessary code to enable CUDA graphs, and add perf test scenarios
* Fix FP8 issue
* Fix the max/min mismatch issue and the LL128 shared memory issue
* Turn off ops when the op index is equal to or larger than avg
* Optimize the test script to accommodate the GPU count
* Support perf test environment initialization
* Fix environment preparation bug
* Fix environment preparation bug
* Enable auto-build if NCCL was not built before
* Add customized topo and graph file support during tests
* Enable NDv5 test scenarios
* Enable CUDA graphs for multiple nodes
* Initiate the multi-node test
* Enable NCv4 compatibility
* Fix multi-node test issue
* Enable multi-node for the NDv4 test scenario
* Fix the IB bandwidth test case issue
* Fix issues related to the fence, proxy, and sync flag settings
* Unify the topo file name across different SKUs
* Add VMSS creation script
* Fix the test case issue for the multi-node scenario
* Change the algo for the multi-node scenario
* Change maxbytes to a smaller value, e.g. 65565, for the multi-node scenario
---------
Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>
* Enable msccl capability (#5)
* Enable msccl capability (#2)
1 parent 0f9e817 · commit 00b5c62
1 file changed: src/collectives/device/msccl_kernel.cu (+7, −1)
Changed hunks: lines 23–25 added; line 201 replaced by lines 204–207.