NCCL 2.13 Preview #682

Merged 1 commit on Jul 14, 2022
Conversation

sjeaugey
Member

Pushing this branch as a preview of NCCL 2.13. Feel free to give it a try and report issues on this PR.

@rashikakheria @changlan this introduces the v6 NET plugin API, which includes support for dmabuf. There is still a v4->v5->v6 compat layer.

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
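
For anyone trying the preview, below is a minimal sketch (not code from this PR) of how the new error-reporting additions could be used from an application. The wrapper function and buffer setup are assumptions made for illustration; `ncclGetLastError()`, `ncclGetErrorString()` and the `ncclRemoteError` result code correspond to the items listed above.

```
/* Sketch: surface the new ncclRemoteError code and the last-error string
 * added in NCCL 2.13. Assumes comm, sendbuf, recvbuf and stream were set
 * up elsewhere; the wrapper name is illustrative. */
#include <stdio.h>
#include <nccl.h>

ncclResult_t allreduceWithDiagnostics(const float* sendbuf, float* recvbuf,
                                      size_t count, ncclComm_t comm,
                                      cudaStream_t stream) {
  ncclResult_t res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat,
                                   ncclSum, comm, stream);
  if (res != ncclSuccess) {
    if (res == ncclRemoteError)  /* new in 2.13: error reported by a remote peer */
      fprintf(stderr, "remote network error detected\n");
    fprintf(stderr, "NCCL failure: %s (last error: %s)\n",
            ncclGetErrorString(res), ncclGetLastError(comm));
  }
  return res;
}
```
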
@rajachan

Is there a driver+device+runtime stack combination net_ib requires for use with an NVIDIA GPU as a dmabuf exporter?

sjeaugey deleted the v2.13 branch July 15, 2022 07:42
@spotluri
Collaborator

@rajachan these are the requirements:
Toolkit version: 11.7+
Driver version: 515.43.04+ w/ OpenRM (default for Turing+)
Device: Turing+
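
A hedged sketch of how an application could sanity-check most of those requirements at runtime is shown below. It only covers the CUDA toolkit version and compute capability; the OpenRM driver flavor is not visible through the CUDA runtime API, and the helper name is made up for illustration.

```
/* Sketch: check the dmabuf prerequisites listed above (CUDA 11.7+, Turing+).
 * The driver-flavor requirement (OpenRM) cannot be verified through the
 * CUDA runtime API, so it is not checked here. */
#include <cuda_runtime.h>

int dmabufPrereqsLikelyMet(int device) {
  int runtimeVersion = 0;
  struct cudaDeviceProp prop;
  if (cudaRuntimeGetVersion(&runtimeVersion) != cudaSuccess) return 0;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
  /* 11070 == CUDA 11.7; Turing corresponds to compute capability 7.5 */
  return runtimeVersion >= 11070 &&
         (prop.major > 7 || (prop.major == 7 && prop.minor >= 5));
}
```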

kingchc added a commit to kingchc/nccl that referenced this pull request Aug 4, 2022
Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (NVIDIA#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, so that `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
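
For context, the ordering described in that summary can be sketched as follows. This is illustrative C, not the actual NCCL debug-init code; the level values and helper are simplified, while the `tempNcclDebugLevel`, `NCCL_DEBUG` and `NCCL_DEBUG_FILE` names come from the summary above.

```
/* Illustrative sketch of the corrected ordering: parse NCCL_DEBUG into a
 * temporary level first, then consult NCCL_DEBUG_FILE only when that level
 * is above NCCL_LOG_VERSION (the pre-2.13.4 behavior). Not the real
 * ncclDebugInit(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { NCCL_LOG_NONE = 0, NCCL_LOG_VERSION = 1, NCCL_LOG_WARN = 2, NCCL_LOG_INFO = 3 };

static FILE* debugInitSketch(void) {
  int tempNcclDebugLevel = NCCL_LOG_NONE;
  const char* level = getenv("NCCL_DEBUG");
  if (level) {
    if (strcmp(level, "VERSION") == 0) tempNcclDebugLevel = NCCL_LOG_VERSION;
    else if (strcmp(level, "WARN") == 0) tempNcclDebugLevel = NCCL_LOG_WARN;
    else if (strcmp(level, "INFO") == 0) tempNcclDebugLevel = NCCL_LOG_INFO;
  }
  /* The fix: only look at NCCL_DEBUG_FILE once the level is known. */
  FILE* out = stdout;
  const char* file = getenv("NCCL_DEBUG_FILE");
  if (tempNcclDebugLevel > NCCL_LOG_VERSION && file != NULL) {
    FILE* f = fopen(file, "w");
    if (f) out = f;
  }
  return out;
}
```
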
sjeaugey pushed a commit that referenced this pull request Aug 18, 2022
Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, so that `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
spease-fb pushed a commit to spease-fb/nccl that referenced this pull request Dec 21, 2022
Squashed commit of the following:

commit 28189e2
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 29 04:27:46 2022 -0800

    2.16.2-1

    Add support for CUDA 12.0, drop Kepler (sm_35).
    Support for H100 features.
    Make socket code more robust and protected. Solves NVIDIA#555.
    Improve performance on large CUDA graphs, reducing dependencies.
    Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
    Various fixes to ncclCommAbort.
    Make service thread polling resistant to EINTR.
    Compile with profiling API by default.
    Extend NVTX instrumentation with call arguments.

commit 614b49f
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 22 02:13:13 2022 -0800

    Fix google-fastsocket plugin build

commit 55b1d8a
Author: Sylvain Jeaugey <[email protected]>
Date:   Mon Nov 21 06:03:27 2022 -0800

    Add documentation for NCCL NET plugins

    Also repurpose dummy plugin as example, including headers and
    compat layers from v6 to v2.

commit 2f4cb87
Merge: d128d62 cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 01:15:22 2022 -0700

    Merge tag 'v2.15.5-1'

commit cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 00:55:55 2022 -0700

    2.15.5-1

    Fix crash with CollnetChain on some node topologies
    Fix hang when interleaving the capture of different graphs
    Fix hang during init in multi-threaded mode
    Fix potential data corruption with LL128 protocol on unaligned buffers.
    Fix CPU usage during preconnect
    Fixes double-free in the error path for ncclCommInitAll
    Workaround hang on H100 with Ring/LL128 on 2 GPUs.

commit d128d62
Merge: 2401f4a da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Fri Oct 7 11:00:26 2022 -0700

    Merge tag 'v2.15.1-1'

commit 2401f4a
Author: John Bachan <[email protected]>
Date:   Mon Oct 3 17:02:15 2022 -0700

    Fixes a double-free in the error path of ncclCommInitAll.

    Fixes NVIDIA#726

commit da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:31:13 2022 -0700

    2.15.1-1

    Add support for H100 (sm90).
    Make sure NCCL kernels honor user stream priorities.

commit 99c28f2
Merge: 78313a6 ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:24:41 2022 -0700

    Merge remote-tracking branch 'origin/master'

commit 78313a6
Author: Cliff Woolley <[email protected]>
Date:   Fri Aug 26 15:00:18 2022 -0700

    Use compatibility shim only with static cudart

    Closes issue 658

commit ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Sep 22 01:04:50 2022 -0700

    Fix potential deadlock during init in multi-thread mode.

    Make sure all calls to cudaMalloc (including devCommSetup) happen
    before the last bootstrapBarrier. That way, we avoid calls to
    cudaMalloc being blocked by an NCCL kernel launched on another GPU by
    another thread that completed init faster.

    Resolve NVIDIA#623.

commit f89fd47
Author: Jane Xu <[email protected]>
Date:   Wed Sep 14 11:16:17 2022 -0400

    address review comments

commit 79fb032
Author: Jane Xu <[email protected]>
Date:   Tue Sep 13 16:05:21 2022 -0400

    Fix intermittent 11.6 builds: generate unique .cu file for each object file

commit c4e2aa6
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Aug 18 02:53:17 2022 -0700

    2.14.3-1

    Add support for improved fault tolerance: non-blocking mode, new
    init function with config, and ncclCommFinalize function.
    Reintroduce collnet+chain algorithm, alongside collnet+direct.
    Add LL protocol for intra-node P2P (on by default) and network
    communication (off by default).
    Use network instead of shared memory when performance is better.
    Fix: wait for CUDA graph destroy before destroying comm with linked
    graph resources.
    Remove aggressive polling during enqueue.
    Fix DMABUF fallback on MOFED 5.4 and earlier.

commit e1d9b27
Author: Ching-Hsiang Chu <[email protected]>
Date:   Wed Aug 3 20:47:40 2022 -0700

    fix NCCL_DEBUG_FILE

    Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (NVIDIA#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, so that `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

    Differential Revision: D38415208

    fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
facebook-github-bot pushed a commit to facebookresearch/nccl that referenced this pull request Dec 23, 2022
Summary:
Squashed commit of the following:

commit 28189e2
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 29 04:27:46 2022 -0800

    2.16.2-1

    Add support for CUDA 12.0, drop Kepler (sm_35).
    Support for H100 features.
    Make socket code more robust and protected. Solves NVIDIA#555.
    Improve performance on large CUDA graphs, reducing dependencies.
    Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
    Various fixes to ncclCommAbort.
    Make service thread polling resistant to EINTR.
    Compile with profiling API by default.
    Extend NVTX instrumentation with call arguments.

commit 614b49f
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 22 02:13:13 2022 -0800

    Fix google-fastsocket plugin build

commit 55b1d8a
Author: Sylvain Jeaugey <[email protected]>
Date:   Mon Nov 21 06:03:27 2022 -0800

    Add documentation for NCCL NET plugins

    Also repurpose dummy plugin as example, including headers and
    compat layers from v6 to v2.

commit 2f4cb87
Merge: d128d62 cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 01:15:22 2022 -0700

    Merge tag 'v2.15.5-1'

commit cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 00:55:55 2022 -0700

    2.15.5-1

    Fix crash with CollnetChain on some node topologies
    Fix hang when interleaving the capture of different graphs
    Fix hang during init in multi-threaded mode
    Fix potential data corruption with LL128 protocol on unaligned buffers.
    Fix CPU usage during preconnect
    Fixes double-free in the error path for ncclCommInitAll
    Workaround hang on H100 with Ring/LL128 on 2 GPUs.

commit d128d62
Merge: 2401f4a da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Fri Oct 7 11:00:26 2022 -0700

    Merge tag 'v2.15.1-1'

commit 2401f4a
Author: John Bachan <[email protected]>
Date:   Mon Oct 3 17:02:15 2022 -0700

    Fixes a double-free in the error path of ncclCommInitAll.

    Fixes NVIDIA#726

commit da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:31:13 2022 -0700

    2.15.1-1

    Add support for H100 (sm90).
    Make sure NCCL kernels honor user stream priorities.

commit 99c28f2
Merge: 78313a6 ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:24:41 2022 -0700

    Merge remote-tracking branch 'origin/master'

commit 78313a6
Author: Cliff Woolley <[email protected]>
Date:   Fri Aug 26 15:00:18 2022 -0700

    Use compatibility shim only with static cudart

    Closes issue 658

commit ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Sep 22 01:04:50 2022 -0700

    Fix potential deadlock during init in multi-thread mode.

    Make sure all calls to cudaMalloc (including devCommSetup) happen
    before the last bootstrapBarrier. That way, we avoid calls to
    cudaMalloc being blocked by an NCCL kernel launched on another GPU by
    another thread that completed init faster.

    Resolve NVIDIA#623.

commit f89fd47
Author: Jane Xu <[email protected]>
Date:   Wed Sep 14 11:16:17 2022 -0400

    address review comments

commit 79fb032
Author: Jane Xu <[email protected]>
Date:   Tue Sep 13 16:05:21 2022 -0400

    Fix intermittent 11.6 builds: generate unique .cu file for each object file

commit c4e2aa6
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Aug 18 02:53:17 2022 -0700

    2.14.3-1

    Add support for improved fault tolerance: non-blocking mode, new
    init function with config, and ncclCommFinalize function.
    Reintroduce collnet+chain algorithm, alongside collnet+direct.
    Add LL protocol for intra-node P2P (on by default) and network
    communication (off by default).
    Use network instead of shared memory when performance is better.
    Fix: wait for CUDA graph destroy before destroying comm with linked
    graph resources.
    Remove aggressive polling during enqueue.
    Fix DMABUF fallback on MOFED 5.4 and earlier.

commit e1d9b27
Author: Ching-Hsiang Chu <[email protected]>
Date:   Wed Aug 3 20:47:40 2022 -0700

    fix NCCL_DEBUG_FILE

    Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (NVIDIA#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, so that `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

    Differential Revision: D38415208

    fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

Pull Request resolved: #17

Test Plan:
These commands are to be applied to the whole stack rather than this specific commit:

```
buck2 run fbcode//mode/opt -c hpc_comms.use_nccl=exp fbcode//param_bench/train/comms/cpp/nccl-tests:nccl_tests_launcher -- --launcher mast --hw tc_any --nnode 2 --collective allreduce,alltoall,allgather,reducescatter --nccl-args "-b 4 -e 1G -f 2 -z 1" --dp ai_system_sw_hw_co-design_cws --entitlement codesign
```
https://www.internalfb.com/mast/job/torchx-nccl-test-allreduce-alltoall-allgather-reducescatter-3301bc

Reviewed By: agangidi53

Differential Revision: D42194038

Pulled By: spease-fb

fbshipit-source-id: e2cde6c44bcb8494c9dd02e926938f27bbc8f43b
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jun 26, 2023
* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with multiple
communicators. Here we added commId info into pidHash, so that the
'pidHash'es of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

NVIDIA/nccl#287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379: topology injection failing when using fewer GPUs than
described in the XML.
Fix #394: protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to NVIDIA/nccl#560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc19.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes NVIDIA/nccl#649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (NVIDIA/nccl#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, so that `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen
before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread that completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes NVIDIA/nccl#726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the 1-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max/min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the number of GPUs

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl was not built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple nodes

* initiate the multi-node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix the fence, proxy, and sync flag setting related issues

* unify the topo file name for different SKUs

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for the multi-node scenario

* change maxbyte to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jun 26, 2023
* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets liked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and be permit recover.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

Current SHM object name would only use pidHash and ranks as
identification, which would collide each other when program runs with
multiple communicators. Here we added commId info into pidHash, it makes
'pidHash'es of different communicators keeping in same process will be
distincted with each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_RESUEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

NVIDIA/nccl#287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to NVIDIA/nccl#560

ncclGroup's containing operations of mixed datatype, element, or collective
would induce crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
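
For reference, a hedged sketch of the new pre-multiply-sum reduction mentioned above, used here to compute an average across ranks (error handling omitted; this is an illustration, not NCCL's own code):

```c
#include <nccl.h>

/* Create a reduction that scales each rank's input by 1/nranks before the
 * inter-rank summation, i.e. an average, then use it in an allreduce. */
void allreduce_average(const float* sendbuf, float* recvbuf, size_t count,
                       int nranks, ncclComm_t comm, cudaStream_t stream) {
  float scalar = 1.0f / (float)nranks;
  ncclRedOp_t op;
  ncclRedOpCreatePreMulSum(&op, &scalar, ncclFloat, ncclScalarHostImmediate, comm);
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);  /* the op may be destroyed once its last use is enqueued */
}
```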

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
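
A hedged sketch of such a heap-free conversion (illustrative only, not the exact NCCL implementation):

```c
#include <stdint.h>

/* Accumulate the hex digits of a PCI busId such as "0000:07:00.0" into an
 * int64, skipping the ':' and '.' separators. No temporary buffers needed. */
static int64_t busIdToInt64(const char* busId) {
  int64_t id = 0;
  for (const char* p = busId; *p; p++) {
    char c = *p;
    if (c == ':' || c == '.') continue;
    if (c >= '0' && c <= '9') id = (id << 4) | (c - '0');
    else if (c >= 'a' && c <= 'f') id = (id << 4) | (c - 'a' + 10);
    else if (c >= 'A' && c <= 'F') id = (id << 4) | (c - 'A' + 10);
    else break;
  }
  return id;
}
```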

* Improve warning message about truncated messages

Display hints about the cause so that it would be easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc19.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes NVIDIA/nccl#649
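
For context, a minimal sketch of the pattern (function names are illustrative):

```c
#include <pthread.h>
#include <stddef.h>

static void* service_loop(void* arg) { (void)arg; return NULL; }

/* A worker thread that is never joined should be detached so its resources
 * are reclaimed automatically and tools like ThreadSanitizer do not report
 * it as a leaked joinable thread. */
static int start_service_thread(void) {
  pthread_t t;
  int err = pthread_create(&t, NULL, service_loop, NULL);
  if (err == 0) err = pthread_detach(t);
  return err;
}
```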

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
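
A hedged usage sketch for the new error-reporting entry point listed above, assuming the 2.13 signature `const char* ncclGetLastError(ncclComm_t comm)`:

```c
#include <stdio.h>
#include <nccl.h>

/* On failure, print both the result code's text and the last error message
 * recorded by NCCL, which may carry more detail (e.g. a remote network error). */
static void check(ncclResult_t res, ncclComm_t comm) {
  if (res != ncclSuccess) {
    fprintf(stderr, "NCCL failure: %s\nLast error: %s\n",
            ncclGetErrorString(res), ncclGetLastError(comm));
  }
}
```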

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (NVIDIA/nccl#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
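
For reference, a hedged sketch of how the two variables are typically combined; both must be set before NCCL initialization, and the path below is a placeholder (`%h`/`%p` expand to hostname and PID):

```c
#include <stdlib.h>

/* Direct NCCL INFO logging to a per-host, per-process file instead of stdout.
 * NCCL_DEBUG_FILE is only honored when NCCL_DEBUG enables logging. */
static void enable_file_logging(void) {
  setenv("NCCL_DEBUG", "INFO", 1);
  setenv("NCCL_DEBUG_FILE", "/tmp/nccl-debug.%h.%p", 1);
}
```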

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
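
A hedged sketch of the non-blocking initialization path described above (assuming the 2.14 config API: `NCCL_CONFIG_INITIALIZER`, `ncclCommInitRankConfig`, `ncclCommGetAsyncError`, `ncclCommFinalize`):

```c
#include <nccl.h>

/* Create a communicator without blocking the calling thread, then poll its
 * async state until setup completes. At teardown, ncclCommFinalize() flushes
 * outstanding work before ncclCommDestroy(). */
static ncclResult_t init_nonblocking(ncclComm_t* comm, int nranks,
                                     ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  do {
    ncclCommGetAsyncError(*comm, &res);
  } while (res == ncclInProgress);
  return res;
}
```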

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) are made
before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes NVIDIA/nccl#726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
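
As a hedged illustration, the new options above are plain fields of `ncclConfig_t` set before communicator creation (the values below are placeholders, not recommendations):

```c
#include <nccl.h>

/* Request specific kernel sizing and network selection for this communicator. */
static ncclResult_t init_with_tuning(ncclComm_t* comm, int nranks,
                                     ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.minCTAs = 4;          /* lower bound on CTAs per kernel */
  config.maxCTAs = 16;         /* upper bound on CTAs per kernel */
  config.cgaClusterSize = 2;   /* cooperative group array cluster size */
  config.netName = "Socket";   /* equivalent to NCCL_NET=Socket */
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```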

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change history comments

* resolve the build related issue

* enable MSCCL when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix  the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the LL128 shared memory issue and the one-process-with-multiple-GPUs issue

* remove the unnecessary code to enable CUDA graphs and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the GPU count

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if NCCL has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix IB bandwidth test case issue

* fix the fence, proxy, and sync flag setting related issues

* unified the topo file name for different sku

* add VMSS creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change the maxbytes to a smaller value like 65565 for the multi-node scenario

* remove test related asset

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jun 28, 2023
* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets liked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.
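
A hedged sketch of the idea (CUDA peer access is enabled only toward the ring neighbors rather than toward every device; names are illustrative):

```c
#include <cuda_runtime.h>

/* Enable P2P access only to the previous and next device in the ring,
 * instead of every pair, which can exceed the peer-access limit on systems
 * with many GPUs under one PCIe root complex. */
static void enable_ring_peer_access(int myDev, int prevDev, int nextDev) {
  cudaSetDevice(myDev);
  cudaDeviceEnablePeerAccess(prevDev, 0);
  if (nextDev != prevDev) cudaDeviceEnablePeerAccess(nextDev, 0);
}
```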

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
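
A hedged sketch of the retry pattern (the macro name and shape are illustrative, not NCCL's exact macros):

```c
#include <errno.h>

/* Repeat a system call while it fails with EINTR or EAGAIN, so transient
 * interruptions are not reported as hard errors. */
#define SYSCHECK_RETRY(call, retval) do {                             \
    (retval) = (call);                                                \
  } while ((retval) == -1 && (errno == EINTR || errno == EAGAIN))
```

For example, `SYSCHECK_RETRY(read(fd, buf, n), bytes);` keeps retrying a read that was interrupted by a signal.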

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
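
A hedged sketch of such wrappers, translating both calls to the classic "-1 plus errno" convention (illustrative, not NCCL's exact code):

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

/* posix_fallocate() returns the error number directly instead of setting
 * errno, so store it in errno and return -1 to match the usual convention. */
static int fallocateWrap(int fd, off_t offset, off_t len) {
  int err = posix_fallocate(fd, offset, len);
  if (err != 0) { errno = err; return -1; }
  return 0;
}

/* mmap() signals failure with MAP_FAILED (errno already set), so translate
 * that to a plain -1 return and pass the mapping back through a pointer. */
static int mmapWrap(void** ptr, size_t size, int prot, int flags, int fd) {
  void* p = mmap(NULL, size, prot, flags, fd, 0);
  if (p == MAP_FAILED) return -1;
  *ptr = p;
  return 0;
}
```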

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

Current SHM object name would only use pidHash and ranks as
identification, which would collide with each other when a program runs with
multiple communicators. Here we added commId info into pidHash, so that the
'pidHash'es of different communicators kept in the same process will be
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

NVIDIA/nccl#287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to NVIDIA/nccl#560

ncclGroup's containing operations of mixed datatype, element, or collective
would induce crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it would be easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc19.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes NVIDIA/nccl#649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (NVIDIA/nccl#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) are made
before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes NVIDIA/nccl#726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change history comments

* resolve the build related issue

* enable MSCCL when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix  the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the LL128 shared memory issue and the one-process-with-multiple-GPUs issue

* remove the unnecessary code to enable CUDA graphs and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the GPU count

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if NCCL has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix IB bandwidth test case issue

* fix the fence, proxy, and sync flag setting related issues

* unified the topo file name for different sku

* add VMSS creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change the maxbytes to a smaller value like 65565 for the multi-node scenario

* remove test related asset

* update the readme

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jul 10, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets liked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

Current SHM object name would only use pidHash and ranks as
identification, which would collide with each other when a program runs with
multiple communicators. Here we added commId info into pidHash, so that the
'pidHash'es of different communicators kept in the same process will be
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
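
For illustration only, a minimal sketch of how the ncclRedOpCreatePreMulSum API mentioned above can be used; the helper function and its error handling are assumptions, not code from this repository.

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Hypothetical helper: all-reduce where each rank's input is pre-multiplied
 * by a rank-specific scalar before the inter-rank summation. */
ncclResult_t scaledAllReduce(ncclComm_t comm, const float* sendbuf, float* recvbuf,
                             size_t count, float scale, cudaStream_t stream) {
  ncclRedOp_t premulSum;
  /* The scalar lives in host memory and is captured immediately. */
  ncclResult_t res = ncclRedOpCreatePreMulSum(&premulSum, &scale, ncclFloat,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, premulSum, comm, stream);
  ncclRedOpDestroy(premulSum, comm);  /* created ops must be destroyed on the same comm */
  return res;
}
```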

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649
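
A generic sketch of the pattern (not the NCCL code itself): a thread that will never be joined is detached right after creation so ThreadSanitizer does not flag it at exit.

```c
#include <pthread.h>

/* Hypothetical example; serviceLoop stands in for any fire-and-forget thread. */
static void* serviceLoop(void* arg) { (void)arg; /* ... long-running work ... */ return NULL; }

static void startService(void) {
  pthread_t t;
  if (pthread_create(&t, NULL, serviceLoop, NULL) == 0) {
    pthread_detach(t);  /* we never pthread_join() this thread */
  }
}
```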

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
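
A small, illustrative sketch of combining the new ncclGetLastError() with the existing result-code strings; the wrapper function is an assumption, not part of NCCL.

```c
#include <stdio.h>
#include <nccl.h>

/* Hypothetical helper: print both the result-code string and the last
 * human-readable NCCL error message for a communicator. */
static void reportNcclFailure(ncclResult_t res, ncclComm_t comm) {
  if (res != ncclSuccess) {
    fprintf(stderr, "NCCL call failed: %s (last error: %s)\n",
            ncclGetErrorString(res), ncclGetLastError(comm));
  }
}
```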

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
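
An illustrative sketch of the non-blocking init path and config struct introduced above, assuming the unique id, rank and nranks were already exchanged out of band; this is not code from the NCCL repository.

```c
#include <nccl.h>

ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;                       /* opt into non-blocking mode */
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  do {                                       /* poll until init completes in the background */
    ncclCommGetAsyncError(*comm, &res);
  } while (res == ncclInProgress);
  return res;  /* later, tear down with ncclCommFinalize() then ncclCommDestroy() */
}
```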

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) are made
before the last bootstrapBarrier. That way, we avoid cudaMalloc calls
being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes, whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
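
As an illustration of the new per-communicator config options listed above (the values are arbitrary examples, not tuning advice):

```c
#include <nccl.h>

/* Illustrative values only; pass &config to ncclCommInitRankConfig() as usual. */
static void exampleConfig(void) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.cgaClusterSize = 4;     /* CUDA thread-block cluster size (Hopper) */
  config.minCTAs = 4;            /* lower bound on CTAs used per kernel */
  config.maxCTAs = 16;           /* upper bound on CTAs used per kernel */
  config.netName = "Socket";     /* select a network backend by name */
  (void)config;
}
```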

* porting from rccl: Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl: Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable MSCCL when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the LL128 shared memory issue and the one-process-with-multiple-GPUs issue

* remove the unnecessary code to enable CUDA graphs and add perf test scenarios

* fix fp8 issue

* fix the max/min mismatch issue and the LL128 shared memory issue

* turn off ops when the op index is equal to or larger than the average

* optimize the test script to accommodate the number of GPUs

* support perf test env initialization

* fix environment preparation bug

* fix environment preparation bug

* enable auto build if NCCL has not been built before

* add customized topo and graph file support during tests

* enable NDv5 test scenarios

* enable CUDA graphs for multiple nodes

* initiate the multi-node test

* enable NCv4 compatibility

* fix multi-node test issue

* enable multi-node for the NDv4 test scenario

* fix IB bandwidth test case issue

* fix issues related to setting the fence, proxy, and sync flags

* unify the topo file name for different SKUs

* add VMSS creation script

* fix test case issue for the multi-node scenario

* change the algo for the multi-node scenario

* change maxbytes to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#2)

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg-faulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
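
A generic sketch of the retry idea; the macro name is hypothetical and this is not the actual SYSCHECK definition.

```c
#include <errno.h>

/* Retry a syscall-style expression while it fails with EINTR or EAGAIN.
 * Usage: int n; RETRY_SYSCALL(n, read(fd, buf, len)); */
#define RETRY_SYSCALL(ret, call) do {                                   \
    (ret) = (call);                                                     \
  } while ((ret) == -1 && (errno == EINTR || errno == EAGAIN))
```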

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators living in the same process are
kept distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
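
A sketch of the conditional use described above (illustrative only; the actual NCCL code differs):

```c
#include <sys/socket.h>

/* Only set SO_REUSEPORT when the headers define it (Linux 3.9+); on older
 * systems the code still compiles and simply skips the option. */
static void maybeSetReusePort(int fd) {
  (void)fd;
#if defined(SO_REUSEPORT)
  int opt = 1;
  setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#endif
}
```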

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort when a Flush operation fails in the
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issues in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when the class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make could kick off source
compilation before nccl.h was generated, which led to occasional build
failures on systems with high core counts. The build failure could be
reproduced reliably with a `sleep 5` in the $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Set the type when a GPU sub-node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) are made
before the last bootstrapBarrier. That way, we avoid cudaMalloc calls
being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes, whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl: Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl: Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable MSCCL when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the LL128 shared memory issue and the one-process-with-multiple-GPUs issue

* remove the unnecessary code to enable CUDA graphs and add perf test scenarios

* fix fp8 issue

* fix the max/min mismatch issue and the LL128 shared memory issue

* turn off ops when the op index is equal to or larger than the average

* optimize the test script to accommodate the number of GPUs

* support perf test env initialization

* fix environment preparation bug

* fix environment preparation bug

* enable auto build if NCCL has not been built before

* add customized topo and graph file support during tests

* enable NDv5 test scenarios

* enable CUDA graphs for multiple nodes

* initiate the multi-node test

* enable NCv4 compatibility

* fix multi-node test issue

* enable multi-node for the NDv4 test scenario

* fix IB bandwidth test case issue

* fix issues related to setting the fence, proxy, and sync flags

* unify the topo file name for different SKUs

* add VMSS creation script

* fix test case issue for the multi-node scenario

* change the algo for the multi-node scenario

* change maxbytes to a smaller value like 65565 for the multi-node scenario

* remove test-related assets

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#4)

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg-faulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators living in the same process are
kept distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add N…
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jul 10, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg-faulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.



* Fix share memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators living in the same process are
kept distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).
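
The general fix pattern, sketched here with illustrative names rather than the real ncclStrToCpuset code, is to copy the incoming bytes into a bounded buffer and write the terminator before any strlen()/parsing happens:

```c
#include <stdio.h>
#include <string.h>

// Copy a possibly unterminated source into a bounded buffer and guarantee the
// terminator, so later strlen()/strtok() calls cannot read out of bounds.
static void copyAffinityString(char* dst, size_t dstLen, const char* src, size_t srcLen) {
  size_t n = srcLen < dstLen - 1 ? srcLen : dstLen - 1;
  memcpy(dst, src, n);
  dst[n] = '\0';
}

int main(void) {
  char raw[4] = {'0', '-', '3', 'X'};   // deliberately not null-terminated
  char affinityStr[64];
  copyAffinityString(affinityStr, sizeof(affinityStr), raw, 3);
  printf("cpuset string: \"%s\" (len %zu)\n", affinityStr, strlen(affinityStr));
  return 0;
}
```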



* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
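
A sketch of the compile-time guard this describes, with an illustrative function name; the setsockopt() call is only compiled in when the headers define SO_REUSEPORT:

```c
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

// Only request port reuse when the platform defines SO_REUSEPORT (Linux 3.9+).
static int openBootstrapSocket(void) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  int one = 1;
#if defined(SO_REUSEPORT)
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) != 0)
    perror("setsockopt(SO_REUSEPORT)");
#else
  (void)one;  // older headers: skip the option entirely
#endif
  return fd;
}

int main(void) {
  int fd = openBootstrapSocket();
  if (fd >= 0) close(fd);
  return fd >= 0 ? 0 : 1;
}
```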

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.
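
Putting the two clang fixes together, one way to wrap the compiler difference in a macro (a sketch, not NCCL's exact macro name) is:

```c
#include <stdio.h>

// GCC spells the per-function "don't optimize" attribute optimize("O0"),
// clang spells it optnone; wrap the difference once.
#if defined(__clang__)
#define NO_OPTIMIZE __attribute__((optnone))
#else
#define NO_OPTIMIZE __attribute__((optimize("O0")))
#endif

NO_OPTIMIZE static int keepAsWritten(int x) {
  // Body stays un-optimized, which can matter for timing loops or debugging.
  int acc = 0;
  for (int i = 0; i < x; i++) acc += i;
  return acc;
}

int main(void) { printf("%d\n", keepAsWritten(10)); return 0; }
```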

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.



* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use, causing
performance issues in cases where we use different NICs on a given
channel.

* Fix wrong variable name: rename "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors



* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum; a usage sketch follows this list).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
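
A usage sketch for ncclRedOpCreatePreMulSum, assuming a communicator and stream already exist; scaling by 1/nranks while summing turns the allreduce into an average:

```c
#include <cuda_runtime.h>
#include <nccl.h>

// Sketch: scale local contributions by 1/nranks while summing. Assumes
// `comm`, `stream`, `sendbuf`, `recvbuf` and `count` were set up elsewhere.
ncclResult_t allreduceAverage(const float* sendbuf, float* recvbuf, size_t count,
                              ncclComm_t comm, cudaStream_t stream) {
  int nranks;
  ncclResult_t ret = ncclCommCount(comm, &nranks);
  if (ret != ncclSuccess) return ret;

  float scale = 1.0f / (float)nranks;
  ncclRedOp_t premulSum;
  ret = ncclRedOpCreatePreMulSum(&premulSum, &scale, ncclFloat, ncclScalarHostImmediate, comm);
  if (ret != ncclSuccess) return ret;

  ret = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, premulSum, comm, stream);

  // The op may be destroyed once the last collective using it has been enqueued.
  ncclRedOpDestroy(premulSum, comm);
  return ret;
}
```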

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
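
For illustration, a stack-only conversion can simply fold the hex digits of a busId string such as "0000:3b:00.0" into a 64-bit value while skipping the separators (sketch with a hypothetical name):

```c
#include <stdint.h>
#include <stdio.h>

// Pack a PCI busId string into a 64-bit integer by treating its hex digits as
// nibbles and skipping ':' and '.'. No heap use, so it stays cheap even when
// called from hot debug/logging paths.
static int64_t busIdToInt64Sketch(const char* busId) {
  int64_t value = 0;
  for (const char* p = busId; *p != '\0'; p++) {
    char c = *p;
    int nibble;
    if (c >= '0' && c <= '9') nibble = c - '0';
    else if (c >= 'a' && c <= 'f') nibble = 10 + (c - 'a');
    else if (c >= 'A' && c <= 'F') nibble = 10 + (c - 'A');
    else continue;  // skip separators
    value = (value << 4) | nibble;
  }
  return value;
}

int main(void) {
  printf("%llx\n", (unsigned long long)busIdToInt64Sketch("0000:3b:00.0"));
  return 0;
}
```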

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Setting this env to 1 disables the NET transport for intra-node communication.
It provides an option to error out instead of falling back to NET when superior
intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach() calls for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649
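
A minimal example of the pattern (not NCCL's actual service threads): detach a fire-and-forget thread right after creating it so its resources are reclaimed without a join and sanitizers stop reporting it:

```c
#include <pthread.h>
#include <unistd.h>

// A background thread we never intend to join: detach it immediately so its
// resources are reclaimed automatically when it exits.
static void* serviceLoop(void* arg) {
  (void)arg;
  // ... poll sockets, progress proxies, etc. ...
  return NULL;
}

int main(void) {
  pthread_t tid;
  if (pthread_create(&tid, NULL, serviceLoop, NULL) != 0) return 1;
  pthread_detach(tid);   // instead of a pthread_join() that never happens
  sleep(1);              // give the detached thread a moment before exiting
  return 0;
}
```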

* Remove unnecessary newline in plugin logging



* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function (usage sketch after this list).
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
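
A usage sketch for ncclGetLastError(), wrapping an NCCL call in a hypothetical checking macro so failures print both the generic error string and the more detailed last-error text:

```c
#include <cuda_runtime.h>
#include <nccl.h>
#include <stdio.h>

// On failure, print the generic error string plus the detailed message from
// ncclGetLastError() (added in 2.13). Macro name is illustrative.
#define NCCL_CHECK_SKETCH(comm, call) do {                                   \
    ncclResult_t res = (call);                                               \
    if (res != ncclSuccess) {                                                \
      fprintf(stderr, "NCCL failure: %s\nDetails: %s\n",                     \
              ncclGetErrorString(res), ncclGetLastError(comm));              \
      return res;                                                            \
    }                                                                        \
  } while (0)

ncclResult_t broadcastFromRoot(float* buf, size_t count, int root,
                               ncclComm_t comm, cudaStream_t stream) {
  NCCL_CHECK_SKETCH(comm, ncclBroadcast(buf, buf, count, ncclFloat, root, comm, stream));
  return ncclSuccess;
}
```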

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function (sketch after this list).
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
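
A sketch of the non-blocking initialization path, assuming the unique id, rank and size come from the usual bootstrap (e.g. an MPI broadcast of ncclGetUniqueId); the helper name is illustrative:

```c
#include <nccl.h>

// Create a communicator in non-blocking mode and poll for completion, as
// allowed by the config-based init added in 2.14.
ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  // return immediately, finish initialization asynchronously

  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;

  // Poll until the communicator is ready (or an error surfaces).
  ncclResult_t state = ncclInProgress;
  while (state == ncclInProgress) {
    ncclCommGetAsyncError(*comm, &state);
  }
  return state;
}
```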

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen
before the last bootstrapBarrier. That way, we avoid cudaMalloc calls
being blocked by an NCCL kernel launched on another GPU by another
thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.



* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl: Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl: Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the 1-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the number of GPUs

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl was not built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple nodes

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix the fence, proxy, and sync flags setting related issues

* unify the topo file name for different SKUs

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change maxbyte to a smaller value like 65565 for the multi node scenario

---------

* Enable msccl capability (#2)

* Enable msccl capability (#4)

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jul 21, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets liked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
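
A minimal sketch of the retry idea, assuming the usual errno convention (the macro name is illustrative, not NCCL's actual SYSCHECK/SYSCHECKVAL): wrap the call in a loop so an EINTR/EAGAIN return is retried instead of silently becoming a hard error.

```c
#include <errno.h>

/* Illustrative only: retry the wrapped call while it fails with EINTR or
 * EAGAIN, so transient interruptions are not reported as failures.
 * Usage: ssize_t n; RETRY_SYSCALL(read(fd, buf, len), n); */
#define RETRY_SYSCALL(statement, retval) do {                         \
    (retval) = (statement);                                           \
  } while ((retval) == -1 && (errno == EINTR || errno == EAGAIN))
```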

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
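
For illustration, a hedged sketch of what such wrappers can look like (the function names are hypothetical): posix_fallocate() returns the error number directly instead of setting errno, and mmap() signals failure with MAP_FAILED, so both are adapted to the usual "-1 and errno" convention expected by SYSCHECK-style macros.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>

/* Hypothetical wrapper: posix_fallocate() returns its error code directly. */
static int fallocateWrap(int fd, off_t offset, off_t len) {
  int err = posix_fallocate(fd, offset, len);
  if (err != 0) { errno = err; return -1; }
  return 0;
}

/* Hypothetical wrapper: mmap() reports failure through MAP_FAILED. */
static int mmapWrap(void** ptr, size_t size, int prot, int flags, int fd, off_t off) {
  void* p = mmap(NULL, size, prot, flags, fd, off);
  if (p == MAP_FAILED) return -1;   /* errno is already set by mmap */
  *ptr = p;
  return 0;
}
```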

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. We now add commId info into pidHash, so that
the pidHashes of different communicators living in the same process
remain distinct.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>
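
As an illustration of the defensive pattern (a sketch, not the actual patch): copy the string into a bounded buffer and null-terminate it before any strlen()/parsing, so a missing terminator in the source cannot cause an out-of-bounds read.

```c
#include <string.h>

/* Sketch only; buffer size and names are illustrative. */
static void copyAffinityString(const char* in, char* out, size_t outSize) {
  strncpy(out, in, outSize - 1);
  out[outSize - 1] = '\0';   /* guarantee termination before strlen()/parsing */
}
```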

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
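
A minimal sketch of the conditional (the helper name is illustrative): only reference SO_REUSEPORT when the headers define it, so the same code still builds against pre-3.9 kernels.

```c
#include <sys/socket.h>

/* Illustrative helper: set SO_REUSEPORT when available, otherwise skip it
 * (it is only needed when a fixed bootstrap address is specified via env). */
static int trySetReusePort(int fd) {
#if defined(SO_REUSEPORT)
  int opt = 1;
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#else
  (void)fd;
  return 0;
#endif
}
```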

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.
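
Putting the two clang fixes together, a sketch of the macro approach (the macro name is illustrative, not NCCL's): clang spells the attribute `optnone`, while GCC uses `optimize("O0")`.

```c
/* Illustrative macro hiding the compiler difference. */
#if defined(__clang__)
#define NOOPT_ATTR __attribute__((optnone))
#else
#define NOOPT_ATTR __attribute__((optimize("O0")))
#endif

/* Example: keep this function unoptimized under both compilers. */
NOOPT_ATTR static int addOne(int x) { return x + 1; }
```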

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issues in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when the class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h is generated, which leads to occasional build
failures on systems with high core counts. The build failure could be
reproduced reliably with a `sleep 5` in the $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum); a usage sketch follows this list.
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
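
A minimal usage sketch for ncclRedOpCreatePreMulSum, assuming NCCL >= 2.11 (the helper function is illustrative): create the custom op, use it like any built-in reduction, then release it.

```c
#include <nccl.h>

/* Illustrative helper: allreduce where each rank's input is first multiplied
 * by a rank-specific scalar, then summed across ranks. */
static ncclResult_t premulAllReduce(const float* sendbuf, float* recvbuf, size_t count,
                                    float scale, ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t op;
  ncclResult_t res = ncclRedOpCreatePreMulSum(&op, &scale, ncclFloat,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);   /* custom ops are per-communicator and must be released */
  return res;
}
```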

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
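
A sketch of a stack-only conversion (illustrative, not the exact NCCL code): fold the hex digits of a PCI bus id such as "0000:07:00.0" directly into a 64-bit value, skipping the separators, with no heap allocation.

```c
#include <stdint.h>

/* Illustrative stack-only variant: no strdup/malloc, just digit folding. */
static int64_t busIdToInt64Sketch(const char* busId) {
  int64_t v = 0;
  for (const char* p = busId; *p; p++) {
    char c = *p;
    if (c >= '0' && c <= '9')      v = (v << 4) | (int64_t)(c - '0');
    else if (c >= 'a' && c <= 'f') v = (v << 4) | (int64_t)(c - 'a' + 10);
    else if (c >= 'A' && c <= 'F') v = (v << 4) | (int64_t)(c - 'A' + 10);
    /* ':' and '.' separators are simply skipped */
  }
  return v;
}
```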

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Setting this env to 1 disables the NET transport for intra-node communication.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function (a usage sketch follows this list).
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
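
A minimal sketch of using the new ncclGetLastError() call, assuming NCCL >= 2.13 (the helper is illustrative): when a call fails, fetch the human-readable description of the last error recorded by NCCL.

```c
#include <stdio.h>
#include <nccl.h>

/* Illustrative helper: run an allreduce and print NCCL's last-error string on failure. */
static ncclResult_t allReduceOrReport(const void* sendbuf, void* recvbuf, size_t count,
                                      ncclComm_t comm, cudaStream_t stream) {
  ncclResult_t res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
  if (res != ncclSuccess)
    fprintf(stderr, "NCCL failure %d: %s\n", (int)res, ncclGetLastError(comm));
  return res;
}
```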

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function (a sketch follows this list).
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
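
A minimal sketch of the non-blocking init path, assuming NCCL >= 2.14 (the helper and the polling loop are illustrative): init returns immediately and completion is polled through ncclCommGetAsyncError().

```c
#include <nccl.h>

/* Illustrative helper: create a communicator in non-blocking mode and wait for
 * the asynchronous initialization to finish. A real application could do other
 * work, or call ncclCommAbort(), instead of spinning here. */
static ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, int rank, ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;                       /* do not block inside init */
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  do {                                       /* poll until init completes or fails */
    ncclCommGetAsyncError(*comm, &res);
  } while (res == ncclInProgress);
  return res;
}
```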

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName (a sketch follows this list).
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
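
A minimal sketch of the new config options, assuming NCCL >= 2.17 (the helper and the values are purely illustrative, not recommendations):

```c
#include <nccl.h>

/* Illustrative helper: create a communicator with the 2.17 tuning knobs set. */
static ncclResult_t initTuned(ncclComm_t* comm, int nranks, int rank, ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.cgaClusterSize = 4;    /* CGA cluster size used by NCCL kernels */
  config.minCTAs = 4;           /* lower bound on CTAs per NCCL kernel */
  config.maxCTAs = 32;          /* upper bound on CTAs per NCCL kernel */
  config.netName = "IB";        /* pick a network backend by name (like NCCL_NET) */
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```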

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix  the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the single-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the ops index is equal to or larger than avg

* optimize the test script to accommodate the gpu numbers

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl was not built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix the fence, proxy, and sync flags setting related issues

* unified the topo file name for different sku

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change the maxbytes to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#2)

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets liked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery (see the sketch below).
Detect initial CPU affinity and no longer escape it.
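
A minimal sketch of how an application could use the two new calls, assuming a hypothetical checkCommHealth() helper; real recovery logic would also recreate the communicator afterwards.

```c
#include <nccl.h>

/* Sketch: poll for asynchronous (e.g. network) errors and abort the
 * communicator so its resources can be freed and it can be recreated. */
static int checkCommHealth(ncclComm_t comm) {
  ncclResult_t asyncErr;
  if (ncclCommGetAsyncError(comm, &asyncErr) != ncclSuccess) return -1;
  if (asyncErr != ncclSuccess) {
    ncclCommAbort(comm);   /* frees resources without waiting for peers */
    return -1;             /* caller should recreate the communicator   */
  }
  return 0;
}
```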

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as identification,
which could collide when a program runs with multiple communicators. We now
add commId info into pidHash, so the 'pidHash'es of different communicators
within the same process are distinct from each other.
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jul 26, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as identification,
which could collide when a program runs with multiple communicators. We now
add commId info into pidHash, so the 'pidHash'es of different communicators
within the same process are distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
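
A sketch of the kind of conditional use this refers to, with an illustrative helper name; on systems without SO_REUSEPORT the option is simply skipped.

```c
#include <sys/socket.h>

/* Sketch: only set SO_REUSEPORT when the headers define it, so the code
 * still builds on kernels/distributions older than Linux 3.9. */
static int setReusePortSketch(int fd) {
#ifdef SO_REUSEPORT
  int opt = 1;
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#else
  (void)fd;
  return 0;  /* option unavailable; only needed for a fixed bootstrap port */
#endif
}
```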

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
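
A short usage sketch of ncclRedOpCreatePreMulSum for averaging across ranks; the wrapper function and its arguments are assumptions for illustration only.

```c
#include <nccl.h>

/* Sketch: average across ranks by scaling each rank's input by 1/nranks
 * before the inter-rank summation. */
static ncclResult_t allreduceAverage(const float* sendbuff, float* recvbuff,
                                     size_t count, int nranks,
                                     ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t avgOp;
  float scale = 1.0f / nranks;
  ncclResult_t res = ncclRedOpCreatePreMulSum(&avgOp, &scale, ncclFloat32,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuff, recvbuff, count, ncclFloat32, avgOp,
                      comm, stream);
  ncclRedOpDestroy(avgOp, comm);  /* safe once the operation is enqueued */
  return res;
}
```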

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
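
Purely to illustrate the point about avoiding heap allocations, a stack-only conversion of a PCI busId string such as "0000:3b:00.0" could look like the sketch below (not the actual NCCL code).

```c
#include <stdint.h>

/* Sketch: pack the hex digits of a busId like "0000:3b:00.0" into an
 * int64_t, skipping separators, using only stack storage. */
static int64_t busIdToInt64Sketch(const char* busId) {
  int64_t v = 0;
  for (const char* p = busId; *p != '\0'; p++) {
    char c = *p;
    if (c == ':' || c == '.') continue;            /* skip separators */
    int d;
    if (c >= '0' && c <= '9') d = c - '0';
    else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
    else break;                                    /* stop at unexpected char */
    v = (v << 4) | d;
  }
  return v;
}
```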

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
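
Regarding the new ncclGetLastError() above, a small sketch of how an application-side checking macro might surface the detailed message; the macro name is illustrative and the enclosing function is assumed to return ncclResult_t.

```c
#include <stdio.h>
#include <nccl.h>

/* Sketch: print both the generic result string and the more detailed
 * last-error text added in 2.13, then propagate the error. */
#define CHECK_NCCL(comm, call) do {                                  \
  ncclResult_t res_ = (call);                                        \
  if (res_ != ncclSuccess) {                                         \
    fprintf(stderr, "NCCL failure %s: %s\n",                         \
            ncclGetErrorString(res_), ncclGetLastError(comm));       \
    return res_;                                                     \
  }                                                                  \
} while (0)
```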

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
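
A rough sketch of the non-blocking initialization flow mentioned above, assuming `nranks`, `rank` and a bootstrapped `id` are already available to the caller.

```c
#include <nccl.h>

/* Sketch: initialize a communicator in non-blocking mode and poll for
 * completion with ncclCommGetAsyncError(). */
static ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, int rank,
                                    ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  /* return immediately, finish init asynchronously */
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  do {
    ncclCommGetAsyncError(*comm, &res);
  } while (res == ncclInProgress);
  return res;
}
```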

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen before
the last bootstrapBarrier. That way, we avoid cudaMalloc calls being
blocked by an NCCL kernel launched on another GPU by another thread that
completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
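
A sketch of passing the new config options at communicator creation; the field values below are arbitrary examples, not recommendations.

```c
#include <nccl.h>

/* Sketch: request specific CTA and CGA settings plus a network name via
 * the communicator config (fields added in 2.17). */
static ncclResult_t initWithTuning(ncclComm_t* comm, int nranks, int rank,
                                   ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.cgaClusterSize = 4;   /* example value */
  config.minCTAs = 8;          /* example value */
  config.maxCTAs = 32;         /* example value */
  config.netName = "Socket";   /* example: force the socket transport */
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```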

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change history comments

* resolve the build related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in simple protocol

* fix the ll128 shared memory issue and the 1-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the gpu numbers

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl was not built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix the fence, proxy and sync flags setting related issues

* unified the topo file name for different sku

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change the maxbyte to a smaller value like 65565 for multi node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#5)

* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as identification,
which could collide when a program runs with multiple communicators. We now
add commId info into pidHash, so the 'pidHash'es of different communicators
within the same process are distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen before
the last bootstrapBarrier. That way, we avoid cudaMalloc calls being
blocked by an NCCL kernel launched on another GPU by another thread that
completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change history comments

* resolve the build related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in simple protocol

* fix the ll128 shared memory issue and the 1-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the gpu numbers

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl was not built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix the fence, proxy and sync flags setting related issues

* unified the topo file name for different sku

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change the maxbyte to a smaller value like 65565 for multi node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#2)

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.
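
A minimal sketch (illustrative, not part of this change) of how an application can combine the two new calls to detect an asynchronous network error and abort the communicator instead of hanging; the helper name and error-handling policy are assumptions:

```c
#include <stdio.h>
#include <nccl.h>

/* Poll a communicator for an asynchronous (e.g. network) error and abort it
 * rather than letting pending collectives hang. Returns 0 if healthy. */
static int pollCommForErrors(ncclComm_t comm) {
  ncclResult_t asyncErr;
  if (ncclCommGetAsyncError(comm, &asyncErr) != ncclSuccess) return -1;
  if (asyncErr != ncclSuccess) {
    fprintf(stderr, "NCCL async error: %s\n", ncclGetErrorString(asyncErr));
    ncclCommAbort(comm);  /* frees resources without waiting for pending operations */
    return -1;
  }
  return 0;
}
```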

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with multiple
communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators kept in the same process are
distinct from each other.
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jul 27, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with multiple
communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.
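
As a hedged illustration of the bfloat16 and ncclAvg additions above, an averaging allreduce might look like this (buffers, communicator and stream are assumed to be set up elsewhere):

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Averaging allreduce on bfloat16 data: ncclAvg sums across ranks and then
 * divides the result by the number of ranks. */
ncclResult_t allreduceAvgBf16(const void* sendbuff, void* recvbuff, size_t count,
                              ncclComm_t comm, cudaStream_t stream) {
  return ncclAllReduce(sendbuff, recvbuff, count, ncclBfloat16, ncclAvg, comm, stream);
}
```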

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element count, or
collective type could induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
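
A hedged sketch of the new reduction-op API: create a PreMulSum op that scales the local contribution before the inter-rank sum, use it in an allreduce, then destroy it. The wrapper function and the choice of host-immediate scalar residence are illustrative, not prescribed by the release:

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Allreduce where each rank's input is first multiplied by `scale`. */
ncclResult_t scaledAllReduce(const float* sendbuff, float* recvbuff, size_t count,
                             float scale, ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t op;
  /* ncclScalarHostImmediate: the scalar is read from host memory at enqueue time. */
  ncclResult_t res = ncclRedOpCreatePreMulSum(&op, &scale, ncclFloat,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);  /* user-created ops must be destroyed on the same comm */
  return res;
}
```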

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
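
A small illustrative sketch of how the new error reporting could be consumed; the reporting helper itself is an assumption, not part of the release:

```c
#include <stdio.h>
#include <nccl.h>

/* Print the generic error string plus the detailed message from
 * ncclGetLastError(), and flag remote (network/peer) failures. */
void reportNcclFailure(ncclComm_t comm, ncclResult_t res) {
  if (res == ncclSuccess) return;
  fprintf(stderr, "NCCL failure: %s\n", ncclGetErrorString(res));
  fprintf(stderr, "Last error detail: %s\n", ncclGetLastError(comm));
  if (res == ncclRemoteError)
    fprintf(stderr, "The error was reported by a remote peer or the network.\n");
}
```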

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior)

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
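
A hedged sketch of the non-blocking initialization path (the bootstrap of commId, rank and nranks is assumed to happen as usual, e.g. via ncclGetUniqueId plus an out-of-band broadcast):

```c
#include <nccl.h>

/* Non-blocking communicator creation: start init, then poll until it leaves
 * the ncclInProgress state. Finalize/destroy later follow the same pattern. */
ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, ncclUniqueId commId, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  /* return immediately instead of waiting for completion */
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, commId, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  do {  /* other work could be interleaved here */
    if (ncclCommGetAsyncError(*comm, &res) != ncclSuccess) return ncclInternalError;
  } while (res == ncclInProgress);
  return res;
}
```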

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls that invoke cudaMalloc (including devCommSetup) are
made before the last bootstrapBarrier. That way, we avoid cudaMalloc
calls being blocked by an NCCL kernel launched on another GPU by
another thread that completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
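
A brief, hedged sketch of filling in the new config options on the ncclConfig_t passed to ncclCommInitRankConfig; the specific values are placeholders, not tuning recommendations:

```c
#include <nccl.h>

/* Build a config using the new 2.17 knobs. */
ncclConfig_t makeTunedConfig(void) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.cgaClusterSize = 4;   /* cooperative group array (CGA) cluster size */
  config.minCTAs = 4;          /* lower bound on CTAs (channels) per kernel */
  config.maxCTAs = 32;         /* upper bound on CTAs per kernel */
  config.netName = "Socket";   /* pin the network, like the NCCL_NET env var */
  return config;
}
```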

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug that failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the single-process multi-GPU issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the gpu numbers

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple nodes

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for the ndv4 test scenario

* fix ib bandwidth test case issue

* fix issues related to setting the fence, proxy, and sync flags

* unified the topo file name for different skus

* add vmss creation script

* fix test case issue for the multi-node scenario

* change the algo for the multi-node scenario

* change maxbytes to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#5)

* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with multiple
communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element count, or
collective type could induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior)

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls that invoke cudaMalloc (including devCommSetup) are
made before the last bootstrapBarrier. That way, we avoid cudaMalloc
calls being blocked by an NCCL kernel launched on another GPU by
another thread that completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
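
A sketch of how the new per-communicator config options might be set at creation time; the particular values are placeholders, and only cgaClusterSize, minCTAs, maxCTAs and netName are the fields added here.

```c
/* Passing the new 2.17 config fields at init time (values are placeholders). */
#include <nccl.h>

ncclResult_t initWithTuning(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.cgaClusterSize = 4;   /* cooperative group array cluster size */
  config.minCTAs = 8;          /* lower bound on CTAs per NCCL kernel */
  config.maxCTAs = 32;         /* upper bound on CTAs per NCCL kernel */
  config.netName = "IB";       /* per-communicator equivalent of NCCL_NET=IB */
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```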

* porting from rccl: Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl: Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the single-process multi-GPU issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max/min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the GPU count

* support perf test env initialization

* fix environment preparation bug

* fix environment preparation bug

* enable auto build if nccl has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple nodes

* initiate the multi-node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix issues related to setting the fence, proxy and sync flags

* unify the topo file name across different SKUs

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi-node scenario

* change maxbytes to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#2)

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)
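
A sketch of the idea behind that change: combine the hostname with the UTS and mount namespace links so two containers on the same machine do not hash to the same host. The hash function, buffer sizes and names here are illustrative, not the actual NCCL helper.

```c
/* Host hash that also distinguishes containers via namespace links (sketch). */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static uint64_t getHostHashSketch(void) {
  char buf[1024] = {0};
  size_t off = (gethostname(buf, sizeof(buf) - 1) == 0) ? strlen(buf) : 0;
  /* Append the namespace identifiers, e.g. "uts:[4026531838]". */
  ssize_t n = readlink("/proc/self/ns/uts", buf + off, sizeof(buf) - off - 1);
  if (n > 0) off += (size_t)n;
  n = readlink("/proc/self/ns/mnt", buf + off, sizeof(buf) - off - 1);
  if (n > 0) off += (size_t)n;
  uint64_t h = 0xcbf29ce484222325ULL;                 /* FNV-1a over the combined string */
  for (size_t i = 0; i < off; i++) { h ^= (unsigned char)buf[i]; h *= 0x100000001b3ULL; }
  return h;
}
```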

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
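
A minimal sketch of the retry idea behind these macros, assuming a POSIX call that returns -1 and sets errno; this is illustrative and not the actual SYSCHECK definition.

```c
/* Retry a syscall-style expression while it fails with EINTR or EAGAIN (sketch). */
#include <errno.h>

#define RETRY_SYSCALL(ret, call) do {                         \
    (ret) = (call);                                           \
  } while ((ret) == -1 && (errno == EINTR || errno == EAGAIN))

/* Example: a read() that is restarted on transient failures.
   ssize_t n; RETRY_SYSCALL(n, read(fd, buf, len)); */
```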

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
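
A sketch of wrappers in the spirit described above: posix_fallocate() reports its error as the return value rather than through errno, and mmap() signals failure with MAP_FAILED, so neither fits a macro that only checks -1/errno. Names are illustrative.

```c
/* Normalize posix_fallocate()/mmap() error reporting to the -1/errno convention (sketch). */
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>

static int shmFallocate(int fd, off_t len) {
  int err;
  do { err = posix_fallocate(fd, 0, len); } while (err == EINTR);
  if (err != 0) { errno = err; return -1; }   /* error comes back as the return value */
  return 0;
}

static void* shmMap(int fd, size_t len) {
  void* ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  return (ptr == MAP_FAILED) ? NULL : ptr;    /* mmap reports failure via MAP_FAILED */
}
```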

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with
multiple communicators. Here we added commId info into pidHash, so that
the 'pidHash'es of different communicators kept in the same process are
distinct from each other.
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Aug 29, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with
multiple communicators. Here we added commId info into pidHash, so that
the 'pidHash'es of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>
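
A sketch of the general fix pattern for this class of bug: bound the copy and terminate explicitly before any strlen()/parsing. The names are illustrative, not the actual ncclStrToCpuset code.

```c
/* Copy an affinity string into a fixed buffer with guaranteed NUL termination (sketch). */
#include <string.h>

static void copyAffinityStr(char* dst, size_t dstLen, const char* src, size_t srcLen) {
  if (dstLen == 0) return;
  size_t n = (srcLen < dstLen - 1) ? srcLen : dstLen - 1;
  memcpy(dst, src, n);
  dst[n] = '\0';          /* now safe to pass to strlen(3) and parsing code */
}
```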

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization
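
A one-line sketch of the rounding this implies; the constant and helper name are illustrative.

```c
/* Round a buffer size up to the next multiple of 2MB (sketch). */
#include <stddef.h>

#define IPC_ALIGN (2UL * 1024 * 1024)
static size_t roundUpTo2MB(size_t bytes) {
  return (bytes + IPC_ALIGN - 1) & ~(IPC_ALIGN - 1);
}
```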

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
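
A sketch of the compile-time guard this change implies, with the socket setup reduced to the relevant call; the function and variable names are illustrative.

```c
/* Only request port reuse when the platform defines SO_REUSEPORT (sketch). */
#include <sys/socket.h>

static int enablePortReuse(int fd) {
  int opt = 1;
#ifdef SO_REUSEPORT
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#else
  (void)fd; (void)opt;   /* older kernels/headers: silently skip */
  return 0;
#endif
}
```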

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
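
A sketch of the ncclRedOpCreatePreMulSum API mentioned above, scaling each rank's contribution by a host-side scalar before the inter-rank sum; buffer setup and error checking are omitted and the scalar value is just an example.

```c
/* Premultiplied sum: each rank's input is scaled by `scale` before summation (sketch). */
#include <nccl.h>

void scaledAllReduce(const float* sendbuff, float* recvbuff, size_t count,
                     float scale, ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t op;
  /* ncclScalarHostImmediate: the scalar is read from host memory at call time. */
  ncclRedOpCreatePreMulSum(&op, &scale, ncclFloat, ncclScalarHostImmediate, comm);
  ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);  /* user-created reduction ops must be released */
}
```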

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
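
A sketch of a heap-free conversion in the spirit of that change, parsing the hex digits of a PCI bus id such as "0000:65:00.0" directly on the stack; this is illustrative, not the exact NCCL helper.

```c
/* Convert a PCI bus id string to an integer without allocating (sketch). */
#include <ctype.h>
#include <stdint.h>

static int64_t busIdToInt64Sketch(const char* busId) {
  int64_t value = 0;
  for (const char* p = busId; *p; p++) {
    char c = (char)tolower((unsigned char)*p);
    if (c >= '0' && c <= '9')      value = (value << 4) | (c - '0');
    else if (c >= 'a' && c <= 'f') value = (value << 4) | (c - 'a' + 10);
    /* separators ':' and '.' are simply skipped */
  }
  return value;
}
```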

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves parsing `tempNcclDebugLevel` before processing `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior)

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc be blocked by a NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl: Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl: Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the single-process multi-GPU issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max/min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the GPU count

* support perf test env initialization

* fix environment preparation bug

* fix environment preparation bug

* enable auto build if nccl has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple nodes

* initiate the multi-node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix issues related to setting the fence, proxy and sync flags

* unify the topo file name across different SKUs

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi-node scenario

* change maxbytes to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#5)

* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with
multiple communicators. Here we added commId info into pidHash, so that
the 'pidHash'es of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element count, or
collective type would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
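
A minimal usage sketch of the ncclRedOpCreatePreMulSum API mentioned above (the helper function, buffers and stream are assumptions, not part of this release):

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: build a custom reduction that multiplies each rank's input by
// 1/nranks before the inter-rank summation, i.e. an allreduce average.
// Assumes comm, sendbuf, recvbuf, count, nranks and stream already exist.
ncclResult_t allreduceAverage(ncclComm_t comm, const float* sendbuf, float* recvbuf,
                              size_t count, int nranks, cudaStream_t stream) {
  float scalar = 1.0f / (float)nranks;
  ncclRedOp_t op;
  ncclResult_t res = ncclRedOpCreatePreMulSum(&op, &scalar, ncclFloat,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);  // safe to destroy once the operation is enqueued
  return res;
}
```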

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
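
A minimal sketch of the idea, assuming the usual PCI bus id string format ("0000:3b:00.0"); the function name mirrors the commit message but the body is illustrative only:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Strip ':' and '.' into a small stack buffer and parse the remaining hex
// digits, so no heap allocation shows up under NCCL_DEBUG_SUBSYS=ALLOC even
// when this runs thousands of times.
static int64_t busIdToInt64Sketch(const char* busId) {
  char hex[17];
  int n = 0;
  for (const char* p = busId; *p != '\0' && n < 16; ++p) {
    if (*p == ':' || *p == '.') continue;
    hex[n++] = *p;
  }
  hex[n] = '\0';
  return (int64_t)strtoll(hex, nullptr, 16);
}

int main() {
  printf("0x%llx\n", (unsigned long long)busIdToInt64Sketch("0000:3b:00.0"));
  return 0;
}
```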

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649
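
A small sketch of the pattern (the thread function and names are illustrative, not NCCL's code):

```cpp
#include <pthread.h>
#include <cstdio>

static void* serviceLoop(void*) {
  // long-lived background work would run here
  return nullptr;
}

// Threads that are never pthread_join()'d are detached right after creation,
// so their resources are reclaimed automatically and ThreadSanitizer stops
// reporting them at exit.
int main() {
  pthread_t t;
  if (pthread_create(&t, nullptr, serviceLoop, nullptr) != 0) {
    perror("pthread_create");
    return 1;
  }
  pthread_detach(t);
  return 0;
}
```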

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
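
For the new ncclGetLastError() entry above, a minimal error-reporting sketch (the wrapper function is an assumption; ncclGetLastError and ncclGetErrorString are the actual API):

```cpp
#include <nccl.h>
#include <cstdio>

// Sketch: after a failing NCCL call, combine the generic error string with
// the more detailed last-error message introduced in this release.
void reportNcclFailure(ncclComm_t comm, ncclResult_t res) {
  if (res == ncclSuccess) return;
  fprintf(stderr, "NCCL failure: %s\nlast error: %s\n",
          ncclGetErrorString(res), ncclGetLastError(comm));
}

int main() {
  reportNcclFailure(nullptr, ncclSuccess);  // nothing to report in this toy run
  return 0;
}
```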

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
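
A hedged sketch of the ordering this patch restores (level names and values are simplified stand-ins for NCCL's internals, not its actual code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <strings.h>

enum { LOG_NONE = 0, LOG_VERSION = 1, LOG_WARN = 2, LOG_INFO = 3 };

int main() {
  // Parse NCCL_DEBUG into a temporary level first...
  int tempDebugLevel = LOG_NONE;
  const char* dbg = getenv("NCCL_DEBUG");
  if (dbg != nullptr) {
    if (strcasecmp(dbg, "WARN") == 0) tempDebugLevel = LOG_WARN;
    else if (strcasecmp(dbg, "INFO") == 0) tempDebugLevel = LOG_INFO;
  }
  // ...and only then look at NCCL_DEBUG_FILE, when the level is above VERSION.
  FILE* out = stdout;
  const char* dbgFile = getenv("NCCL_DEBUG_FILE");
  if (tempDebugLevel > LOG_VERSION && dbgFile != nullptr) {
    FILE* f = fopen(dbgFile, "w");
    if (f != nullptr) out = f;
  }
  fprintf(out, "debug level %d\n", tempDebugLevel);
  if (out != stdout) fclose(out);
  return 0;
}
```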

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
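
A minimal sketch of the non-blocking init path listed above (the helper and polling loop are assumptions about typical usage; the config fields and functions are the 2.14 API):

```cpp
#include <nccl.h>

// Sketch: initialize a communicator in non-blocking mode, then poll the
// async state until initialization completes (or fails).
ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, int rank, ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  // return immediately, finish init in the background
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  ncclResult_t state = ncclInProgress;
  while (state == ncclInProgress) {
    res = ncclCommGetAsyncError(*comm, &state);
    if (res != ncclSuccess) return res;
  }
  return state;
}
```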

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid cudaMalloc
calls being blocked by an NCCL kernel launched on another GPU by another
thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change history comments

* resolve the build related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix  the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the LL128 shared memory issue and the one-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix the max/min mismatch issue and the LL128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the number of GPUs

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if NCCL has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix the IB bandwidth test case issue

* fix the fence, proxy and sync flags setting related issues

* unified the topo file name for different sku

* add VMSS creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change maxbytes to a smaller value such as 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#2)

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)
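
A hedged sketch of the idea (the hash function and names are illustrative, not NCCL's exact implementation): mixing the hostname with the UTS and mount namespace links makes two containers on the same machine produce different host hashes, so they do not wrongly attempt P2P/SHM with each other.

```cpp
#include <cstdint>
#include <cstdio>
#include <unistd.h>

// Simple FNV-1a hash, used here only for illustration.
static uint64_t fnv1a(const char* s, uint64_t h) {
  for (; *s; ++s) { h ^= (unsigned char)*s; h *= 0x100000001b3ULL; }
  return h;
}

static uint64_t getHostHashSketch() {
  uint64_t h = 0xcbf29ce484222325ULL;
  char name[1024] = {0};
  if (gethostname(name, sizeof(name) - 1) == 0) h = fnv1a(name, h);
  const char* links[] = { "/proc/self/ns/uts", "/proc/self/ns/mnt" };
  for (const char* link : links) {
    char target[256] = {0};
    ssize_t n = readlink(link, target, sizeof(target) - 1);
    if (n > 0) { target[n] = '\0'; h = fnv1a(target, h); }
  }
  return h;
}

int main() {
  printf("host hash: %016llx\n", (unsigned long long)getHostHashSketch());
  return 0;
}
```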

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
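
A hedged sketch of the retry pattern described here (the macro name mirrors the commit message but the body is simplified):

```cpp
#include <cerrno>
#include <cstdio>
#include <unistd.h>

// Wrap a syscall so that EINTR/EAGAIN leads to a retry even when the call
// site itself is not inside a loop.
#define SYSCHECKVAL_SKETCH(call, name, retval) do {                     \
    do {                                                                \
      retval = (call);                                                  \
    } while (retval == -1 && (errno == EINTR || errno == EAGAIN));      \
    if (retval == -1) { perror(name); return -1; }                      \
  } while (0)

static int writeAll(int fd, const char* buf, size_t size) {
  size_t off = 0;
  while (off < size) {
    ssize_t n;
    SYSCHECKVAL_SKETCH(write(fd, buf + off, size - off), "write", n);
    off += (size_t)n;
  }
  return 0;
}

int main() { return writeAll(1, "hello\n", 6); }
```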

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

Current SHM object name would only use pidHash and ranks as
identification, which would collide each other when program runs with
multiple communicators. Here we added commId info into pidHash, it makes
'pidHash'es of different communicators keeping in same process will…
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Sep 5, 2023
* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with multiple
communicators. Here we added commId info into pidHash, so the pidHashes
of different communicators within the same process are kept distinct
from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.
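
A sketch of the matching rule this adds (the helper is illustrative): with NCCL_IB_HCA="mlx5_1", prefix matching also selects mlx5_10, mlx5_11, ..., whereas NCCL_IB_HCA="=mlx5_1" selects only the device named exactly mlx5_1.

```cpp
#include <cstring>
#include <cstdio>

static bool matchHca(const char* spec, const char* devName) {
  if (spec[0] == '=') return strcmp(spec + 1, devName) == 0;   // exact match
  return strncmp(spec, devName, strlen(spec)) == 0;            // prefix match
}

int main() {
  printf("%d %d\n", matchHca("mlx5_1", "mlx5_10"), matchHca("=mlx5_1", "mlx5_10"));
  // prints "1 0": the prefix form matches mlx5_10, the exact form does not.
  return 0;
}
```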

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was only introduced in Linux 3.9 and later kernels.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.
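
Putting the two clang fixes above together, a small sketch of the kind of macro wrapper described (the macro name is illustrative):

```cpp
// GCC understands optimize("O0"), clang understands optnone.
#if defined(__clang__)
#define NO_OPTIMIZE __attribute__((optnone))
#else
#define NO_OPTIMIZE __attribute__((optimize("O0")))
#endif

NO_OPTIMIZE static int busyWait(int n) {
  int x = 0;
  for (int i = 0; i < n; ++i) x += i;  // kept as-is even at -O2
  return x;
}

int main() { return busyWait(10) == 45 ? 0 : 1; }
```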

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when the class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element count, or
collective type would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid cudaMalloc
calls being blocked by an NCCL kernel launched on another GPU by another
thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose
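
A sketch of the pattern named by these two commits (the helper is illustrative, not NCCL's code): shutdown() actively signals end-of-stream to the peer before the descriptor is closed, so a peer blocked in recv() returns instead of waiting.

```cpp
#include <sys/socket.h>
#include <unistd.h>

static void closeSocketGracefully(int fd) {
  if (fd < 0) return;
  shutdown(fd, SHUT_RDWR);  // tell the peer no more data will flow
  close(fd);
}

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  closeSocketGracefully(fd);
  return 0;
}
```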

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
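
The new ncclCommSplit primitive listed above, as a minimal fragment (the helper, the color choice and the use of splitShare are assumptions about typical usage; an existing parent communicator and its rank are assumed):

```cpp
#include <nccl.h>

// Sketch: split the parent communicator into two halves by color, keeping
// rank ordering inside each half via the key, and opt into resource sharing.
ncclResult_t splitInHalves(ncclComm_t parent, int rank, ncclComm_t* sub) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.splitShare = 1;  // share resources with the parent communicator
  int color = rank % 2;   // even ranks in one sub-communicator, odd in the other
  return ncclCommSplit(parent, color, rank, sub, &config);
}
```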

* initial checkin

* fix the build issue when the cuda version is larger than 12

* enable ncv4 test scenarios

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* modify the test script to support training and inference test scenarios

* Prevent WR index truncation in the InfiniBand transport plugin

* Initial commit

* CODE_OF_CONDUCT.md committed

* LICENSE committed

* SECURITY.md committed

* README.md committed

* SUPPORT.md committed

* fix build break of previous FI

* remove test related assert

* Create msccl-algorithms folder by default

* Create msccl-algorithms folder by default

* enable make install & deb package for msccl

* include string header file for compiler compatibility

* resolve the build capability issue for arch below 800

* fix the logic issue of chunks calculation for cpu proxy

* fix the memory access violation issue when using simple protocol

* fix the work index issue when using cuda graph (#9)

* Msccl v2.18 (#2)

* Fixed bug in MPI initialization.

* Use semantic versioning

* Build SM 5.0 code

* Don't link tests with NVML

* Added Debian packaging files

* Update deb packaging scripts

* fix a typo in README.md

* Fixed deadlock in back-to-back reduce_scatters.

Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added support for more than 8 GPUs.

Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Moved tests to separate dir and improved MPI test

test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Enabled support for char type to be unsigned.

GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added NCCL error checking to tests.

Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Libwrap checks for LIB.so.1 if LIB.so not found

Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed buffer overflow in ReduceOrCopy

Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed useRemoteRecv consistency issue.

Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Updated package version, added manpage

* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with multiple
communicators. Here we added commId info into pidHash, so the pidHashes
of different communicators within the same process are kept distinct
from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
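
A sketch of the kind of guard this implies (function name is illustrative); the option is simply skipped when the headers predate SO_REUSEPORT.

```c
#include <sys/socket.h>

static int trySetReusePort(int fd) {
#if defined(SO_REUSEPORT)
  int opt = 1;
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#else
  (void)fd;   /* older kernels/headers: nothing to do */
  return 0;
#endif
}
```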

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.
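
Putting the two clang fixes together, a sketch of how the compiler difference can be wrapped in a macro (the macro name is illustrative): GCC understands optimize("O0"), while clang uses optnone.

```c
#if defined(__clang__)
#define NCCL_NO_OPTIMIZE __attribute__((optnone))
#elif defined(__GNUC__)
#define NCCL_NO_OPTIMIZE __attribute__((optimize("O0")))
#else
#define NCCL_NO_OPTIMIZE
#endif
```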

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.
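
The class of bug being fixed, sketched with illustrative names: computing a byte offset in 32-bit arithmetic truncates once it exceeds 4G, so the offset must be computed and carried in 64 bits.

```c
#include <stddef.h>

static inline char* chunkPtr(char* base, int rank, size_t countPerRank, size_t eltSize) {
  /* Wrong: int offset = rank * countPerRank * eltSize;   // truncates above 4G */
  size_t offset = (size_t)rank * countPerRank * eltSize;   /* 64-bit arithmetic */
  return base + offset;
}
```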

* Check return code for Flush operation

Current NCCL code does not abort when a Flush operation by the
underlying network fails. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issues in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when the class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
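
A minimal sketch of the ncclRedOpCreatePreMulSum usage mentioned above, here turning an allreduce sum into an average by pre-multiplying each contribution with 1/nRanks; buffer, stream and communicator setup are assumed to exist already.

```c
#include <nccl.h>

static ncclResult_t allreduceAverage(const float* sendbuff, float* recvbuff, size_t count,
                                     int nRanks, ncclComm_t comm, cudaStream_t stream) {
  float scalar = 1.0f / (float)nRanks;
  ncclRedOp_t premulSum;
  ncclRedOpCreatePreMulSum(&premulSum, &scalar, ncclFloat, ncclScalarHostImmediate, comm);
  ncclResult_t res = ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, premulSum, comm, stream);
  ncclRedOpDestroy(premulSum, comm);   /* release the user-defined reduction op */
  return res;
}
```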

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
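
A sketch of such a stack-only conversion (not the exact NCCL implementation): parse the hex digits of a busId string like "0000:3b:00.0" directly into an int64, with no temporary heap buffers.

```c
#include <stdint.h>

static int64_t busIdToInt64(const char* busId) {
  int64_t v = 0;
  for (const char* p = busId; *p; p++) {
    char c = *p;
    if (c >= '0' && c <= '9')      v = v * 16 + (c - '0');
    else if (c >= 'a' && c <= 'f') v = v * 16 + (c - 'a' + 10);
    else if (c >= 'A' && c <= 'F') v = v * 16 + (c - 'A' + 10);
    /* ':' and '.' separators are simply skipped */
  }
  return v;
}
```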

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649
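
The pattern, sketched with an illustrative thread function:

```c
#include <pthread.h>
#include <stddef.h>

static void* serviceLoop(void* arg) { (void)arg; return NULL; }  /* placeholder body */

static int startDetachedThread(void) {
  pthread_t t;
  if (pthread_create(&t, NULL, serviceLoop, NULL) != 0) return -1;
  /* This thread is never pthread_join()'ed, so detach it: its resources are
   * reclaimed on exit and ThreadSanitizer no longer flags it. */
  return pthread_detach(t);
}
```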

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
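
A minimal sketch of the new error reporting in use, assuming an already-initialized communicator: query the asynchronous state and print the last error string when a remote failure is detected.

```c
#include <stdio.h>
#include <nccl.h>

static void reportAsyncError(ncclComm_t comm) {
  ncclResult_t asyncErr;
  if (ncclCommGetAsyncError(comm, &asyncErr) != ncclSuccess) return;
  if (asyncErr == ncclRemoteError) {
    /* the failure happened on a remote peer or in the network, not locally */
    fprintf(stderr, "NCCL remote error: %s\n", ncclGetLastError(comm));
  }
}
```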

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
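
A minimal sketch of the non-blocking initialization and finalization path mentioned above, assuming the unique id has already been broadcast to all ranks:

```c
#include <nccl.h>

static ncclResult_t initNonBlocking(ncclComm_t* comm, int nRanks, int rank, ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;                               /* return immediately, poll for completion */
  ncclCommInitRankConfig(comm, nRanks, id, rank, &config);
  ncclResult_t state;
  do {
    ncclCommGetAsyncError(*comm, &state);
  } while (state == ncclInProgress);
  return state;
}

/* Later, tear down gracefully:
 *   ncclCommFinalize(comm);   // complete outstanding operations
 *   ncclCommDestroy(comm);    // free resources
 */
```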

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen
before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes, whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
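
A minimal sketch of the ncclCommSplit primitive mentioned above, here splitting a world communicator into one sub-communicator per node with resource sharing enabled; nodeId and localRank are assumed to come from the application.

```c
#include <nccl.h>

static ncclResult_t splitPerNode(ncclComm_t worldComm, int nodeId, int localRank,
                                 ncclComm_t* nodeComm) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.splitShare = 1;   /* share parent resources where possible */
  /* color groups ranks into sub-communicators; key orders ranks within each group */
  return ncclCommSplit(worldComm, /*color=*/nodeId, /*key=*/localRank, nodeComm, &config);
}
```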

* initial checkin

* fix the build issue when the CUDA version is larger than 12

* enable ncv4 test scenarios

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* modify the test script to support training and inference test scenarios

* Prevent WR index truncation in the InfiniBand transport plugin

* fix build break of previous FI

* remove test related assert

* Create msccl-algorithms folder by default

* fix the memory access violation issue when using simple protocol

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000000.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: root <root@liand-h100-validation-vmss00000B.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000000.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
Co-authored-by: Microsoft Open Source <[email protected]>
Co-authored-by: root <root@liand-h100-validation-vmss00000B.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: root <root@superbench000008.5czzseio4l3u3nxaefzoshwirc.jx.internal.cloudapp.net>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Sep 12, 2023
* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env vars
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll
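
The call affected by this change, in a minimal sketch; the device list can now be passed as a const array.

```c
#include <nccl.h>

static void initFourDevices(ncclComm_t comms[4]) {
  const int devs[4] = {0, 1, 2, 3};        /* const is accepted after this change */
  ncclCommInitAll(comms, 4, devs);
}
```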

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.
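
A sketch of the idea with illustrative names: enable peer access only toward the previous and next device on the ring, so the per-device peer-mapping limit is not exhausted on larger systems.

```c
#include <cuda_runtime.h>

static void enableRingNeighborPeers(const int* ringDevs, int nDev, int idx) {
  int self = ringDevs[idx];
  int prev = ringDevs[(idx + nDev - 1) % nDev];
  int next = ringDevs[(idx + 1) % nDev];
  cudaSetDevice(self);
  if (prev != self) cudaDeviceEnablePeerAccess(prev, 0);
  if (next != self && next != prev) cudaDeviceEnablePeerAccess(next, 0);
}
```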

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)
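
A sketch of the idea (the actual hash NCCL uses differs): concatenate the UTS and mount namespace links and hash them, so that processes sharing both namespaces, and only those, get the same host hash for P2P/SHM decisions.

```c
#include <string.h>
#include <unistd.h>

static unsigned long hostHash(void) {
  char buf[256] = {0};
  char tmp[120];
  ssize_t n = readlink("/proc/self/ns/uts", tmp, sizeof(tmp) - 1);
  if (n > 0) { tmp[n] = '\0'; strncat(buf, tmp, sizeof(buf) - strlen(buf) - 1); }
  n = readlink("/proc/self/ns/mnt", tmp, sizeof(tmp) - 1);
  if (n > 0) { tmp[n] = '\0'; strncat(buf, tmp, sizeof(buf) - strlen(buf) - 1); }
  unsigned long h = 5381;                       /* djb2, for illustration only */
  for (size_t i = 0; buf[i]; i++) h = h * 33 + (unsigned char)buf[i];
  return h;
}
```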

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

Current SHM object names would only use pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. Here we added commId info into pidHash, so the
'pidHash'es of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort when a Flush operation by the
underlying network fails. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issues in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when the class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen
before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes, whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.

* initial checkin

* fix the build issue when the CUDA version is larger than 12

* enable ncv4 test scenarios

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* modify the test script to support training and inference test scenarios

* Prevent WR index truncation in the InfiniBand transport plugin

* Initial commit

* CODE_OF_CONDUCT.md committed

* LICENSE committed

* SECURITY.md committed

* README.md committed

* SUPPORT.md committed

* fix build break of previous FI

* remove test related assert

* Create msccl-algorithms folder by default

* Create msccl-algorithms folder by default

* enable make install & deb package for msccl

* include string header file for compiler compatibility

* resolve the build capability issue for arch below 800

* fix the logic issue of chunks calculation for cpu proxy

* fix the memory access violation issue when using simple protocol

* fix the work index issue when using cuda graph (#9)

* Msccl v2.18 (#2)

* Fixed bug in MPI initialization.

* Use semantic versioning

* Build SM 5.0 code

* Don't link tests with NVML

* Added Debian packaging files

* Update deb packaging scripts

* fix a typo in README.md

* Fixed deadlock in back-to-back reduce_scatters.

Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added support for more than 8 GPUs.

Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Moved tests to separate dir and improved MPI test

test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Enabled support for char type to be unsigned.

GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added NCCL error checking to tests.

Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Libwrap checks for LIB.so.1 if LIB.so not found

Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed buffer overflow in ReduceOrCopy

Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed useRemoteRecv consistency issue.

Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Updated package version, added manpage

* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env vars
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.
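
A minimal sketch of how the new error-handling API might be used (the polling helper is illustrative; error handling is abbreviated):

```c
/* Illustrative polling helper built on the new API (not NCCL-internal code). */
#include <nccl.h>

ncclResult_t pollCommForErrors(ncclComm_t comm) {
  ncclResult_t asyncErr;
  ncclResult_t ret = ncclCommGetAsyncError(comm, &asyncErr);
  if (ret != ncclSuccess) return ret;
  if (asyncErr != ncclSuccess) {
    ncclCommAbort(comm);   /* tear down the communicator after a network error */
    return asyncErr;
  }
  return ncclSuccess;
}
```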

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. We now add commId info into pidHash, so that
the 'pidHash'es of different communicators living in the same process
are distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
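
A hedged sketch of the conditional-compilation pattern described above (the helper name is illustrative):

```c
/* Illustrative guard: only use SO_REUSEPORT when the build environment knows
 * about it (Linux 3.9+); otherwise compile to a no-op. */
#include <sys/socket.h>

static int setReusePort(int fd) {
#ifdef SO_REUSEPORT
  int opt = 1;
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#else
  (void)fd;
  return 0;
#endif
}
```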

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing a performance issue in cases where different NICs are
used on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
type would cause a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
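
A minimal usage sketch of the new pre-multiply-sum reduction, assuming device buffers, count, communicator and stream are set up elsewhere:

```c
/* Illustrative usage of ncclRedOpCreatePreMulSum (not library-internal code). */
#include <nccl.h>

ncclResult_t scaledAllReduce(const float* sendbuff, float* recvbuff, size_t count,
                             float scale, ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t premulSum;
  /* ncclScalarHostImmediate: the scalar is read from host memory at call time. */
  ncclRedOpCreatePreMulSum(&premulSum, &scale, ncclFloat, ncclScalarHostImmediate, comm);
  ncclResult_t ret = ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, premulSum, comm, stream);
  ncclRedOpDestroy(premulSum, comm);
  return ret;
}
```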

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1.
This provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649
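
A small illustrative sketch of the pattern (the helper name and thread function are assumptions, not NCCL code):

```c
/* Illustrative pattern: detach helper threads that are never joined so their
 * resources are reclaimed automatically (and ThreadSanitizer stays quiet). */
#include <pthread.h>

static int launchDetachedThread(void* (*fn)(void*), void* arg) {
  pthread_t t;
  int err = pthread_create(&t, NULL, fn, arg);
  if (err == 0) pthread_detach(t);
  return err;
}
```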

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
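
A minimal sketch of how the new ncclGetLastError() call might be used alongside ncclGetErrorString() (the reporting helper is illustrative; the communicator argument may be NULL):

```c
/* Illustrative error reporting using ncclGetLastError(). */
#include <nccl.h>
#include <stdio.h>

void reportNcclFailure(ncclResult_t res, ncclComm_t comm) {
  if (res == ncclSuccess) return;
  fprintf(stderr, "NCCL failure: %s\nLast error: %s\n",
          ncclGetErrorString(res), ncclGetLastError(comm));
}
```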

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
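
A hedged sketch of the new non-blocking init/finalize flow, assuming the unique ID, rank and size come from the usual bootstrap (helper names are illustrative):

```c
/* Illustrative non-blocking init and finalize flow (not library-internal code). */
#include <nccl.h>

ncclResult_t initCommNonBlocking(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;                       /* calls may return while still in progress */
  ncclResult_t ret = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (ret != ncclSuccess && ret != ncclInProgress) return ret;
  ncclResult_t state = ncclInProgress;
  while (state == ncclInProgress) ncclCommGetAsyncError(*comm, &state);
  return state;
}

void finalizeComm(ncclComm_t comm) {
  ncclCommFinalize(comm);                    /* flush outstanding operations */
  ncclCommDestroy(comm);
}
```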

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc be blocked by a NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose
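
A minimal sketch of the shutdown-before-close pattern (the function name is illustrative, not the actual ncclSocketClose code):

```c
/* Illustrative shutdown-before-close, so the peer sees an orderly end of the
 * connection instead of an abrupt reset. */
#include <sys/socket.h>
#include <unistd.h>

static void closeConnection(int fd) {
  if (fd < 0) return;
  shutdown(fd, SHUT_RDWR);  /* signal EOF in both directions before releasing the fd */
  close(fd);
}
```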

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
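
A minimal usage sketch of the new ncclCommSplit() primitive, splitting a parent communicator into two halves (the helper name is illustrative; a NULL config inherits the parent settings):

```c
/* Illustrative split: ranks with the same color share a child communicator,
 * ordered by key. */
#include <nccl.h>

ncclResult_t splitIntoHalves(ncclComm_t parent, ncclComm_t* child) {
  int rank, nranks;
  ncclCommUserRank(parent, &rank);
  ncclCommCount(parent, &nranks);
  int color = (rank < nranks / 2) ? 0 : 1;
  return ncclCommSplit(parent, color, /*key=*/rank, child, /*config=*/NULL);
}
```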

* initial checkin

* fix the build issue when cuda version larger than 12

* enable ncv4 test scenarios

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* modify the test script to support training and inference test scenarios

* Prevent WR index truncation in the InfiniBand transport plugin

* fix build break of previous FI

* remove test related assert

* Create msccl-algorithms folder by default

* fix the memory access violation issue when using simple protocol

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000000.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: root <root@liand-h100-validation-vmss00000B.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>

* Msccl v2.18 (#10)

* Use semantic versioning

* Build SM 5.0 code

* Don't link tests with NVML

* Added Debian packaging files

* Update deb packaging scripts

* fix a typo in README.md

* Fixed deadlock in back-to-back reduce_scatters.

Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added support for more than 8 GPUs.

Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Moved tests to separate dir and improved MPI test

test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Enabled support for char type to be unsigned.

GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added NCCL error checking to tests.

Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Libwrap checks for LIB.so.1 if LIB.so not found

Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed buffer overflow in ReduceOrCopy

Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed useRemoteRecv consistency issue.

Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Updated package version, added manpage

* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and segfaulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is an undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was no…
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jan 12, 2024
* Use semantic versioning

* Build SM 5.0 code

* Don't link tests with NVML

* Added Debian packaging files

* Update deb packaging scripts

* fix a typo in README.md

* Fixed deadlock in back-to-back reduce_scatters.

Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added support for more than 8 GPUs.

Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Moved tests to separate dir and improved MPI test

test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Enabled support for char type to be unsigned.

GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added NCCL error checking to tests.

Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Libwrap checks for LIB.so.1 if LIB.so not found

Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed buffer overflow in ReduceOrCopy

Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed useRemoteRecv consistency issue.

Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Updated package version, added manpage

* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and segfaulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is an undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, we could silently miss an EINTR/EAGAIN
return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. We now add commId info into pidHash, so that
the 'pidHash'es of different communicators living in the same process
are distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing a performance issue in cases where different NICs are
used on a given channel.

* Fix wrong variable name "slice" to "chunk"

NVIDIA/nccl#287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to NVIDIA/nccl#560

ncclGroups containing operations of mixed datatype, element, or collective
type would cause a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1.
This provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc19.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes NVIDIA/nccl#649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (NVIDIA/nccl#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc be blocked by a NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes NVIDIA/nccl#726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* Prevent WR index truncation in the InfiniBand transport plugin

* Fix inter-node NVLS graph search

We were passing a net ID instead of a gpu index, which could cause
crashes if those were unrelated (and they usually are).

Issue #931

* 2.18.5-1

Fix NVLS search (issue #931).
Increase max IB NICs to 32.
Fix inconsistent device ordering (issue #820).
Try to use different devices for different GPUs in systems with
more than one NIC per GPU.

* Fix cudaMemcpyAsync bug

We were trying to use the result of the first cudaMemcpyAsync in the
second cudaMemcpyAsync without a sync in between. This patch fixes it
by allocating a CPU-side array to cache the device-side addresses, so
that this chain of dependent copies is avoided.

Fixes #957
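
A hedged sketch of the broken and fixed patterns, with illustrative names (not the actual NCCL code):

```c
/* Illustrative broken pattern: hostPtr is not valid yet when the second copy
 * is enqueued, because the first asynchronous copy has not completed.
 *
 *   cudaMemcpyAsync(&hostPtr, &devPtrTable[idx], sizeof(void*), cudaMemcpyDeviceToHost, stream);
 *   cudaMemcpyAsync(dstHost, hostPtr, bytes, cudaMemcpyDeviceToHost, stream);
 *
 * Illustrative fix: cache the device-side addresses in a host array up front,
 * so the second copy never depends on an in-flight copy. */
#include <cuda_runtime.h>

void copyViaHostCache(void** devPtrTable, void** hostPtrCache, int tableLen,
                      void* dstHost, size_t bytes, int idx, cudaStream_t stream) {
  cudaMemcpy(hostPtrCache, devPtrTable, tableLen * sizeof(void*), cudaMemcpyDeviceToHost);
  cudaMemcpyAsync(dstHost, hostPtrCache[idx], bytes, cudaMemcpyDeviceToHost, stream);
}
```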

* 2.19.1-1

Add local user buffer registration for NVLink SHARP.
Add tuning plugin support.
Increase net API to v7 to allow for device-side packet reordering;
remove support for v4 plugins.
Add support for RoCE ECE.
Add support for C2C links.
Better detect SHM allocation failures to avoid crash with Bus Error.
Fix missing thread unlocks in bootstrap (Fixes #936).
Disable network flush by default on H100.
Move device code from src/collectives/device to src/device.

* 2.19.3-1

H800/H100 fixes and tuning.
Re-enable intra-process direct pointer buffer access when CUMEM is
enabled.

* 2.18.6-1

* 2.19.4-1

Split transport connect phase into multiple steps to avoid port
exhaustion when connecting alltoall at large scale. Defaults to 128
peers per round.
Fix memory leaks on CUDA graph capture.
Fix alltoallv crash on self-sendrecv.
Make topology detection more deterministic when PCI speeds are not
available (fix issue #1020).
Properly close shared memory in NVLS resources.
Revert proxy detach after 5 seconds.
Add option to print progress during transport connect.
Add option to set NCCL_DEBUG to INFO on first WARN.

* Fix use of CPUID overwriting registers in use.

CPUID writes to EAX, EBX, ECX, and EDX, so the inline asm must state that.
Otherwise a register currently in use might get overwritten, which may
cause all kinds of failures such as segfaults or wrong results.

Alternatively, `__cpuid` can be used, which avoids this and related issues,
so do that as suggested in the GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513
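
For reference, a minimal sketch (assuming GCC or clang on x86) of the two safe alternatives: declaring all four registers as outputs of the inline asm, or using the `__cpuid` helper macro from `<cpuid.h>`.

```cpp
#include <cpuid.h>
#include <cstdint>
#include <cstdio>

static void cpuidAsm(uint32_t leaf, uint32_t* a, uint32_t* b, uint32_t* c, uint32_t* d) {
  // EAX, EBX, ECX and EDX are all listed as outputs, so the compiler knows
  // they get clobbered and will not keep live values in them across the asm.
  __asm__ volatile("cpuid"
                   : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                   : "a"(leaf), "c"(0u));
}

int main() {
  uint32_t a, b, c, d;
  cpuidAsm(0, &a, &b, &c, &d);
  __cpuid(0, a, b, c, d);  // same query through the compiler-provided macro
  printf("max basic CPUID leaf: %u\n", a);
  return 0;
}
```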

* resolve some msccl compatibility issues

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: Alexander Grund <[email protected]>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Feb 18, 2024
* Msccl v2.19 integrate (#40)

* Use semantic versioning

* Build SM 5.0 code

* Don't link tests with NVML

* Added Debian packaging files

* Update deb packaging scripts

* fix a typo in README.md

* Fixed deadlock in back-to-back reduce_scatters.

Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added support for more than 8 GPUs.

Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Moved tests to separate dir and improved MPI test

test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Enabled support for char type to be unsigned.

GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added NCCL error checking to tests.

Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Libwrap checks for LIB.so.1 if LIB.so not found

Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed buffer overflow in ReduceOrCopy

Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed useRemoteRecv consistency issue.

Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Updated package version, added manpage

* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and segfaulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.
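
For illustration, a reduced sketch of the lookup rule in question (names mirror the commit, but the code is simplified): with a plain type such as float there are no associated namespaces, so argument-dependent lookup finds nothing and the overload must be visible before the template that calls it.

```cpp
// Declaration placed before the template that uses it, so clang's two-phase
// lookup can resolve the dependent call at the template definition.
template<typename T> T vFetch(const volatile T* ptr);

template<typename T>
void ReduceCopy(volatile T* dst, const volatile T* src) {
  *dst = vFetch(src);  // found via ordinary lookup thanks to the declaration above
}

// The definition can follow later; older GCC tolerated having only this,
// clang does not.
template<typename T> T vFetch(const volatile T* ptr) { return *ptr; }
```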

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
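
As an illustration of the retry behavior described above (a sketch only; the real SYSCHECK macros differ), a helper that retries the call itself when it is interrupted:

```cpp
#include <cerrno>
#include <unistd.h>

// Retry the expression while it fails with EINTR so an interrupted syscall is
// never reported as an error; EAGAIN on non-blocking sockets is left to the
// caller, which typically waits (e.g. poll) before retrying.
#define RETRY_EINTR(res, call) \
  do { (res) = (call); } while ((res) == -1 && errno == EINTR)

static ssize_t writeAll(int fd, const char* buf, size_t len) {
  size_t off = 0;
  while (off < len) {
    ssize_t n;
    RETRY_EINTR(n, write(fd, buf + off, len - off));
    if (n <= 0) return -1;  // a real error: surface it to the caller
    off += (size_t)n;
  }
  return (ssize_t)off;
}
```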

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
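
A minimal sketch of such wrappers (illustrative names, not the actual NCCL code): posix_fallocate() reports errors through its return value without touching errno, and mmap() signals failure with MAP_FAILED, so neither fits a plain "-1 and errno" check.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <sys/mman.h>

static int fallocateWrap(int fd, off_t offset, off_t len) {
  int err = posix_fallocate(fd, offset, len);  // returns the error code directly
  if (err != 0) { errno = err; return -1; }    // translate to the -1/errno convention
  return 0;
}

static int mmapWrap(void** ptr, size_t size, int prot, int flags, int fd) {
  void* p = mmap(nullptr, size, prot, flags, fd, 0);
  if (p == MAP_FAILED) return -1;              // errno is already set by mmap
  *ptr = p;
  return 0;
}
```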

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with multiple
communicators. We now add commId info into pidHash, so the pidHashes of
different communicators kept in the same process are distinct from each
other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>
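
For context, a minimal sketch of this bug class (names illustrative, not the actual ncclStrToCpuset code): a fixed-size buffer filled without a terminating NUL makes strlen() read past the end of the array.

```cpp
#include <cstdio>
#include <cstring>

static void copyCpusetString(char* out, size_t outLen, const char* in) {
  // Buggy: strncpy() leaves the destination unterminated whenever the source
  // is at least outLen bytes long, so a later strlen(out) can run off the end.
  //   strncpy(out, in, outLen);
  //   size_t n = strlen(out);   // potential out-of-bounds read

  // Fixed: always write the terminator.
  snprintf(out, outLen, "%s", in);  // snprintf NUL-terminates (for outLen > 0)
  size_t n = strlen(out);           // now safe
  (void)n;
}
```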

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.
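
Taken together, the two clang-build fixes above amount to a compiler-dispatch macro along these lines (the macro name here is illustrative): GCC understands `__attribute__((optimize("O0")))` while clang uses `__attribute__((optnone))`.

```cpp
#if defined(__clang__)
#define NO_OPTIMIZE __attribute__((optnone))
#elif defined(__GNUC__)
#define NO_OPTIMIZE __attribute__((optimize("O0")))
#else
#define NO_OPTIMIZE
#endif

// Example: keep a debug helper unoptimized regardless of the build flags.
NO_OPTIMIZE static int debugProbe(volatile int* p) { return *p; }
```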

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by the
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issues in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

NVIDIA/nccl#287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to NVIDIA/nccl#560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum; a usage sketch follows this entry).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
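
As a hedged usage sketch of ncclRedOpCreatePreMulSum mentioned in the entry above (the helper function below and its error handling are illustrative; comm, stream and buffers are assumed to exist):

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

ncclResult_t premulAllreduce(const float* sendbuf, float* recvbuf, size_t count,
                             float scale, ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t op;
  // Each rank passes its own scalar; ncclScalarHostImmediate captures the
  // host-side value at call time.
  ncclResult_t res = ncclRedOpCreatePreMulSum(&op, &scale, ncclFloat,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);  // custom ops are per-communicator and must be freed
  return res;
}
```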

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
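
For illustration, a heap-free version of that kind of conversion might look like the sketch below (the real NCCL helper may differ in detail); it parses a PCI busId such as "0000:c1:00.0" into an int64 using only stack variables.

```cpp
#include <cstdint>

static int64_t busIdToInt64Sketch(const char* busId) {
  int64_t id = 0;
  for (const char* p = busId; *p; ++p) {
    char c = *p;
    if (c == ':' || c == '.') continue;      // separators carry no value
    int64_t digit;
    if (c >= '0' && c <= '9') digit = c - '0';
    else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
    else break;                              // stop at anything unexpected
    id = (id << 4) | digit;                  // accumulate one hex nibble
  }
  return id;
}
```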

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc19.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes NVIDIA/nccl#649
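
A minimal sketch of the pattern (illustrative names): a fire-and-forget service thread detached right after creation, so its resources are reclaimed automatically and ThreadSanitizer does not report a missing pthread_join().

```cpp
#include <pthread.h>

static void* serviceLoop(void* arg) {
  (void)arg;
  // ... poll sockets / make progress until told to stop ...
  return nullptr;
}

static int startDetachedService() {
  pthread_t t;
  if (pthread_create(&t, nullptr, serviceLoop, nullptr) != 0) return -1;
  pthread_detach(t);  // never joined, so detach to avoid leaking thread state
  return 0;
}
```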

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function (usage sketch after this entry).
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
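
As a hedged sketch of how the ncclGetLastError() addition above can be used (the NCCLCHECK macro and its error policy are illustrative, not part of NCCL):

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

#define NCCLCHECK(comm, call)                                           \
  do {                                                                  \
    ncclResult_t r = (call);                                            \
    if (r != ncclSuccess) {                                             \
      fprintf(stderr, "NCCL failure %s: %s\n", ncclGetErrorString(r),   \
              ncclGetLastError(comm));  /* human-readable detail */     \
      return r;                                                         \
    }                                                                   \
  } while (0)

ncclResult_t doAllReduce(const void* sendbuf, void* recvbuf, size_t count,
                         ncclComm_t comm, cudaStream_t stream) {
  NCCLCHECK(comm, ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum,
                                comm, stream));
  return ncclSuccess;
}
```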

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (NVIDIA/nccl#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves parsing `tempNcclDebugLevel` before processing `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior)

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function (sketch after this entry).
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
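
A hedged sketch of the non-blocking init path listed in this entry (the polling loop is simplified and illustrative; a real application would bound it and call ncclCommAbort() on timeout):

```cpp
#include <nccl.h>

ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  // return immediately, then poll for completion
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  // Poll until the communicator is ready.
  do {
    ncclCommGetAsyncError(*comm, &res);
  } while (res == ncclInProgress);
  return res;
}
```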

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls that invoke cudaMalloc (including devCommSetup) are
made before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes NVIDIA/nccl#726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* Prevent WR index truncation in the InfiniBand transport plugin

* Fix inter-node NVLS graph search

We were passing a net ID instead of a gpu index, which could cause
crashes if those were unrelated (and they usually are).

Issue #931

* 2.18.5-1

Fix NVLS search (issue #931).
Increase max IB NICs to 32.
Fix inconsistent device ordering (issue #820).
Try to use different devices for different GPUs in systems with
more than one NIC per GPU.

* Fix cudaMemcpyAsync bug

The copy result of the first cudaMemcpyAsync was used as an argument to
the second cudaMemcpyAsync without a synchronization in between. This
patch fixes it by allocating a CPU-side array that caches the device-side
addresses, so the back-to-back CUDA memory copies are no longer needed.

Fixes #957

* 2.19.1-1

Add local user buffer registration for NVLink SHARP.
Add tuning plugin support.
Increase net API to v7 to allow for device-side packet reordering;
remove support for v4 plugins.
Add support for RoCE ECE.
Add support for C2C links.
Better detect SHM allocation failures to avoid crash with Bus Error.
Fix missing thread unlocks in bootstrap (Fixes #936).
Disable network flush by default on H100.
Move device code from src/collectives/device to src/device.

* 2.19.3-1

H800/H100 fixes and tuning.
Re-enable intra-process direct pointer buffer access when CUMEM is
enabled.

* 2.18.6-1

* 2.19.4-1

Split transport connect phase into multiple steps to avoid port
exhaustion when connecting alltoall at large scale. Defaults to 128
peers per round.
Fix memory leaks on CUDA graph capture.
Fix alltoallv crash on self-sendrecv.
Make topology detection more deterministic when PCI speeds are not
available (fix issue #1020).
Properly close shared memory in NVLS resources.
Revert proxy detach after 5 seconds.
Add option to print progress during transport connect.
Add option to set NCCL_DEBUG to INFO on first WARN.

* Fix use of CPUID overwriting registers in use.

CPUID writes to EAX, EBX, ECX, and EDX, so the inline asm must state that.
Otherwise a register currently in use might get overwritten, which may
cause all kinds of failures such as segfaults or wrong results.

Alternatively, `__cpuid` can be used, which avoids this and related issues,
so do that as suggested in the GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513

* resolve some msccl compatibility issues

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: Alexander Grund <[email protected]>

* fix some integration bugs

* fix the sync flag release issue for msccl

* fix the msccl resource issue

* fix the correctness issue of op max

* remove indent before #if

* add alltoall support

* remove fp8 from nvls scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: Alexander Grund <[email protected]>