NCCL 2.13 Preview #682

Merged 1 commit on Jul 14, 2022
Conversation

sjeaugey
Member

Pushing this branch as a preview of NCCL 2.13. Feel free to give it a try and report issues on this PR.

@rashikakheria @changlan this introduces the v6 NET plugin API, which includes support for dmabuf. There is still a v4->v5->v6 compat layer.

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
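
For anyone trying the preview, below is a minimal sketch (not code from this PR) of how the new error-reporting additions could be used from an application. The wrapper function and buffer setup are assumptions made for illustration; `ncclGetLastError()`, `ncclGetErrorString()` and the `ncclRemoteError` result code correspond to the items listed above.

```
/* Sketch: surface the new ncclRemoteError code and the last-error string
 * added in NCCL 2.13. Assumes comm, sendbuf, recvbuf and stream were set
 * up elsewhere; the wrapper name is illustrative. */
#include <stdio.h>
#include <nccl.h>

ncclResult_t allreduceWithDiagnostics(const float* sendbuf, float* recvbuf,
                                      size_t count, ncclComm_t comm,
                                      cudaStream_t stream) {
  ncclResult_t res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat,
                                   ncclSum, comm, stream);
  if (res != ncclSuccess) {
    if (res == ncclRemoteError)  /* new in 2.13: error reported by a remote peer */
      fprintf(stderr, "remote network error detected\n");
    fprintf(stderr, "NCCL failure: %s (last error: %s)\n",
            ncclGetErrorString(res), ncclGetLastError(comm));
  }
  return res;
}
```
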
@rajachan

Is there a driver+device+runtime stack combination net_ib requires for use with an NVIDIA GPU as a dmabuf exporter?

sjeaugey deleted the v2.13 branch July 15, 2022 07:42
@spotluri
Collaborator

@rajachan these are the requirements:
Toolkit version: 11.7+
Driver version: 515.43.04+ w/ OpenRM (default for Turing+)
Device: Turing+
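
A hedged sketch of how an application could sanity-check most of those requirements at runtime is shown below. It only covers the CUDA toolkit version and compute capability; the OpenRM driver flavor is not visible through the CUDA runtime API, and the helper name is made up for illustration.

```
/* Sketch: check the dmabuf prerequisites listed above (CUDA 11.7+, Turing+).
 * The driver-flavor requirement (OpenRM) cannot be verified through the
 * CUDA runtime API, so it is not checked here. */
#include <cuda_runtime.h>

int dmabufPrereqsLikelyMet(int device) {
  int runtimeVersion = 0;
  struct cudaDeviceProp prop;
  if (cudaRuntimeGetVersion(&runtimeVersion) != cudaSuccess) return 0;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
  /* 11070 == CUDA 11.7; Turing corresponds to compute capability 7.5 */
  return runtimeVersion >= 11070 &&
         (prop.major > 7 || (prop.major == 7 && prop.minor >= 5));
}
```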

kingchc added a commit to kingchc/nccl that referenced this pull request Aug 4, 2022
Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (NVIDIA#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, so that `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
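
For context, the ordering described in that summary can be sketched as follows. This is illustrative C, not the actual NCCL debug-init code; the level values and helper are simplified, while the `tempNcclDebugLevel`, `NCCL_DEBUG` and `NCCL_DEBUG_FILE` names come from the summary above.

```
/* Illustrative sketch of the corrected ordering: parse NCCL_DEBUG into a
 * temporary level first, then consult NCCL_DEBUG_FILE only when that level
 * is above NCCL_LOG_VERSION (the pre-2.13.4 behavior). Not the real
 * ncclDebugInit(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { NCCL_LOG_NONE = 0, NCCL_LOG_VERSION = 1, NCCL_LOG_WARN = 2, NCCL_LOG_INFO = 3 };

static FILE* debugInitSketch(void) {
  int tempNcclDebugLevel = NCCL_LOG_NONE;
  const char* level = getenv("NCCL_DEBUG");
  if (level) {
    if (strcmp(level, "VERSION") == 0) tempNcclDebugLevel = NCCL_LOG_VERSION;
    else if (strcmp(level, "WARN") == 0) tempNcclDebugLevel = NCCL_LOG_WARN;
    else if (strcmp(level, "INFO") == 0) tempNcclDebugLevel = NCCL_LOG_INFO;
  }
  /* The fix: only look at NCCL_DEBUG_FILE once the level is known. */
  FILE* out = stdout;
  const char* file = getenv("NCCL_DEBUG_FILE");
  if (tempNcclDebugLevel > NCCL_LOG_VERSION && file != NULL) {
    FILE* f = fopen(file, "w");
    if (f) out = f;
  }
  return out;
}
```
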
sjeaugey pushed a commit that referenced this pull request Aug 18, 2022
Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, so that `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
spease-fb pushed a commit to spease-fb/nccl that referenced this pull request Dec 21, 2022
Squashed commit of the following:

commit 28189e2
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 29 04:27:46 2022 -0800

    2.16.2-1

    Add support for CUDA 12.0, drop Kepler (sm_35).
    Support for H100 features.
    Make socket code more robust and protected. Solves NVIDIA#555.
    Improve performance on large CUDA graphs, reducing dependencies.
    Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
    Various fixes to ncclCommAbort.
    Make service thread polling resistant to EINTR.
    Compile with profiling API by default.
    Extend NVTX instrumentation with call arguments.

commit 614b49f
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 22 02:13:13 2022 -0800

    Fix google-fastsocket plugin build

commit 55b1d8a
Author: Sylvain Jeaugey <[email protected]>
Date:   Mon Nov 21 06:03:27 2022 -0800

    Add documentation for NCCL NET plugins

    Also repurpose dummy plugin as example, including headers and
    compat layers from v6 to v2.

commit 2f4cb87
Merge: d128d62 cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 01:15:22 2022 -0700

    Merge tag 'v2.15.5-1'

commit cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 00:55:55 2022 -0700

    2.15.5-1

    Fix crash with CollnetChain on some node topologies
    Fix hang when interleaving the capture of different graphs
    Fix hang during init in multi-threaded mode
    Fix potential data corruption with LL128 protocol on unaligned buffers.
    Fix CPU usage during preconnect
    Fixes double-free in the error path for ncclCommInitAll
    Workaround hang on H100 with Ring/LL128 on 2 GPUs.

commit d128d62
Merge: 2401f4a da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Fri Oct 7 11:00:26 2022 -0700

    Merge tag 'v2.15.1-1'

commit 2401f4a
Author: John Bachan <[email protected]>
Date:   Mon Oct 3 17:02:15 2022 -0700

    Fixes a double-free in the error path of ncclCommInitAll.

    Fixes NVIDIA#726

commit da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:31:13 2022 -0700

    2.15.1-1

    Add support for H100 (sm90).
    Make sure NCCL kernels honor user stream priorities.

commit 99c28f2
Merge: 78313a6 ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:24:41 2022 -0700

    Merge remote-tracking branch 'origin/master'

commit 78313a6
Author: Cliff Woolley <[email protected]>
Date:   Fri Aug 26 15:00:18 2022 -0700

    Use compatibility shim only with static cudart

    Closes issue 658

commit ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Sep 22 01:04:50 2022 -0700

    Fix potential deadlock during init in multi-thread mode.

    Make sure all calls to cudaMalloc (including devCommSetup) happen
    before the last bootstrapBarrier. That way, we avoid calls to
    cudaMalloc being blocked by an NCCL kernel launched on another GPU by
    another thread that completed init faster.

    Resolve NVIDIA#623.

commit f89fd47
Author: Jane Xu <[email protected]>
Date:   Wed Sep 14 11:16:17 2022 -0400

    address review comments

commit 79fb032
Author: Jane Xu <[email protected]>
Date:   Tue Sep 13 16:05:21 2022 -0400

    Fix intermittent 11.6 builds: generate unique .cu file for each object file

commit c4e2aa6
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Aug 18 02:53:17 2022 -0700

    2.14.3-1

    Add support for improved fault tolerance: non-blocking mode, new
    init function with config, and ncclCommFinalize function.
    Reintroduce collnet+chain algorithm, alongside collnet+direct.
    Add LL protocol for intra-node P2P (on by default) and network
    communication (off by default).
    Use network instead of shared memory when performance is better.
    Fix: wait for CUDA graph destroy before destroying comm with linked
    graph resources.
    Remove aggressive polling during enqueue.
    Fix DMABUF fallback on MOFED 5.4 and earlier.

commit e1d9b27
Author: Ching-Hsiang Chu <[email protected]>
Date:   Wed Aug 3 20:47:40 2022 -0700

    fix NCCL_DEBUG_FILE

    Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (NVIDIA#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, so that `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

    Differential Revision: D38415208

    fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
facebook-github-bot pushed a commit to facebookresearch/nccl that referenced this pull request Dec 23, 2022
Summary:
Squashed commit of the following:

commit 28189e2
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 29 04:27:46 2022 -0800

    2.16.2-1

    Add support for CUDA 12.0, drop Kepler (sm_35).
    Support for H100 features.
    Make socket code more robust and protected. Solves NVIDIA#555.
    Improve performance on large CUDA graphs, reducing dependencies.
    Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
    Various fixes to ncclCommAbort.
    Make service thread polling resistant to EINTR.
    Compile with profiling API by default.
    Extend NVTX instrumentation with call arguments.

commit 614b49f
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 22 02:13:13 2022 -0800

    Fix google-fastsocket plugin build

commit 55b1d8a
Author: Sylvain Jeaugey <[email protected]>
Date:   Mon Nov 21 06:03:27 2022 -0800

    Add documentation for NCCL NET plugins

    Also repurpose dummy plugin as example, including headers and
    compat layers from v6 to v2.

commit 2f4cb87
Merge: d128d62 cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 01:15:22 2022 -0700

    Merge tag 'v2.15.5-1'

commit cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 00:55:55 2022 -0700

    2.15.5-1

    Fix crash with CollnetChain on some node topologies
    Fix hang when interleaving the capture of different graphs
    Fix hang during init in multi-threaded mode
    Fix potential data corruption with LL128 protocol on unaligned buffers.
    Fix CPU usage during preconnect
    Fixes double-free in the error path for ncclCommInitAll
    Workaround hang on H100 with Ring/LL128 on 2 GPUs.

commit d128d62
Merge: 2401f4a da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Fri Oct 7 11:00:26 2022 -0700

    Merge tag 'v2.15.1-1'

commit 2401f4a
Author: John Bachan <[email protected]>
Date:   Mon Oct 3 17:02:15 2022 -0700

    Fixes a double-free in the error path of ncclCommInitAll.

    Fixes NVIDIA#726

commit da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:31:13 2022 -0700

    2.15.1-1

    Add support for H100 (sm90).
    Make sure NCCL kernels honor user stream priorities.

commit 99c28f2
Merge: 78313a6 ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:24:41 2022 -0700

    Merge remote-tracking branch 'origin/master'

commit 78313a6
Author: Cliff Woolley <[email protected]>
Date:   Fri Aug 26 15:00:18 2022 -0700

    Use compatibility shim only with static cudart

    Closes issue 658

commit ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Sep 22 01:04:50 2022 -0700

    Fix potential deadlock during init in multi-thread mode.

    Make sure all calls to cudaMalloc (including devCommSetup) happen
    before the last bootstrapBarrier. That way, we avoid calls to
    cudaMalloc being blocked by an NCCL kernel launched on another GPU by
    another thread that completed init faster.

    Resolve NVIDIA#623.

commit f89fd47
Author: Jane Xu <[email protected]>
Date:   Wed Sep 14 11:16:17 2022 -0400

    address review comments

commit 79fb032
Author: Jane Xu <[email protected]>
Date:   Tue Sep 13 16:05:21 2022 -0400

    Fix intermittent 11.6 builds: generate unique .cu file for each object file

commit c4e2aa6
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Aug 18 02:53:17 2022 -0700

    2.14.3-1

    Add support for improved fault tolerance: non-blocking mode, new
    init function with config, and ncclCommFinalize function.
    Reintroduce collnet+chain algorithm, alongside collnet+direct.
    Add LL protocol for intra-node P2P (on by default) and network
    communication (off by default).
    Use network instead of shared memory when performance is better.
    Fix: wait for CUDA graph destroy before destroying comm with linked
    graph resources.
    Remove aggressive polling during enqueue.
    Fix DMABUF fallback on MOFED 5.4 and earlier.

commit e1d9b27
Author: Ching-Hsiang Chu <[email protected]>
Date:   Wed Aug 3 20:47:40 2022 -0700

    fix NCCL_DEBUG_FILE

    Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (NVIDIA#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, so that `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

    Differential Revision: D38415208

    fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

Pull Request resolved: #17

Test Plan:
These commands are to be applied to the whole stack rather than this specific commit:

```
buck2 run fbcode//mode/opt -c hpc_comms.use_nccl=exp fbcode//param_bench/train/comms/cpp/nccl-tests:nccl_tests_launcher -- --launcher mast --hw tc_any --nnode 2 --collective allreduce,alltoall,allgather,reducescatter --nccl-args "-b 4 -e 1G -f 2 -z 1" --dp ai_system_sw_hw_co-design_cws --entitlement codesign
```
https://www.internalfb.com/mast/job/torchx-nccl-test-allreduce-alltoall-allgather-reducescatter-3301bc

Reviewed By: agangidi53

Differential Revision: D42194038

Pulled By: spease-fb

fbshipit-source-id: e2cde6c44bcb8494c9dd02e926938f27bbc8f43b
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jun 26, 2023
* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with multiple
communicators. Here we added commId info into pidHash, so that the
'pidHash'es of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

NVIDIA/nccl#287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379: topology injection failing when using fewer GPUs than
described in the XML.
Fix #394: protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to NVIDIA/nccl#560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc19.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes NVIDIA/nccl#649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (NVIDIA/nccl#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, so that `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen
before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread that completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes NVIDIA/nccl#726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the 1-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max/min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the number of GPUs

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl was not built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple nodes

* initiate the multi-node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix the fence, proxy, and sync flag setting related issues

* unify the topo file name for different SKUs

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for the multi-node scenario

* change maxbyte to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jun 26, 2023
* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets liked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and be permit recover.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

Current SHM object name would only use pidHash and ranks as
identification, which would collide each other when program runs with
multiple communicators. Here we added commId info into pidHash, it makes
'pidHash'es of different communicators keeping in same process will be
distincted with each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_RESUEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

NVIDIA/nccl#287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to NVIDIA/nccl#560

ncclGroup's containing operations of mixed datatype, element, or collective
would induce crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
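
For reference, a hedged sketch of the new pre-multiply-sum reduction mentioned above, used here to compute an average across ranks (error handling omitted; this is an illustration, not NCCL's own code):

```c
#include <nccl.h>

/* Create a reduction that scales each rank's input by 1/nranks before the
 * inter-rank summation, i.e. an average, then use it in an allreduce. */
void allreduce_average(const float* sendbuf, float* recvbuf, size_t count,
                       int nranks, ncclComm_t comm, cudaStream_t stream) {
  float scalar = 1.0f / (float)nranks;
  ncclRedOp_t op;
  ncclRedOpCreatePreMulSum(&op, &scalar, ncclFloat, ncclScalarHostImmediate, comm);
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);  /* the op may be destroyed once its last use is enqueued */
}
```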

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
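
A hedged sketch of such a heap-free conversion (illustrative only, not the exact NCCL implementation):

```c
#include <stdint.h>

/* Accumulate the hex digits of a PCI busId such as "0000:07:00.0" into an
 * int64, skipping the ':' and '.' separators. No temporary buffers needed. */
static int64_t busIdToInt64(const char* busId) {
  int64_t id = 0;
  for (const char* p = busId; *p; p++) {
    char c = *p;
    if (c == ':' || c == '.') continue;
    if (c >= '0' && c <= '9') id = (id << 4) | (c - '0');
    else if (c >= 'a' && c <= 'f') id = (id << 4) | (c - 'a' + 10);
    else if (c >= 'A' && c <= 'F') id = (id << 4) | (c - 'A' + 10);
    else break;
  }
  return id;
}
```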

* Improve warning message about truncated messages

Display hints about the cause so that it would be easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc19.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes NVIDIA/nccl#649
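
For context, a minimal sketch of the pattern (function names are illustrative):

```c
#include <pthread.h>
#include <stddef.h>

static void* service_loop(void* arg) { (void)arg; return NULL; }

/* A worker thread that is never joined should be detached so its resources
 * are reclaimed automatically and tools like ThreadSanitizer do not report
 * it as a leaked joinable thread. */
static int start_service_thread(void) {
  pthread_t t;
  int err = pthread_create(&t, NULL, service_loop, NULL);
  if (err == 0) err = pthread_detach(t);
  return err;
}
```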

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
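
A hedged usage sketch for the new error-reporting entry point listed above, assuming the 2.13 signature `const char* ncclGetLastError(ncclComm_t comm)`:

```c
#include <stdio.h>
#include <nccl.h>

/* On failure, print both the result code's text and the last error message
 * recorded by NCCL, which may carry more detail (e.g. a remote network error). */
static void check(ncclResult_t res, ncclComm_t comm) {
  if (res != ncclSuccess) {
    fprintf(stderr, "NCCL failure: %s\nLast error: %s\n",
            ncclGetErrorString(res), ncclGetLastError(comm));
  }
}
```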

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (NVIDIA/nccl#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
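
For reference, a hedged sketch of how the two variables are typically combined; both must be set before NCCL initialization, and the path below is a placeholder (`%h`/`%p` expand to hostname and PID):

```c
#include <stdlib.h>

/* Direct NCCL INFO logging to a per-host, per-process file instead of stdout.
 * NCCL_DEBUG_FILE is only honored when NCCL_DEBUG enables logging. */
static void enable_file_logging(void) {
  setenv("NCCL_DEBUG", "INFO", 1);
  setenv("NCCL_DEBUG_FILE", "/tmp/nccl-debug.%h.%p", 1);
}
```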

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
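
A hedged sketch of the non-blocking initialization path described above (assuming the 2.14 config API: `NCCL_CONFIG_INITIALIZER`, `ncclCommInitRankConfig`, `ncclCommGetAsyncError`, `ncclCommFinalize`):

```c
#include <nccl.h>

/* Create a communicator without blocking the calling thread, then poll its
 * async state until setup completes. At teardown, ncclCommFinalize() flushes
 * outstanding work before ncclCommDestroy(). */
static ncclResult_t init_nonblocking(ncclComm_t* comm, int nranks,
                                     ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  do {
    ncclCommGetAsyncError(*comm, &res);
  } while (res == ncclInProgress);
  return res;
}
```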

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) are made
before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes NVIDIA/nccl#726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
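
As a hedged illustration, the new options above are plain fields of `ncclConfig_t` set before communicator creation (the values below are placeholders, not recommendations):

```c
#include <nccl.h>

/* Request specific kernel sizing and network selection for this communicator. */
static ncclResult_t init_with_tuning(ncclComm_t* comm, int nranks,
                                     ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.minCTAs = 4;          /* lower bound on CTAs per kernel */
  config.maxCTAs = 16;         /* upper bound on CTAs per kernel */
  config.cgaClusterSize = 2;   /* cooperative group array cluster size */
  config.netName = "Socket";   /* equivalent to NCCL_NET=Socket */
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```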

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change history comments

* resolve the build related issue

* enable MSCCL when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix  the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the LL128 shared memory issue and the one-process-with-multiple-GPUs issue

* remove the unnecessary code to enable CUDA graphs and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the GPU count

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if NCCL has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix IB bandwidth test case issue

* fix the fence, proxy, and sync flag setting related issues

* unified the topo file name for different sku

* add VMSS creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change the maxbytes to a smaller value like 65565 for the multi-node scenario

* remove test related asset

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jun 28, 2023
* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets liked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.
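
A hedged sketch of the idea (CUDA peer access is enabled only toward the ring neighbors rather than toward every device; names are illustrative):

```c
#include <cuda_runtime.h>

/* Enable P2P access only to the previous and next device in the ring,
 * instead of every pair, which can exceed the peer-access limit on systems
 * with many GPUs under one PCIe root complex. */
static void enable_ring_peer_access(int myDev, int prevDev, int nextDev) {
  cudaSetDevice(myDev);
  cudaDeviceEnablePeerAccess(prevDev, 0);
  if (nextDev != prevDev) cudaDeviceEnablePeerAccess(nextDev, 0);
}
```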

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
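
A hedged sketch of the retry pattern (the macro name and shape are illustrative, not NCCL's exact macros):

```c
#include <errno.h>

/* Repeat a system call while it fails with EINTR or EAGAIN, so transient
 * interruptions are not reported as hard errors. */
#define SYSCHECK_RETRY(call, retval) do {                             \
    (retval) = (call);                                                \
  } while ((retval) == -1 && (errno == EINTR || errno == EAGAIN))
```

For example, `SYSCHECK_RETRY(read(fd, buf, n), bytes);` keeps retrying a read that was interrupted by a signal.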

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
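
A hedged sketch of such wrappers, translating both calls to the classic "-1 plus errno" convention (illustrative, not NCCL's exact code):

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

/* posix_fallocate() returns the error number directly instead of setting
 * errno, so store it in errno and return -1 to match the usual convention. */
static int fallocateWrap(int fd, off_t offset, off_t len) {
  int err = posix_fallocate(fd, offset, len);
  if (err != 0) { errno = err; return -1; }
  return 0;
}

/* mmap() signals failure with MAP_FAILED (errno already set), so translate
 * that to a plain -1 return and pass the mapping back through a pointer. */
static int mmapWrap(void** ptr, size_t size, int prot, int flags, int fd) {
  void* p = mmap(NULL, size, prot, flags, fd, 0);
  if (p == MAP_FAILED) return -1;
  *ptr = p;
  return 0;
}
```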

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

Current SHM object name would only use pidHash and ranks as
identification, which would collide with each other when a program runs with
multiple communicators. Here we added commId info into pidHash, so that the
'pidHash'es of different communicators kept in the same process will be
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

NVIDIA/nccl#287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to NVIDIA/nccl#560

ncclGroup's containing operations of mixed datatype, element, or collective
would induce crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it would be easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc19.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes NVIDIA/nccl#649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (NVIDIA/nccl#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) are made
before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes NVIDIA/nccl#726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change history comments

* resolve the build related issue

* enable MSCCL when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix  the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the LL128 shared memory issue and the one-process-with-multiple-GPUs issue

* remove the unnecessary code to enable CUDA graphs and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the GPU count

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if NCCL has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix IB bandwidth test case issue

* fix the fence, proxy, and sync flag setting related issues

* unified the topo file name for different sku

* add VMSS creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change the maxbytes to a smaller value like 65565 for the multi-node scenario

* remove test related asset

* update the readme

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jul 10, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets liked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

Current SHM object name would only use pidHash and ranks as
identification, which would collide with each other when a program runs with
multiple communicators. Here we added commId info into pidHash, so that the
'pidHash'es of different communicators kept in the same process will be
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
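
For illustration only, a minimal sketch of how the ncclRedOpCreatePreMulSum API mentioned above can be used; the helper function and its error handling are assumptions, not code from this repository.

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Hypothetical helper: all-reduce where each rank's input is pre-multiplied
 * by a rank-specific scalar before the inter-rank summation. */
ncclResult_t scaledAllReduce(ncclComm_t comm, const float* sendbuf, float* recvbuf,
                             size_t count, float scale, cudaStream_t stream) {
  ncclRedOp_t premulSum;
  /* The scalar lives in host memory and is captured immediately. */
  ncclResult_t res = ncclRedOpCreatePreMulSum(&premulSum, &scale, ncclFloat,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, premulSum, comm, stream);
  ncclRedOpDestroy(premulSum, comm);  /* created ops must be destroyed on the same comm */
  return res;
}
```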

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649
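
A generic sketch of the pattern (not the NCCL code itself): a thread that will never be joined is detached right after creation so ThreadSanitizer does not flag it at exit.

```c
#include <pthread.h>

/* Hypothetical example; serviceLoop stands in for any fire-and-forget thread. */
static void* serviceLoop(void* arg) { (void)arg; /* ... long-running work ... */ return NULL; }

static void startService(void) {
  pthread_t t;
  if (pthread_create(&t, NULL, serviceLoop, NULL) == 0) {
    pthread_detach(t);  /* we never pthread_join() this thread */
  }
}
```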

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
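
A small, illustrative sketch of combining the new ncclGetLastError() with the existing result-code strings; the wrapper function is an assumption, not part of NCCL.

```c
#include <stdio.h>
#include <nccl.h>

/* Hypothetical helper: print both the result-code string and the last
 * human-readable NCCL error message for a communicator. */
static void reportNcclFailure(ncclResult_t res, ncclComm_t comm) {
  if (res != ncclSuccess) {
    fprintf(stderr, "NCCL call failed: %s (last error: %s)\n",
            ncclGetErrorString(res), ncclGetLastError(comm));
  }
}
```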

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
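
An illustrative sketch of the non-blocking init path and config struct introduced above, assuming the unique id, rank and nranks were already exchanged out of band; this is not code from the NCCL repository.

```c
#include <nccl.h>

ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;                       /* opt into non-blocking mode */
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  do {                                       /* poll until init completes in the background */
    ncclCommGetAsyncError(*comm, &res);
  } while (res == ncclInProgress);
  return res;  /* later, tear down with ncclCommFinalize() then ncclCommDestroy() */
}
```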

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) are made
before the last bootstrapBarrier. That way, we avoid cudaMalloc calls
being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes, whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
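
As an illustration of the new per-communicator config options listed above (the values are arbitrary examples, not tuning advice):

```c
#include <nccl.h>

/* Illustrative values only; pass &config to ncclCommInitRankConfig() as usual. */
static void exampleConfig(void) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.cgaClusterSize = 4;     /* CUDA thread-block cluster size (Hopper) */
  config.minCTAs = 4;            /* lower bound on CTAs used per kernel */
  config.maxCTAs = 16;           /* upper bound on CTAs used per kernel */
  config.netName = "Socket";     /* select a network backend by name */
  (void)config;
}
```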

* porting from rccl: Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl: Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable MSCCL when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the LL128 shared memory issue and the one-process-with-multiple-GPUs issue

* remove the unnecessary code to enable CUDA graphs and add perf test scenarios

* fix fp8 issue

* fix the max/min mismatch issue and the LL128 shared memory issue

* turn off ops when the op index is equal to or larger than the average

* optimize the test script to accommodate the number of GPUs

* support perf test env initialization

* fix environment preparation bug

* fix environment preparation bug

* enable auto build if NCCL has not been built before

* add customized topo and graph file support during tests

* enable NDv5 test scenarios

* enable CUDA graphs for multiple nodes

* initiate the multi-node test

* enable NCv4 compatibility

* fix multi-node test issue

* enable multi-node for the NDv4 test scenario

* fix IB bandwidth test case issue

* fix issues related to setting the fence, proxy, and sync flags

* unify the topo file name for different SKUs

* add VMSS creation script

* fix test case issue for the multi-node scenario

* change the algo for the multi-node scenario

* change maxbytes to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#2)

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg-faulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
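
A generic sketch of the retry idea; the macro name is hypothetical and this is not the actual SYSCHECK definition.

```c
#include <errno.h>

/* Retry a syscall-style expression while it fails with EINTR or EAGAIN.
 * Usage: int n; RETRY_SYSCALL(n, read(fd, buf, len)); */
#define RETRY_SYSCALL(ret, call) do {                                   \
    (ret) = (call);                                                     \
  } while ((ret) == -1 && (errno == EINTR || errno == EAGAIN))
```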

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators living in the same process are
kept distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
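
A sketch of the conditional use described above (illustrative only; the actual NCCL code differs):

```c
#include <sys/socket.h>

/* Only set SO_REUSEPORT when the headers define it (Linux 3.9+); on older
 * systems the code still compiles and simply skips the option. */
static void maybeSetReusePort(int fd) {
  (void)fd;
#if defined(SO_REUSEPORT)
  int opt = 1;
  setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#endif
}
```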

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort when a Flush operation fails in the
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issues in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when the class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make could kick off source
compilation before nccl.h was generated, which led to occasional build
failures on systems with high core counts. The build failure could be
reproduced reliably with a `sleep 5` in the $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Set the type when a GPU sub-node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) are made
before the last bootstrapBarrier. That way, we avoid cudaMalloc calls
being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes, whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl: Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl: Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable MSCCL when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the LL128 shared memory issue and the one-process-with-multiple-GPUs issue

* remove the unnecessary code to enable CUDA graphs and add perf test scenarios

* fix fp8 issue

* fix the max/min mismatch issue and the LL128 shared memory issue

* turn off ops when the op index is equal to or larger than the average

* optimize the test script to accommodate the number of GPUs

* support perf test env initialization

* fix environment preparation bug

* fix environment preparation bug

* enable auto build if NCCL has not been built before

* add customized topo and graph file support during tests

* enable NDv5 test scenarios

* enable CUDA graphs for multiple nodes

* initiate the multi-node test

* enable NCv4 compatibility

* fix multi-node test issue

* enable multi-node for the NDv4 test scenario

* fix IB bandwidth test case issue

* fix issues related to setting the fence, proxy, and sync flags

* unify the topo file name for different SKUs

* add VMSS creation script

* fix test case issue for the multi-node scenario

* change the algo for the multi-node scenario

* change maxbytes to a smaller value like 65565 for the multi-node scenario

* remove test-related assets

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#4)

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg-faulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators living in the same process are
kept distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add N…
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jul 10, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg-faulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.



* Fix share memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators living in the same process are
kept distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).
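
The general fix pattern, sketched here with illustrative names rather than the real ncclStrToCpuset code, is to copy the incoming bytes into a bounded buffer and write the terminator before any strlen()/parsing happens:

```c
#include <stdio.h>
#include <string.h>

// Copy a possibly unterminated source into a bounded buffer and guarantee the
// terminator, so later strlen()/strtok() calls cannot read out of bounds.
static void copyAffinityString(char* dst, size_t dstLen, const char* src, size_t srcLen) {
  size_t n = srcLen < dstLen - 1 ? srcLen : dstLen - 1;
  memcpy(dst, src, n);
  dst[n] = '\0';
}

int main(void) {
  char raw[4] = {'0', '-', '3', 'X'};   // deliberately not null-terminated
  char affinityStr[64];
  copyAffinityString(affinityStr, sizeof(affinityStr), raw, 3);
  printf("cpuset string: \"%s\" (len %zu)\n", affinityStr, strlen(affinityStr));
  return 0;
}
```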



* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
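
A sketch of the compile-time guard this describes, with an illustrative function name; the setsockopt() call is only compiled in when the headers define SO_REUSEPORT:

```c
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

// Only request port reuse when the platform defines SO_REUSEPORT (Linux 3.9+).
static int openBootstrapSocket(void) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  int one = 1;
#if defined(SO_REUSEPORT)
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) != 0)
    perror("setsockopt(SO_REUSEPORT)");
#else
  (void)one;  // older headers: skip the option entirely
#endif
  return fd;
}

int main(void) {
  int fd = openBootstrapSocket();
  if (fd >= 0) close(fd);
  return fd >= 0 ? 0 : 1;
}
```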

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.
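
Putting the two clang fixes together, one way to wrap the compiler difference in a macro (a sketch, not NCCL's exact macro name) is:

```c
#include <stdio.h>

// GCC spells the per-function "don't optimize" attribute optimize("O0"),
// clang spells it optnone; wrap the difference once.
#if defined(__clang__)
#define NO_OPTIMIZE __attribute__((optnone))
#else
#define NO_OPTIMIZE __attribute__((optimize("O0")))
#endif

NO_OPTIMIZE static int keepAsWritten(int x) {
  // Body stays un-optimized, which can matter for timing loops or debugging.
  int acc = 0;
  for (int i = 0; i < x; i++) acc += i;
  return acc;
}

int main(void) { printf("%d\n", keepAsWritten(10)); return 0; }
```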

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.



* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use, causing
performance issues in cases where we use different NICs on a given
channel.

* Fix wrong variable name: rename "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors



* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum; a usage sketch follows this list).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
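
A usage sketch for ncclRedOpCreatePreMulSum, assuming a communicator and stream already exist; scaling by 1/nranks while summing turns the allreduce into an average:

```c
#include <cuda_runtime.h>
#include <nccl.h>

// Sketch: scale local contributions by 1/nranks while summing. Assumes
// `comm`, `stream`, `sendbuf`, `recvbuf` and `count` were set up elsewhere.
ncclResult_t allreduceAverage(const float* sendbuf, float* recvbuf, size_t count,
                              ncclComm_t comm, cudaStream_t stream) {
  int nranks;
  ncclResult_t ret = ncclCommCount(comm, &nranks);
  if (ret != ncclSuccess) return ret;

  float scale = 1.0f / (float)nranks;
  ncclRedOp_t premulSum;
  ret = ncclRedOpCreatePreMulSum(&premulSum, &scale, ncclFloat, ncclScalarHostImmediate, comm);
  if (ret != ncclSuccess) return ret;

  ret = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, premulSum, comm, stream);

  // The op may be destroyed once the last collective using it has been enqueued.
  ncclRedOpDestroy(premulSum, comm);
  return ret;
}
```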

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
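
For illustration, a stack-only conversion can simply fold the hex digits of a busId string such as "0000:3b:00.0" into a 64-bit value while skipping the separators (sketch with a hypothetical name):

```c
#include <stdint.h>
#include <stdio.h>

// Pack a PCI busId string into a 64-bit integer by treating its hex digits as
// nibbles and skipping ':' and '.'. No heap use, so it stays cheap even when
// called from hot debug/logging paths.
static int64_t busIdToInt64Sketch(const char* busId) {
  int64_t value = 0;
  for (const char* p = busId; *p != '\0'; p++) {
    char c = *p;
    int nibble;
    if (c >= '0' && c <= '9') nibble = c - '0';
    else if (c >= 'a' && c <= 'f') nibble = 10 + (c - 'a');
    else if (c >= 'A' && c <= 'F') nibble = 10 + (c - 'A');
    else continue;  // skip separators
    value = (value << 4) | nibble;
  }
  return value;
}

int main(void) {
  printf("%llx\n", (unsigned long long)busIdToInt64Sketch("0000:3b:00.0"));
  return 0;
}
```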

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Setting this env to 1 disables the NET transport for intra-node communication.
It provides an option to error out instead of falling back to NET when superior
intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach() calls for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649
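
A minimal example of the pattern (not NCCL's actual service threads): detach a fire-and-forget thread right after creating it so its resources are reclaimed without a join and sanitizers stop reporting it:

```c
#include <pthread.h>
#include <unistd.h>

// A background thread we never intend to join: detach it immediately so its
// resources are reclaimed automatically when it exits.
static void* serviceLoop(void* arg) {
  (void)arg;
  // ... poll sockets, progress proxies, etc. ...
  return NULL;
}

int main(void) {
  pthread_t tid;
  if (pthread_create(&tid, NULL, serviceLoop, NULL) != 0) return 1;
  pthread_detach(tid);   // instead of a pthread_join() that never happens
  sleep(1);              // give the detached thread a moment before exiting
  return 0;
}
```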

* Remove unnecessary newline in plugin logging



* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function (usage sketch after this list).
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
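
A usage sketch for ncclGetLastError(), wrapping an NCCL call in a hypothetical checking macro so failures print both the generic error string and the more detailed last-error text:

```c
#include <cuda_runtime.h>
#include <nccl.h>
#include <stdio.h>

// On failure, print the generic error string plus the detailed message from
// ncclGetLastError() (added in 2.13). Macro name is illustrative.
#define NCCL_CHECK_SKETCH(comm, call) do {                                   \
    ncclResult_t res = (call);                                               \
    if (res != ncclSuccess) {                                                \
      fprintf(stderr, "NCCL failure: %s\nDetails: %s\n",                     \
              ncclGetErrorString(res), ncclGetLastError(comm));              \
      return res;                                                            \
    }                                                                        \
  } while (0)

ncclResult_t broadcastFromRoot(float* buf, size_t count, int root,
                               ncclComm_t comm, cudaStream_t stream) {
  NCCL_CHECK_SKETCH(comm, ncclBroadcast(buf, buf, count, ncclFloat, root, comm, stream));
  return ncclSuccess;
}
```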

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE has not worked properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function (sketch after this list).
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
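
A sketch of the non-blocking initialization path, assuming the unique id, rank and size come from the usual bootstrap (e.g. an MPI broadcast of ncclGetUniqueId); the helper name is illustrative:

```c
#include <nccl.h>

// Create a communicator in non-blocking mode and poll for completion, as
// allowed by the config-based init added in 2.14.
ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  // return immediately, finish initialization asynchronously

  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;

  // Poll until the communicator is ready (or an error surfaces).
  ncclResult_t state = ncclInProgress;
  while (state == ncclInProgress) {
    ncclCommGetAsyncError(*comm, &state);
  }
  return state;
}
```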

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen
before the last bootstrapBarrier. That way, we avoid cudaMalloc calls
being blocked by an NCCL kernel launched on another GPU by another
thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.



* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl: Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl: Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the 1-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the number of GPUs

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl was not built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple nodes

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix the fence, proxy, and sync flags setting related issues

* unify the topo file name for different SKUs

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change maxbyte to a smaller value like 65565 for the multi node scenario

---------

* Enable msccl capability (#2)

* Enable msccl capability (#4)

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jul 21, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets liked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
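
A minimal sketch of the retry idea, assuming the usual errno convention (the macro name is illustrative, not NCCL's actual SYSCHECK/SYSCHECKVAL): wrap the call in a loop so an EINTR/EAGAIN return is retried instead of silently becoming a hard error.

```c
#include <errno.h>

/* Illustrative only: retry the wrapped call while it fails with EINTR or
 * EAGAIN, so transient interruptions are not reported as failures.
 * Usage: ssize_t n; RETRY_SYSCALL(read(fd, buf, len), n); */
#define RETRY_SYSCALL(statement, retval) do {                         \
    (retval) = (statement);                                           \
  } while ((retval) == -1 && (errno == EINTR || errno == EAGAIN))
```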

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
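
For illustration, a hedged sketch of what such wrappers can look like (the function names are hypothetical): posix_fallocate() returns the error number directly instead of setting errno, and mmap() signals failure with MAP_FAILED, so both are adapted to the usual "-1 and errno" convention expected by SYSCHECK-style macros.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>

/* Hypothetical wrapper: posix_fallocate() returns its error code directly. */
static int fallocateWrap(int fd, off_t offset, off_t len) {
  int err = posix_fallocate(fd, offset, len);
  if (err != 0) { errno = err; return -1; }
  return 0;
}

/* Hypothetical wrapper: mmap() reports failure through MAP_FAILED. */
static int mmapWrap(void** ptr, size_t size, int prot, int flags, int fd, off_t off) {
  void* p = mmap(NULL, size, prot, flags, fd, off);
  if (p == MAP_FAILED) return -1;   /* errno is already set by mmap */
  *ptr = p;
  return 0;
}
```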

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. We now add commId info into pidHash, so that
the pidHashes of different communicators living in the same process
remain distinct.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>
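
As an illustration of the defensive pattern (a sketch, not the actual patch): copy the string into a bounded buffer and null-terminate it before any strlen()/parsing, so a missing terminator in the source cannot cause an out-of-bounds read.

```c
#include <string.h>

/* Sketch only; buffer size and names are illustrative. */
static void copyAffinityString(const char* in, char* out, size_t outSize) {
  strncpy(out, in, outSize - 1);
  out[outSize - 1] = '\0';   /* guarantee termination before strlen()/parsing */
}
```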

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
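
A minimal sketch of the conditional (the helper name is illustrative): only reference SO_REUSEPORT when the headers define it, so the same code still builds against pre-3.9 kernels.

```c
#include <sys/socket.h>

/* Illustrative helper: set SO_REUSEPORT when available, otherwise skip it
 * (it is only needed when a fixed bootstrap address is specified via env). */
static int trySetReusePort(int fd) {
#if defined(SO_REUSEPORT)
  int opt = 1;
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#else
  (void)fd;
  return 0;
#endif
}
```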

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.
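
Putting the two clang fixes together, a sketch of the macro approach (the macro name is illustrative, not NCCL's): clang spells the attribute `optnone`, while GCC uses `optimize("O0")`.

```c
/* Illustrative macro hiding the compiler difference. */
#if defined(__clang__)
#define NOOPT_ATTR __attribute__((optnone))
#else
#define NOOPT_ATTR __attribute__((optimize("O0")))
#endif

/* Example: keep this function unoptimized under both compilers. */
NOOPT_ATTR static int addOne(int x) { return x + 1; }
```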

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issues in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when the class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h is generated, which leads to occasional build
failures on systems with high core counts. The build failure could be
reproduced reliably with a `sleep 5` in the $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum); a usage sketch follows this list.
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
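
A minimal usage sketch for ncclRedOpCreatePreMulSum, assuming NCCL >= 2.11 (the helper function is illustrative): create the custom op, use it like any built-in reduction, then release it.

```c
#include <nccl.h>

/* Illustrative helper: allreduce where each rank's input is first multiplied
 * by a rank-specific scalar, then summed across ranks. */
static ncclResult_t premulAllReduce(const float* sendbuf, float* recvbuf, size_t count,
                                    float scale, ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t op;
  ncclResult_t res = ncclRedOpCreatePreMulSum(&op, &scale, ncclFloat,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);   /* custom ops are per-communicator and must be released */
  return res;
}
```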

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
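
A sketch of a stack-only conversion (illustrative, not the exact NCCL code): fold the hex digits of a PCI bus id such as "0000:07:00.0" directly into a 64-bit value, skipping the separators, with no heap allocation.

```c
#include <stdint.h>

/* Illustrative stack-only variant: no strdup/malloc, just digit folding. */
static int64_t busIdToInt64Sketch(const char* busId) {
  int64_t v = 0;
  for (const char* p = busId; *p; p++) {
    char c = *p;
    if (c >= '0' && c <= '9')      v = (v << 4) | (int64_t)(c - '0');
    else if (c >= 'a' && c <= 'f') v = (v << 4) | (int64_t)(c - 'a' + 10);
    else if (c >= 'A' && c <= 'F') v = (v << 4) | (int64_t)(c - 'A' + 10);
    /* ':' and '.' separators are simply skipped */
  }
  return v;
}
```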

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Setting this env to 1 disables the NET transport for intra-node communication.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function (a usage sketch follows this list).
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
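
A minimal sketch of using the new ncclGetLastError() call, assuming NCCL >= 2.13 (the helper is illustrative): when a call fails, fetch the human-readable description of the last error recorded by NCCL.

```c
#include <stdio.h>
#include <nccl.h>

/* Illustrative helper: run an allreduce and print NCCL's last-error string on failure. */
static ncclResult_t allReduceOrReport(const void* sendbuf, void* recvbuf, size_t count,
                                      ncclComm_t comm, cudaStream_t stream) {
  ncclResult_t res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
  if (res != ncclSuccess)
    fprintf(stderr, "NCCL failure %d: %s\n", (int)res, ncclGetLastError(comm));
  return res;
}
```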

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function (a sketch follows this list).
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
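
A minimal sketch of the non-blocking init path, assuming NCCL >= 2.14 (the helper and the polling loop are illustrative): init returns immediately and completion is polled through ncclCommGetAsyncError().

```c
#include <nccl.h>

/* Illustrative helper: create a communicator in non-blocking mode and wait for
 * the asynchronous initialization to finish. A real application could do other
 * work, or call ncclCommAbort(), instead of spinning here. */
static ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, int rank, ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;                       /* do not block inside init */
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  do {                                       /* poll until init completes or fails */
    ncclCommGetAsyncError(*comm, &res);
  } while (res == ncclInProgress);
  return res;
}
```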

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName (a sketch follows this list).
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
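
A minimal sketch of the new config options, assuming NCCL >= 2.17 (the helper and the values are purely illustrative, not recommendations):

```c
#include <nccl.h>

/* Illustrative helper: create a communicator with the 2.17 tuning knobs set. */
static ncclResult_t initTuned(ncclComm_t* comm, int nranks, int rank, ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.cgaClusterSize = 4;    /* CGA cluster size used by NCCL kernels */
  config.minCTAs = 4;           /* lower bound on CTAs per NCCL kernel */
  config.maxCTAs = 32;          /* upper bound on CTAs per NCCL kernel */
  config.netName = "IB";        /* pick a network backend by name (like NCCL_NET) */
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```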

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix  the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the single-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the ops index is equal to or larger than avg

* optimize the test script to accommodate the gpu numbers

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl was not built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix the fence, proxy, and sync flags setting related issues

* unified the topo file name for different sku

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change the maxbytes to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#2)

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets liked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery (see the sketch below).
Detect initial CPU affinity and no longer escape it.
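
A minimal sketch of how an application could use the two new calls, assuming a hypothetical checkCommHealth() helper; real recovery logic would also recreate the communicator afterwards.

```c
#include <nccl.h>

/* Sketch: poll for asynchronous (e.g. network) errors and abort the
 * communicator so its resources can be freed and it can be recreated. */
static int checkCommHealth(ncclComm_t comm) {
  ncclResult_t asyncErr;
  if (ncclCommGetAsyncError(comm, &asyncErr) != ncclSuccess) return -1;
  if (asyncErr != ncclSuccess) {
    ncclCommAbort(comm);   /* frees resources without waiting for peers */
    return -1;             /* caller should recreate the communicator   */
  }
  return 0;
}
```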

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as identification,
which could collide when a program runs with multiple communicators. We now
add commId info into pidHash, so the 'pidHash'es of different communicators
within the same process are distinct from each other.
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jul 26, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as identification,
which could collide when a program runs with multiple communicators. We now
add commId info into pidHash, so the 'pidHash'es of different communicators
within the same process are distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
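
A sketch of the kind of conditional use this refers to, with an illustrative helper name; on systems without SO_REUSEPORT the option is simply skipped.

```c
#include <sys/socket.h>

/* Sketch: only set SO_REUSEPORT when the headers define it, so the code
 * still builds on kernels/distributions older than Linux 3.9. */
static int setReusePortSketch(int fd) {
#ifdef SO_REUSEPORT
  int opt = 1;
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#else
  (void)fd;
  return 0;  /* option unavailable; only needed for a fixed bootstrap port */
#endif
}
```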

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
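
A short usage sketch of ncclRedOpCreatePreMulSum for averaging across ranks; the wrapper function and its arguments are assumptions for illustration only.

```c
#include <nccl.h>

/* Sketch: average across ranks by scaling each rank's input by 1/nranks
 * before the inter-rank summation. */
static ncclResult_t allreduceAverage(const float* sendbuff, float* recvbuff,
                                     size_t count, int nranks,
                                     ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t avgOp;
  float scale = 1.0f / nranks;
  ncclResult_t res = ncclRedOpCreatePreMulSum(&avgOp, &scale, ncclFloat32,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuff, recvbuff, count, ncclFloat32, avgOp,
                      comm, stream);
  ncclRedOpDestroy(avgOp, comm);  /* safe once the operation is enqueued */
  return res;
}
```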

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
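
Purely to illustrate the point about avoiding heap allocations, a stack-only conversion of a PCI busId string such as "0000:3b:00.0" could look like the sketch below (not the actual NCCL code).

```c
#include <stdint.h>

/* Sketch: pack the hex digits of a busId like "0000:3b:00.0" into an
 * int64_t, skipping separators, using only stack storage. */
static int64_t busIdToInt64Sketch(const char* busId) {
  int64_t v = 0;
  for (const char* p = busId; *p != '\0'; p++) {
    char c = *p;
    if (c == ':' || c == '.') continue;            /* skip separators */
    int d;
    if (c >= '0' && c <= '9') d = c - '0';
    else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
    else break;                                    /* stop at unexpected char */
    v = (v << 4) | d;
  }
  return v;
}
```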

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
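
Regarding the new ncclGetLastError() above, a small sketch of how an application-side checking macro might surface the detailed message; the macro name is illustrative and the enclosing function is assumed to return ncclResult_t.

```c
#include <stdio.h>
#include <nccl.h>

/* Sketch: print both the generic result string and the more detailed
 * last-error text added in 2.13, then propagate the error. */
#define CHECK_NCCL(comm, call) do {                                  \
  ncclResult_t res_ = (call);                                        \
  if (res_ != ncclSuccess) {                                         \
    fprintf(stderr, "NCCL failure %s: %s\n",                         \
            ncclGetErrorString(res_), ncclGetLastError(comm));       \
    return res_;                                                     \
  }                                                                  \
} while (0)
```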

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
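
A rough sketch of the non-blocking initialization flow mentioned above, assuming `nranks`, `rank` and a bootstrapped `id` are already available to the caller.

```c
#include <nccl.h>

/* Sketch: initialize a communicator in non-blocking mode and poll for
 * completion with ncclCommGetAsyncError(). */
static ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, int rank,
                                    ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  /* return immediately, finish init asynchronously */
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  do {
    ncclCommGetAsyncError(*comm, &res);
  } while (res == ncclInProgress);
  return res;
}
```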

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen before
the last bootstrapBarrier. That way, we avoid cudaMalloc calls being
blocked by an NCCL kernel launched on another GPU by another thread that
completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
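
A sketch of passing the new config options at communicator creation; the field values below are arbitrary examples, not recommendations.

```c
#include <nccl.h>

/* Sketch: request specific CTA and CGA settings plus a network name via
 * the communicator config (fields added in 2.17). */
static ncclResult_t initWithTuning(ncclComm_t* comm, int nranks, int rank,
                                   ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.cgaClusterSize = 4;   /* example value */
  config.minCTAs = 8;          /* example value */
  config.maxCTAs = 32;         /* example value */
  config.netName = "Socket";   /* example: force the socket transport */
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```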

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change history comments

* resolve the build related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in simple protocol

* fix the ll128 shared memory issue and the 1-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the gpu numbers

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl was not built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix the fence, proxy and sync flags setting related issues

* unified the topo file name for different sku

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change the maxbyte to a smaller value like 65565 for multi node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#5)

* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as identification,
which could collide when a program runs with multiple communicators. We now
add commId info into pidHash, so the 'pidHash'es of different communicators
within the same process are distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen before
the last bootstrapBarrier. That way, we avoid cudaMalloc calls being
blocked by an NCCL kernel launched on another GPU by another thread that
completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change history comments

* resolve the build related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in simple protocol

* fix the ll128 shared memory issue and the 1-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the gpu numbers

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl was not built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix the fence, proxy and sync flags setting related issues

* unified the topo file name for different sku

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change the maxbyte to a smaller value like 65565 for multi node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#2)

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.
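
A minimal sketch (illustrative, not part of this change) of how an application can combine the two new calls to detect an asynchronous network error and abort the communicator instead of hanging; the helper name and error-handling policy are assumptions:

```c
#include <stdio.h>
#include <nccl.h>

/* Poll a communicator for an asynchronous (e.g. network) error and abort it
 * rather than letting pending collectives hang. Returns 0 if healthy. */
static int pollCommForErrors(ncclComm_t comm) {
  ncclResult_t asyncErr;
  if (ncclCommGetAsyncError(comm, &asyncErr) != ncclSuccess) return -1;
  if (asyncErr != ncclSuccess) {
    fprintf(stderr, "NCCL async error: %s\n", ncclGetErrorString(asyncErr));
    ncclCommAbort(comm);  /* frees resources without waiting for pending operations */
    return -1;
  }
  return 0;
}
```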

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with multiple
communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators kept in the same process are
distinct from each other.
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jul 27, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with multiple
communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.
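
As a hedged illustration of the bfloat16 and ncclAvg additions above, an averaging allreduce might look like this (buffers, communicator and stream are assumed to be set up elsewhere):

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Averaging allreduce on bfloat16 data: ncclAvg sums across ranks and then
 * divides the result by the number of ranks. */
ncclResult_t allreduceAvgBf16(const void* sendbuff, void* recvbuff, size_t count,
                              ncclComm_t comm, cudaStream_t stream) {
  return ncclAllReduce(sendbuff, recvbuff, count, ncclBfloat16, ncclAvg, comm, stream);
}
```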

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element count, or
collective type could induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
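
A hedged sketch of the new reduction-op API: create a PreMulSum op that scales the local contribution before the inter-rank sum, use it in an allreduce, then destroy it. The wrapper function and the choice of host-immediate scalar residence are illustrative, not prescribed by the release:

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Allreduce where each rank's input is first multiplied by `scale`. */
ncclResult_t scaledAllReduce(const float* sendbuff, float* recvbuff, size_t count,
                             float scale, ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t op;
  /* ncclScalarHostImmediate: the scalar is read from host memory at enqueue time. */
  ncclResult_t res = ncclRedOpCreatePreMulSum(&op, &scale, ncclFloat,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);  /* user-created ops must be destroyed on the same comm */
  return res;
}
```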

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
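
A small illustrative sketch of how the new error reporting could be consumed; the reporting helper itself is an assumption, not part of the release:

```c
#include <stdio.h>
#include <nccl.h>

/* Print the generic error string plus the detailed message from
 * ncclGetLastError(), and flag remote (network/peer) failures. */
void reportNcclFailure(ncclComm_t comm, ncclResult_t res) {
  if (res == ncclSuccess) return;
  fprintf(stderr, "NCCL failure: %s\n", ncclGetErrorString(res));
  fprintf(stderr, "Last error detail: %s\n", ncclGetLastError(comm));
  if (res == ncclRemoteError)
    fprintf(stderr, "The error was reported by a remote peer or the network.\n");
}
```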

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior)

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
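
A hedged sketch of the non-blocking initialization path (the bootstrap of commId, rank and nranks is assumed to happen as usual, e.g. via ncclGetUniqueId plus an out-of-band broadcast):

```c
#include <nccl.h>

/* Non-blocking communicator creation: start init, then poll until it leaves
 * the ncclInProgress state. Finalize/destroy later follow the same pattern. */
ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, ncclUniqueId commId, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  /* return immediately instead of waiting for completion */
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, commId, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  do {  /* other work could be interleaved here */
    if (ncclCommGetAsyncError(*comm, &res) != ncclSuccess) return ncclInternalError;
  } while (res == ncclInProgress);
  return res;
}
```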

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls that invoke cudaMalloc (including devCommSetup) are
made before the last bootstrapBarrier. That way, we avoid cudaMalloc
calls being blocked by an NCCL kernel launched on another GPU by
another thread that completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
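
A brief, hedged sketch of filling in the new config options on the ncclConfig_t passed to ncclCommInitRankConfig; the specific values are placeholders, not tuning recommendations:

```c
#include <nccl.h>

/* Build a config using the new 2.17 knobs. */
ncclConfig_t makeTunedConfig(void) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.cgaClusterSize = 4;   /* cooperative group array (CGA) cluster size */
  config.minCTAs = 4;          /* lower bound on CTAs (channels) per kernel */
  config.maxCTAs = 32;         /* upper bound on CTAs per kernel */
  config.netName = "Socket";   /* pin the network, like the NCCL_NET env var */
  return config;
}
```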

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug that failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the single-process multi-GPU issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max,min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the gpu numbers

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if nccl has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple nodes

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for the ndv4 test scenario

* fix ib bandwidth test case issue

* fix issues related to setting the fence, proxy, and sync flags

* unified the topo file name for different skus

* add vmss creation script

* fix test case issue for the multi-node scenario

* change the algo for the multi-node scenario

* change maxbytes to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#5)

* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in the multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with multiple
communicators. Here we add commId info into pidHash, so that the
pidHashes of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9 and later.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element count, or
collective type could induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior)

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls that invoke cudaMalloc (including devCommSetup) are
made before the last bootstrapBarrier. That way, we avoid cudaMalloc
calls being blocked by an NCCL kernel launched on another GPU by
another thread that completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.
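
A sketch of how the new per-communicator config options might be set at creation time; the particular values are placeholders, and only cgaClusterSize, minCTAs, maxCTAs and netName are the fields added here.

```c
/* Passing the new 2.17 config fields at init time (values are placeholders). */
#include <nccl.h>

ncclResult_t initWithTuning(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.cgaClusterSize = 4;   /* cooperative group array cluster size */
  config.minCTAs = 8;          /* lower bound on CTAs per NCCL kernel */
  config.maxCTAs = 32;         /* upper bound on CTAs per NCCL kernel */
  config.netName = "IB";       /* per-communicator equivalent of NCCL_NET=IB */
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```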

* porting from rccl: Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl: Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the single-process multi-GPU issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max/min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the GPU count

* support perf test env initialization

* fix environment preparation bug

* fix environment preparation bug

* enable auto build if nccl has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple nodes

* initiate the multi-node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix issues related to setting the fence, proxy and sync flags

* unify the topo file name across different SKUs

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi-node scenario

* change maxbytes to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#2)

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)
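
A sketch of the idea behind that change: combine the hostname with the UTS and mount namespace links so two containers on the same machine do not hash to the same host. The hash function, buffer sizes and names here are illustrative, not the actual NCCL helper.

```c
/* Host hash that also distinguishes containers via namespace links (sketch). */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static uint64_t getHostHashSketch(void) {
  char buf[1024] = {0};
  size_t off = (gethostname(buf, sizeof(buf) - 1) == 0) ? strlen(buf) : 0;
  /* Append the namespace identifiers, e.g. "uts:[4026531838]". */
  ssize_t n = readlink("/proc/self/ns/uts", buf + off, sizeof(buf) - off - 1);
  if (n > 0) off += (size_t)n;
  n = readlink("/proc/self/ns/mnt", buf + off, sizeof(buf) - off - 1);
  if (n > 0) off += (size_t)n;
  uint64_t h = 0xcbf29ce484222325ULL;                 /* FNV-1a over the combined string */
  for (size_t i = 0; i < off; i++) { h ^= (unsigned char)buf[i]; h *= 0x100000001b3ULL; }
  return h;
}
```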

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
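
A minimal sketch of the retry idea behind these macros, assuming a POSIX call that returns -1 and sets errno; this is illustrative and not the actual SYSCHECK definition.

```c
/* Retry a syscall-style expression while it fails with EINTR or EAGAIN (sketch). */
#include <errno.h>

#define RETRY_SYSCALL(ret, call) do {                         \
    (ret) = (call);                                           \
  } while ((ret) == -1 && (errno == EINTR || errno == EAGAIN))

/* Example: a read() that is restarted on transient failures.
   ssize_t n; RETRY_SYSCALL(n, read(fd, buf, len)); */
```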

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
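
A sketch of wrappers in the spirit described above: posix_fallocate() reports its error as the return value rather than through errno, and mmap() signals failure with MAP_FAILED, so neither fits a macro that only checks -1/errno. Names are illustrative.

```c
/* Normalize posix_fallocate()/mmap() error reporting to the -1/errno convention (sketch). */
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>

static int shmFallocate(int fd, off_t len) {
  int err;
  do { err = posix_fallocate(fd, 0, len); } while (err == EINTR);
  if (err != 0) { errno = err; return -1; }   /* error comes back as the return value */
  return 0;
}

static void* shmMap(int fd, size_t len) {
  void* ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  return (ptr == MAP_FAILED) ? NULL : ptr;    /* mmap reports failure via MAP_FAILED */
}
```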

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with
multiple communicators. Here we added commId info into pidHash, so that
the 'pidHash'es of different communicators kept in the same process are
distinct from each other.
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Aug 29, 2023
* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with
multiple communicators. Here we added commId info into pidHash, so that
the 'pidHash'es of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>
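
A sketch of the general fix pattern for this class of bug: bound the copy and terminate explicitly before any strlen()/parsing. The names are illustrative, not the actual ncclStrToCpuset code.

```c
/* Copy an affinity string into a fixed buffer with guaranteed NUL termination (sketch). */
#include <string.h>

static void copyAffinityStr(char* dst, size_t dstLen, const char* src, size_t srcLen) {
  if (dstLen == 0) return;
  size_t n = (srcLen < dstLen - 1) ? srcLen : dstLen - 1;
  memcpy(dst, src, n);
  dst[n] = '\0';          /* now safe to pass to strlen(3) and parsing code */
}
```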

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization
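
A one-line sketch of the rounding this implies; the constant and helper name are illustrative.

```c
/* Round a buffer size up to the next multiple of 2MB (sketch). */
#include <stddef.h>

#define IPC_ALIGN (2UL * 1024 * 1024)
static size_t roundUpTo2MB(size_t bytes) {
  return (bytes + IPC_ALIGN - 1) & ~(IPC_ALIGN - 1);
}
```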

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
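
A sketch of the compile-time guard this change implies, with the socket setup reduced to the relevant call; the function and variable names are illustrative.

```c
/* Only request port reuse when the platform defines SO_REUSEPORT (sketch). */
#include <sys/socket.h>

static int enablePortReuse(int fd) {
  int opt = 1;
#ifdef SO_REUSEPORT
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#else
  (void)fd; (void)opt;   /* older kernels/headers: silently skip */
  return 0;
#endif
}
```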

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
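
A sketch of the ncclRedOpCreatePreMulSum API mentioned above, scaling each rank's contribution by a host-side scalar before the inter-rank sum; buffer setup and error checking are omitted and the scalar value is just an example.

```c
/* Premultiplied sum: each rank's input is scaled by `scale` before summation (sketch). */
#include <nccl.h>

void scaledAllReduce(const float* sendbuff, float* recvbuff, size_t count,
                     float scale, ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t op;
  /* ncclScalarHostImmediate: the scalar is read from host memory at call time. */
  ncclRedOpCreatePreMulSum(&op, &scale, ncclFloat, ncclScalarHostImmediate, comm);
  ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);  /* user-created reduction ops must be released */
}
```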

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
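
A sketch of a heap-free conversion in the spirit of that change, parsing the hex digits of a PCI bus id such as "0000:65:00.0" directly on the stack; this is illustrative, not the exact NCCL helper.

```c
/* Convert a PCI bus id string to an integer without allocating (sketch). */
#include <ctype.h>
#include <stdint.h>

static int64_t busIdToInt64Sketch(const char* busId) {
  int64_t value = 0;
  for (const char* p = busId; *p; p++) {
    char c = (char)tolower((unsigned char)*p);
    if (c >= '0' && c <= '9')      value = (value << 4) | (c - '0');
    else if (c >= 'a' && c <= 'f') value = (value << 4) | (c - 'a' + 10);
    /* separators ':' and '.' are simply skipped */
  }
  return value;
}
```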

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves parsing `tempNcclDebugLevel` before processing `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior)

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc be blocked by a NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl: Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl: Add MSCCL Support #694

* remove unnecessary change-history comments

* resolve the build-related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for LL proto

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the ll128 shared memory issue and the single-process multi-GPU issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix max/min mismatch issue and ll128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the GPU count

* support perf test env initialization

* fix environment preparation bug

* fix environment preparation bug

* enable auto build if nccl has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple nodes

* initiate the multi-node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix ib bandwidth test case issue

* fix issues related to setting the fence, proxy and sync flags

* unify the topo file name across different SKUs

* add vmss creation script

* fix test case issue for multi-node scenario

* change the algo for multi-node scenario

* change maxbytes to a smaller value like 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#5)

* Enable msccl capability (#1)

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix shared memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with
multiple communicators. Here we added commId info into pidHash, so that
the 'pidHash'es of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element count, or
collective type would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
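
A minimal usage sketch of the ncclRedOpCreatePreMulSum API mentioned above (the helper function, buffers and stream are assumptions, not part of this release):

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: build a custom reduction that multiplies each rank's input by
// 1/nranks before the inter-rank summation, i.e. an allreduce average.
// Assumes comm, sendbuf, recvbuf, count, nranks and stream already exist.
ncclResult_t allreduceAverage(ncclComm_t comm, const float* sendbuf, float* recvbuf,
                              size_t count, int nranks, cudaStream_t stream) {
  float scalar = 1.0f / (float)nranks;
  ncclRedOp_t op;
  ncclResult_t res = ncclRedOpCreatePreMulSum(&op, &scalar, ncclFloat,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);  // safe to destroy once the operation is enqueued
  return res;
}
```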

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
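
A minimal sketch of the idea, assuming the usual PCI bus id string format ("0000:3b:00.0"); the function name mirrors the commit message but the body is illustrative only:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Strip ':' and '.' into a small stack buffer and parse the remaining hex
// digits, so no heap allocation shows up under NCCL_DEBUG_SUBSYS=ALLOC even
// when this runs thousands of times.
static int64_t busIdToInt64Sketch(const char* busId) {
  char hex[17];
  int n = 0;
  for (const char* p = busId; *p != '\0' && n < 16; ++p) {
    if (*p == ':' || *p == '.') continue;
    hex[n++] = *p;
  }
  hex[n] = '\0';
  return (int64_t)strtoll(hex, nullptr, 16);
}

int main() {
  printf("0x%llx\n", (unsigned long long)busIdToInt64Sketch("0000:3b:00.0"));
  return 0;
}
```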

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649
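
A small sketch of the pattern (the thread function and names are illustrative, not NCCL's code):

```cpp
#include <pthread.h>
#include <cstdio>

static void* serviceLoop(void*) {
  // long-lived background work would run here
  return nullptr;
}

// Threads that are never pthread_join()'d are detached right after creation,
// so their resources are reclaimed automatically and ThreadSanitizer stops
// reporting them at exit.
int main() {
  pthread_t t;
  if (pthread_create(&t, nullptr, serviceLoop, nullptr) != 0) {
    perror("pthread_create");
    return 1;
  }
  pthread_detach(t);
  return 0;
}
```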

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
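
For the new ncclGetLastError() entry above, a minimal error-reporting sketch (the wrapper function is an assumption; ncclGetLastError and ncclGetErrorString are the actual API):

```cpp
#include <nccl.h>
#include <cstdio>

// Sketch: after a failing NCCL call, combine the generic error string with
// the more detailed last-error message introduced in this release.
void reportNcclFailure(ncclComm_t comm, ncclResult_t res) {
  if (res == ncclSuccess) return;
  fprintf(stderr, "NCCL failure: %s\nlast error: %s\n",
          ncclGetErrorString(res), ncclGetLastError(comm));
}

int main() {
  reportNcclFailure(nullptr, ncclSuccess);  // nothing to report in this toy run
  return 0;
}
```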

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
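
A hedged sketch of the ordering this patch restores (level names and values are simplified stand-ins for NCCL's internals, not its actual code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <strings.h>

enum { LOG_NONE = 0, LOG_VERSION = 1, LOG_WARN = 2, LOG_INFO = 3 };

int main() {
  // Parse NCCL_DEBUG into a temporary level first...
  int tempDebugLevel = LOG_NONE;
  const char* dbg = getenv("NCCL_DEBUG");
  if (dbg != nullptr) {
    if (strcasecmp(dbg, "WARN") == 0) tempDebugLevel = LOG_WARN;
    else if (strcasecmp(dbg, "INFO") == 0) tempDebugLevel = LOG_INFO;
  }
  // ...and only then look at NCCL_DEBUG_FILE, when the level is above VERSION.
  FILE* out = stdout;
  const char* dbgFile = getenv("NCCL_DEBUG_FILE");
  if (tempDebugLevel > LOG_VERSION && dbgFile != nullptr) {
    FILE* f = fopen(dbgFile, "w");
    if (f != nullptr) out = f;
  }
  fprintf(out, "debug level %d\n", tempDebugLevel);
  if (out != stdout) fclose(out);
  return 0;
}
```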

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
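
A minimal sketch of the non-blocking init path listed above (the helper and polling loop are assumptions about typical usage; the config fields and functions are the 2.14 API):

```cpp
#include <nccl.h>

// Sketch: initialize a communicator in non-blocking mode, then poll the
// async state until initialization completes (or fails).
ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, int rank, ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  // return immediately, finish init in the background
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  ncclResult_t state = ncclInProgress;
  while (state == ncclInProgress) {
    res = ncclCommGetAsyncError(*comm, &state);
    if (res != ncclSuccess) return res;
  }
  return state;
}
```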

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid cudaMalloc
calls being blocked by an NCCL kernel launched on another GPU by another
thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* porting from rccl Add MSCCL Support #658

* continue porting for #658, add misc/msccl

* porting from rccl Add MSCCL Support #694

* remove unnecessary change history comments

* resolve the build related issue

* enable msccl when generating the topo info during initialization

* fix nccl BF check issue

* fix the memory alignment issue for the LL protocol

* add fp8 support

* add alltoall interface

* add multi-memory support for fp8

* fix  the test script bug which failed to generate the algo file in certain conditions

* fix the simple protocol thread count issue and improve the test tool with operator support

* fix test script issue

* fix the memory conflict issue in the simple protocol

* fix the LL128 shared memory issue and the one-process-with-multiple-GPUs issue

* remove the unnecessary code to enable cuda graph and add perf test scenarios

* fix fp8 issue

* fix the max/min mismatch issue and the LL128 shared memory issue

* turn off the ops when the op index is equal to or larger than avg

* optimize the test script to accommodate the number of GPUs

* support perf test env initialization

* fix environment prepare bug

* fix environment prepare bug

* enable auto build if NCCL has not been built before

* add customized topo and graph file support during test

* enable ndv5 test scenarios

* enable cuda graph for multiple node

* initiate the multi node test

* enable ncv4 compatibility

* fix multi-node test issue

* enable multi-node for ndv4 test scenario

* fix the IB bandwidth test case issue

* fix the fence, proxy and sync flags setting related issues

* unified the topo file name for different sku

* add VMSS creation script

* fix test case issue for multi-node scenario

* change the algo for multi node scenario

* change maxbytes to a smaller value such as 65565 for the multi-node scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>

* Enable msccl capability (#2)

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)
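
A hedged sketch of the idea (the hash function and names are illustrative, not NCCL's exact implementation): mixing the hostname with the UTS and mount namespace links makes two containers on the same machine produce different host hashes, so they do not wrongly attempt P2P/SHM with each other.

```cpp
#include <cstdint>
#include <cstdio>
#include <unistd.h>

// Simple FNV-1a hash, used here only for illustration.
static uint64_t fnv1a(const char* s, uint64_t h) {
  for (; *s; ++s) { h ^= (unsigned char)*s; h *= 0x100000001b3ULL; }
  return h;
}

static uint64_t getHostHashSketch() {
  uint64_t h = 0xcbf29ce484222325ULL;
  char name[1024] = {0};
  if (gethostname(name, sizeof(name) - 1) == 0) h = fnv1a(name, h);
  const char* links[] = { "/proc/self/ns/uts", "/proc/self/ns/mnt" };
  for (const char* link : links) {
    char target[256] = {0};
    ssize_t n = readlink(link, target, sizeof(target) - 1);
    if (n > 0) { target[n] = '\0'; h = fnv1a(target, h); }
  }
  return h;
}

int main() {
  printf("host hash: %016llx\n", (unsigned long long)getHostHashSketch());
  return 0;
}
```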

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
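
A hedged sketch of the retry pattern described here (the macro name mirrors the commit message but the body is simplified):

```cpp
#include <cerrno>
#include <cstdio>
#include <unistd.h>

// Wrap a syscall so that EINTR/EAGAIN leads to a retry even when the call
// site itself is not inside a loop.
#define SYSCHECKVAL_SKETCH(call, name, retval) do {                     \
    do {                                                                \
      retval = (call);                                                  \
    } while (retval == -1 && (errno == EINTR || errno == EAGAIN));      \
    if (retval == -1) { perror(name); return -1; }                      \
  } while (0)

static int writeAll(int fd, const char* buf, size_t size) {
  size_t off = 0;
  while (off < size) {
    ssize_t n;
    SYSCHECKVAL_SKETCH(write(fd, buf + off, size - off), "write", n);
    off += (size_t)n;
  }
  return 0;
}

int main() { return writeAll(1, "hello\n", 6); }
```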

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

Current SHM object name would only use pidHash and ranks as
identification, which would collide each other when program runs with
multiple communicators. Here we added commId info into pidHash, it makes
'pidHash'es of different communicators keeping in same process will…
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Sep 5, 2023
* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with multiple
communicators. Here we added commId info into pidHash, so the pidHashes
of different communicators within the same process are kept distinct
from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.
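
A sketch of the matching rule this adds (the helper is illustrative): with NCCL_IB_HCA="mlx5_1", prefix matching also selects mlx5_10, mlx5_11, ..., whereas NCCL_IB_HCA="=mlx5_1" selects only the device named exactly mlx5_1.

```cpp
#include <cstring>
#include <cstdio>

static bool matchHca(const char* spec, const char* devName) {
  if (spec[0] == '=') return strcmp(spec + 1, devName) == 0;   // exact match
  return strncmp(spec, devName, strlen(spec)) == 0;            // prefix match
}

int main() {
  printf("%d %d\n", matchHca("mlx5_1", "mlx5_10"), matchHca("=mlx5_1", "mlx5_10"));
  // prints "1 0": the prefix form matches mlx5_10, the exact form does not.
  return 0;
}
```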

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was only introduced in Linux 3.9 and later kernels.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.
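
Putting the two clang fixes above together, a small sketch of the kind of macro wrapper described (the macro name is illustrative):

```cpp
// GCC understands optimize("O0"), clang understands optnone.
#if defined(__clang__)
#define NO_OPTIMIZE __attribute__((optnone))
#else
#define NO_OPTIMIZE __attribute__((optimize("O0")))
#endif

NO_OPTIMIZE static int busyWait(int n) {
  int x = 0;
  for (int i = 0; i < n; ++i) x += i;  // kept as-is even at -O2
  return x;
}

int main() { return busyWait(10) == 45 ? 0 : 1; }
```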

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when the class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element count, or
collective type would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid cudaMalloc
calls being blocked by an NCCL kernel launched on another GPU by another
thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose
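
A sketch of the pattern named by these two commits (the helper is illustrative, not NCCL's code): shutdown() actively signals end-of-stream to the peer before the descriptor is closed, so a peer blocked in recv() returns instead of waiting.

```cpp
#include <sys/socket.h>
#include <unistd.h>

static void closeSocketGracefully(int fd) {
  if (fd < 0) return;
  shutdown(fd, SHUT_RDWR);  // tell the peer no more data will flow
  close(fd);
}

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  closeSocketGracefully(fd);
  return 0;
}
```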

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
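
The new ncclCommSplit primitive listed above, as a minimal fragment (the helper, the color choice and the use of splitShare are assumptions about typical usage; an existing parent communicator and its rank are assumed):

```cpp
#include <nccl.h>

// Sketch: split the parent communicator into two halves by color, keeping
// rank ordering inside each half via the key, and opt into resource sharing.
ncclResult_t splitInHalves(ncclComm_t parent, int rank, ncclComm_t* sub) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.splitShare = 1;  // share resources with the parent communicator
  int color = rank % 2;   // even ranks in one sub-communicator, odd in the other
  return ncclCommSplit(parent, color, rank, sub, &config);
}
```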

* initial checkin

* fix the build issue when the cuda version is larger than 12

* enable ncv4 test scenarios

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* modify the test script to support training and inference test scenarios

* Prevent WR index truncation in the InfiniBand transport plugin

* Initial commit

* CODE_OF_CONDUCT.md committed

* LICENSE committed

* SECURITY.md committed

* README.md committed

* SUPPORT.md committed

* fix build break of previous FI

* remove test related assert

* Create msccl-algorithms folder by default

* Create msccl-algorithms folder by default

* enable make install & deb package for msccl

* include string header file for compiler compatibility

* resolve the build capability issue for arch below 800

* fix the logic issue of chunks calculation for cpu proxy

* fix the memory access violation issue when using simple protocol

* fix the work index issue when using cuda graph (#9)

* Msccl v2.18 (#2)

* Fixed bug in MPI initialization.

* Use semantic versioning

* Build SM 5.0 code

* Don't link tests with NVML

* Added Debian packaging files

* Update deb packaging scripts

* fix a typo in README.md

* Fixed deadlock in back-to-back reduce_scatters.

Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added support for more than 8 GPUs.

Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Moved tests to separate dir and improved MPI test

test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Enabled support for char type to be unsigned.

GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added NCCL error checking to tests.

Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Libwrap checks for LIB.so.1 if LIB.so not found

Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed buffer overflow in ReduceOrCopy

Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed useRemoteRecv consistency issue.

Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Updated package version, added manpage

* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name would only use pidHash and ranks as
identification, which would collide when a program runs with multiple
communicators. Here we added commId info into pidHash, so the pidHashes
of different communicators within the same process are kept distinct
from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
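
A sketch of the kind of guard this implies (function name is illustrative); the option is simply skipped when the headers predate SO_REUSEPORT.

```c
#include <sys/socket.h>

static int trySetReusePort(int fd) {
#if defined(SO_REUSEPORT)
  int opt = 1;
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#else
  (void)fd;   /* older kernels/headers: nothing to do */
  return 0;
#endif
}
```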

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.
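
Putting the two clang fixes together, a sketch of how the compiler difference can be wrapped in a macro (the macro name is illustrative): GCC understands optimize("O0"), while clang uses optnone.

```c
#if defined(__clang__)
#define NCCL_NO_OPTIMIZE __attribute__((optnone))
#elif defined(__GNUC__)
#define NCCL_NO_OPTIMIZE __attribute__((optimize("O0")))
#else
#define NCCL_NO_OPTIMIZE
#endif
```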

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.
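
The class of bug being fixed, sketched with illustrative names: computing a byte offset in 32-bit arithmetic truncates once it exceeds 4G, so the offset must be computed and carried in 64 bits.

```c
#include <stddef.h>

static inline char* chunkPtr(char* base, int rank, size_t countPerRank, size_t eltSize) {
  /* Wrong: int offset = rank * countPerRank * eltSize;   // truncates above 4G */
  size_t offset = (size_t)rank * countPerRank * eltSize;   /* 64-bit arithmetic */
  return base + offset;
}
```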

* Check return code for Flush operation

Current NCCL code does not abort when a Flush operation by the
underlying network fails. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issues in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when the class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
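
A minimal sketch of the ncclRedOpCreatePreMulSum usage mentioned above, here turning an allreduce sum into an average by pre-multiplying each contribution with 1/nRanks; buffer, stream and communicator setup are assumed to exist already.

```c
#include <nccl.h>

static ncclResult_t allreduceAverage(const float* sendbuff, float* recvbuff, size_t count,
                                     int nRanks, ncclComm_t comm, cudaStream_t stream) {
  float scalar = 1.0f / (float)nRanks;
  ncclRedOp_t premulSum;
  ncclRedOpCreatePreMulSum(&premulSum, &scalar, ncclFloat, ncclScalarHostImmediate, comm);
  ncclResult_t res = ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, premulSum, comm, stream);
  ncclRedOpDestroy(premulSum, comm);   /* release the user-defined reduction op */
  return res;
}
```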

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
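
A sketch of such a stack-only conversion (not the exact NCCL implementation): parse the hex digits of a busId string like "0000:3b:00.0" directly into an int64, with no temporary heap buffers.

```c
#include <stdint.h>

static int64_t busIdToInt64(const char* busId) {
  int64_t v = 0;
  for (const char* p = busId; *p; p++) {
    char c = *p;
    if (c >= '0' && c <= '9')      v = v * 16 + (c - '0');
    else if (c >= 'a' && c <= 'f') v = v * 16 + (c - 'a' + 10);
    else if (c >= 'A' && c <= 'F') v = v * 16 + (c - 'A' + 10);
    /* ':' and '.' separators are simply skipped */
  }
  return v;
}
```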

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649
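
The pattern, sketched with an illustrative thread function:

```c
#include <pthread.h>
#include <stddef.h>

static void* serviceLoop(void* arg) { (void)arg; return NULL; }  /* placeholder body */

static int startDetachedThread(void) {
  pthread_t t;
  if (pthread_create(&t, NULL, serviceLoop, NULL) != 0) return -1;
  /* This thread is never pthread_join()'ed, so detach it: its resources are
   * reclaimed on exit and ThreadSanitizer no longer flags it. */
  return pthread_detach(t);
}
```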

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
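
A minimal sketch of the new error reporting in use, assuming an already-initialized communicator: query the asynchronous state and print the last error string when a remote failure is detected.

```c
#include <stdio.h>
#include <nccl.h>

static void reportAsyncError(ncclComm_t comm) {
  ncclResult_t asyncErr;
  if (ncclCommGetAsyncError(comm, &asyncErr) != ncclSuccess) return;
  if (asyncErr == ncclRemoteError) {
    /* the failure happened on a remote peer or in the network, not locally */
    fprintf(stderr, "NCCL remote error: %s\n", ncclGetLastError(comm));
  }
}
```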

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
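
A minimal sketch of the non-blocking initialization and finalization path mentioned above, assuming the unique id has already been broadcast to all ranks:

```c
#include <nccl.h>

static ncclResult_t initNonBlocking(ncclComm_t* comm, int nRanks, int rank, ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;                               /* return immediately, poll for completion */
  ncclCommInitRankConfig(comm, nRanks, id, rank, &config);
  ncclResult_t state;
  do {
    ncclCommGetAsyncError(*comm, &state);
  } while (state == ncclInProgress);
  return state;
}

/* Later, tear down gracefully:
 *   ncclCommFinalize(comm);   // complete outstanding operations
 *   ncclCommDestroy(comm);    // free resources
 */
```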

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen
before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes, whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
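
A minimal sketch of the ncclCommSplit primitive mentioned above, here splitting a world communicator into one sub-communicator per node with resource sharing enabled; nodeId and localRank are assumed to come from the application.

```c
#include <nccl.h>

static ncclResult_t splitPerNode(ncclComm_t worldComm, int nodeId, int localRank,
                                 ncclComm_t* nodeComm) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.splitShare = 1;   /* share parent resources where possible */
  /* color groups ranks into sub-communicators; key orders ranks within each group */
  return ncclCommSplit(worldComm, /*color=*/nodeId, /*key=*/localRank, nodeComm, &config);
}
```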

* initial checkin

* fix the build issue when the CUDA version is larger than 12

* enable ncv4 test scenarios

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* modify the test script to support training and inference test scenarios

* Prevent WR index truncation in the InfiniBand transport plugin

* fix build break of previous FI

* remove test related assert

* Create msccl-algorithms folder by default

* fix the memory access violation issue when using simple protocol

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000000.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: root <root@liand-h100-validation-vmss00000B.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000000.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
Co-authored-by: Microsoft Open Source <[email protected]>
Co-authored-by: root <root@liand-h100-validation-vmss00000B.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: root <root@superbench000008.5czzseio4l3u3nxaefzoshwirc.jx.internal.cloudapp.net>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Sep 12, 2023
* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env vars
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll
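
The call affected by this change, in a minimal sketch; the device list can now be passed as a const array.

```c
#include <nccl.h>

static void initFourDevices(ncclComm_t comms[4]) {
  const int devs[4] = {0, 1, 2, 3};        /* const is accepted after this change */
  ncclCommInitAll(comms, 4, devs);
}
```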

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.
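
A sketch of the idea with illustrative names: enable peer access only toward the previous and next device on the ring, so the per-device peer-mapping limit is not exhausted on larger systems.

```c
#include <cuda_runtime.h>

static void enableRingNeighborPeers(const int* ringDevs, int nDev, int idx) {
  int self = ringDevs[idx];
  int prev = ringDevs[(idx + nDev - 1) % nDev];
  int next = ringDevs[(idx + 1) % nDev];
  cudaSetDevice(self);
  if (prev != self) cudaDeviceEnablePeerAccess(prev, 0);
  if (next != self && next != prev) cudaDeviceEnablePeerAccess(next, 0);
}
```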

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)
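
A sketch of the idea (the actual hash NCCL uses differs): concatenate the UTS and mount namespace links and hash them, so that processes sharing both namespaces, and only those, get the same host hash for P2P/SHM decisions.

```c
#include <string.h>
#include <unistd.h>

static unsigned long hostHash(void) {
  char buf[256] = {0};
  char tmp[120];
  ssize_t n = readlink("/proc/self/ns/uts", tmp, sizeof(tmp) - 1);
  if (n > 0) { tmp[n] = '\0'; strncat(buf, tmp, sizeof(buf) - strlen(buf) - 1); }
  n = readlink("/proc/self/ns/mnt", tmp, sizeof(tmp) - 1);
  if (n > 0) { tmp[n] = '\0'; strncat(buf, tmp, sizeof(buf) - strlen(buf) - 1); }
  unsigned long h = 5381;                       /* djb2, for illustration only */
  for (size_t i = 0; buf[i]; i++) h = h * 33 + (unsigned char)buf[i];
  return h;
}
```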

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

Current SHM object names would only use pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. Here we added commId info into pidHash, so the
'pidHash'es of different communicators kept in the same process are
distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort when a Flush operation by the
underlying network fails. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issues in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fall back to default values when the class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls to cudaMalloc (including devCommSetup) happen
before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernels honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes, whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.

* initial checkin

* fix the build issue when the CUDA version is larger than 12

* enable ncv4 test scenarios

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* modify the test script to support training and inference test scenarios

* Prevent WR index truncation in the InfiniBand transport plugin

* Initial commit

* CODE_OF_CONDUCT.md committed

* LICENSE committed

* SECURITY.md committed

* README.md committed

* SUPPORT.md committed

* fix build break of previous FI

* remove test related assert

* Create msccl-algorithms folder by default

* Create msccl-algorithms folder by default

* enable make install & deb package for msccl

* include string header file for compiler compatibility

* resolve the build capability issue for arch below 800

* fix the logic issue of chunks calculation for cpu proxy

* fix the memory access violation issue when using simple protocol

* fix the work index issue when using cuda graph (#9)

* Msccl v2.18 (#2)

* Fixed bug in MPI initialization.

* Use semantic versioning

* Build SM 5.0 code

* Don't link tests with NVML

* Added Debian packaging files

* Update deb packaging scripts

* fix a typo in README.md

* Fixed deadlock in back-to-back reduce_scatters.

Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added support for more than 8 GPUs.

Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Moved tests to separate dir and improved MPI test

test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Enabled support for char type to be unsigned.

GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added NCCL error checking to tests.

Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Libwrap checks for LIB.so.1 if LIB.so not found

Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed buffer overflow in ReduceOrCopy

Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed useRemoteRecv consistency issue.

Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Updated package version, added manpage

* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and seg fault).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env vars
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.
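
A minimal sketch of how the new error-handling API might be used (the polling helper is illustrative; error handling is abbreviated):

```c
/* Illustrative polling helper built on the new API (not NCCL-internal code). */
#include <nccl.h>

ncclResult_t pollCommForErrors(ncclComm_t comm) {
  ncclResult_t asyncErr;
  ncclResult_t ret = ncclCommGetAsyncError(comm, &asyncErr);
  if (ret != ncclSuccess) return ret;
  if (asyncErr != ncclSuccess) {
    ncclCommAbort(comm);   /* tear down the communicator after a network error */
    return asyncErr;
  }
  return ncclSuccess;
}
```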

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. We now add commId info into pidHash, so that
the 'pidHash'es of different communicators living in the same process
are distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
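
A hedged sketch of the conditional-compilation pattern described above (the helper name is illustrative):

```c
/* Illustrative guard: only use SO_REUSEPORT when the build environment knows
 * about it (Linux 3.9+); otherwise compile to a no-op. */
#include <sys/socket.h>

static int setReusePort(int fd) {
#ifdef SO_REUSEPORT
  int opt = 1;
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
#else
  (void)fd;
  return 0;
#endif
}
```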

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing a performance issue in cases where different NICs are
used on a given channel.

* Fix wrong variable name "slice" to "chunk"

https://github.com/NVIDIA/nccl/issues/287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to https://github.com/NVIDIA/nccl/issues/560

ncclGroups containing operations of mixed datatype, element, or collective
type would cause a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
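
A minimal usage sketch of the new pre-multiply-sum reduction, assuming device buffers, count, communicator and stream are set up elsewhere:

```c
/* Illustrative usage of ncclRedOpCreatePreMulSum (not library-internal code). */
#include <nccl.h>

ncclResult_t scaledAllReduce(const float* sendbuff, float* recvbuff, size_t count,
                             float scale, ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t premulSum;
  /* ncclScalarHostImmediate: the scalar is read from host memory at call time. */
  ncclRedOpCreatePreMulSum(&premulSum, &scale, ncclFloat, ncclScalarHostImmediate, comm);
  ncclResult_t ret = ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, premulSum, comm, stream);
  ncclRedOpDestroy(premulSum, comm);
  return ret;
}
```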

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1.
This provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649
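
A small illustrative sketch of the pattern (the helper name and thread function are assumptions, not NCCL code):

```c
/* Illustrative pattern: detach helper threads that are never joined so their
 * resources are reclaimed automatically (and ThreadSanitizer stays quiet). */
#include <pthread.h>

static int launchDetachedThread(void* (*fn)(void*), void* arg) {
  pthread_t t;
  int err = pthread_create(&t, NULL, fn, arg);
  if (err == 0) pthread_detach(t);
  return err;
}
```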

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
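
A minimal sketch of how the new ncclGetLastError() call might be used alongside ncclGetErrorString() (the reporting helper is illustrative; the communicator argument may be NULL):

```c
/* Illustrative error reporting using ncclGetLastError(). */
#include <nccl.h>
#include <stdio.h>

void reportNcclFailure(ncclResult_t res, ncclComm_t comm) {
  if (res == ncclSuccess) return;
  fprintf(stderr, "NCCL failure: %s\nLast error: %s\n",
          ncclGetErrorString(res), ncclGetLastError(comm));
}
```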

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
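
A hedged sketch of the new non-blocking init/finalize flow, assuming the unique ID, rank and size come from the usual bootstrap (helper names are illustrative):

```c
/* Illustrative non-blocking init and finalize flow (not library-internal code). */
#include <nccl.h>

ncclResult_t initCommNonBlocking(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;                       /* calls may return while still in progress */
  ncclResult_t ret = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (ret != ncclSuccess && ret != ncclInProgress) return ret;
  ncclResult_t state = ncclInProgress;
  while (state == ncclInProgress) ncclCommGetAsyncError(*comm, &state);
  return state;
}

void finalizeComm(ncclComm_t comm) {
  ncclCommFinalize(comm);                    /* flush outstanding operations */
  ncclCommDestroy(comm);
}
```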

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc be blocked by a NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes https://github.com/NVIDIA/nccl/issues/726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose
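
A minimal sketch of the shutdown-before-close pattern (the function name is illustrative, not the actual ncclSocketClose code):

```c
/* Illustrative shutdown-before-close, so the peer sees an orderly end of the
 * connection instead of an abrupt reset. */
#include <sys/socket.h>
#include <unistd.h>

static void closeConnection(int fd) {
  if (fd < 0) return;
  shutdown(fd, SHUT_RDWR);  /* signal EOF in both directions before releasing the fd */
  close(fd);
}
```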

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
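
A minimal usage sketch of the new ncclCommSplit() primitive, splitting a parent communicator into two halves (the helper name is illustrative; a NULL config inherits the parent settings):

```c
/* Illustrative split: ranks with the same color share a child communicator,
 * ordered by key. */
#include <nccl.h>

ncclResult_t splitIntoHalves(ncclComm_t parent, ncclComm_t* child) {
  int rank, nranks;
  ncclCommUserRank(parent, &rank);
  ncclCommCount(parent, &nranks);
  int color = (rank < nranks / 2) ? 0 : 1;
  return ncclCommSplit(parent, color, /*key=*/rank, child, /*config=*/NULL);
}
```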

* initial checkin

* fix the build issue when cuda version larger than 12

* enable ncv4 test scenarios

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* modify the test script to support training and inference test scenarios

* Prevent WR index truncation in the InfiniBand transport plugin

* fix build break of previous FI

* remove test related assert

* Create msccl-algorithms folder by default

* fix the memory access violation issue when using simple protocol

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000000.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: root <root@liand-h100-validation-vmss00000B.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>

* Msccl v2.18 (#10)

* Use semantic versioning

* Build SM 5.0 code

* Don't link tests with NVML

* Added Debian packaging files

* Update deb packaging scripts

* fix a typo in README.md

* Fixed deadlock in back-to-back reduce_scatters.

Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added support for more than 8 GPUs.

Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Moved tests to separate dir and improved MPI test

test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Enabled support for char type to be unsigned.

GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added NCCL error checking to tests.

Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Libwrap checks for LIB.so.1 if LIB.so not found

Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed buffer overflow in ReduceOrCopy

Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed useRemoteRecv consistency issue.

Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Updated package version, added manpage

* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and segfaulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is an undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was no…
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Jan 12, 2024
* Use semantic versioning

* Build SM 5.0 code

* Don't link tests with NVML

* Added Debian packaging files

* Update deb packaging scripts

* fix a typo in README.md

* Fixed deadlock in back-to-back reduce_scatters.

Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added support for more than 8 GPUs.

Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Moved tests to separate dir and improved MPI test

test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Enabled support for char type to be unsigned.

GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added NCCL error checking to tests.

Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Libwrap checks for LIB.so.1 if LIB.so not found

Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed buffer overflow in ReduceOrCopy

Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed useRemoteRecv consistency issue.

Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Updated package version, added manpage

* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that the cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and segfaulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is an undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, we could silently miss an EINTR/EAGAIN
return code.

Also rework the socket connection code and improve error reporting.

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with
multiple communicators. We now add commId info into pidHash, so that
the 'pidHash'es of different communicators living in the same process
are distinct from each other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing a performance issue in cases where different NICs are
used on a given channel.

* Fix wrong variable name "slice" to "chunk"

NVIDIA/nccl#287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to NVIDIA/nccl#560

ncclGroups containing operations of mixed datatype, element, or collective
type would cause a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most
of the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable NET transport for intra-node communication by setting the env to 1.
This provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc19.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes NVIDIA/nccl#649

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function.
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (NVIDIA/nccl#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE`, to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function.
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls calling cudaMalloc (including devCommSetup) are
called before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc be blocked by a NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes NVIDIA/nccl#726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since,
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* Prevent WR index truncation in the InfiniBand transport plugin

* Fix inter-node NVLS graph search

We were passing a net ID instead of a gpu index, which could cause
crashes if those were unrelated (and they usually are).

Issue #931

* 2.18.5-1

Fix NVLS search (issue #931).
Increase max IB NICs to 32.
Fix inconsistent device ordering (issue #820).
Try to use different devices for different GPUs in systems with
more than one NIC per GPU.

* Fix cudaMemcpyAsync bug

We were trying to use the result of the first cudaMemcpyAsync in the
second cudaMemcpyAsync without a sync in between. This patch fixes it
by allocating a CPU-side array to cache the device-side addresses, so
that this chain of dependent copies is avoided.

Fixes #957
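
A hedged sketch of the broken and fixed patterns, with illustrative names (not the actual NCCL code):

```c
/* Illustrative broken pattern: hostPtr is not valid yet when the second copy
 * is enqueued, because the first asynchronous copy has not completed.
 *
 *   cudaMemcpyAsync(&hostPtr, &devPtrTable[idx], sizeof(void*), cudaMemcpyDeviceToHost, stream);
 *   cudaMemcpyAsync(dstHost, hostPtr, bytes, cudaMemcpyDeviceToHost, stream);
 *
 * Illustrative fix: cache the device-side addresses in a host array up front,
 * so the second copy never depends on an in-flight copy. */
#include <cuda_runtime.h>

void copyViaHostCache(void** devPtrTable, void** hostPtrCache, int tableLen,
                      void* dstHost, size_t bytes, int idx, cudaStream_t stream) {
  cudaMemcpy(hostPtrCache, devPtrTable, tableLen * sizeof(void*), cudaMemcpyDeviceToHost);
  cudaMemcpyAsync(dstHost, hostPtrCache[idx], bytes, cudaMemcpyDeviceToHost, stream);
}
```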

* 2.19.1-1

Add local user buffer registration for NVLink SHARP.
Add tuning plugin support.
Increase net API to v7 to allow for device-side packet reordering;
remove support for v4 plugins.
Add support for RoCE ECE.
Add support for C2C links.
Better detect SHM allocation failures to avoid crash with Bus Error.
Fix missing thread unlocks in bootstrap (Fixes #936).
Disable network flush by default on H100.
Move device code from src/collectives/device to src/device.

* 2.19.3-1

H800/H100 fixes and tuning.
Re-enable intra-process direct pointer buffer access when CUMEM is
enabled.

* 2.18.6-1

* 2.19.4-1

Split transport connect phase into multiple steps to avoid port
exhaustion when connecting alltoall at large scale. Defaults to 128
peers per round.
Fix memory leaks on CUDA graph capture.
Fix alltoallv crash on self-sendrecv.
Make topology detection more deterministic when PCI speeds are not
available (fix issue #1020).
Properly close shared memory in NVLS resources.
Revert proxy detach after 5 seconds.
Add option to print progress during transport connect.
Add option to set NCCL_DEBUG to INFO on first WARN.

* Fix use of CPUID overwriting registers in use.

CPUID writes to EAX, EBX, ECX, and EDX, so the inline asm must state that.
Otherwise a register currently in use might get overwritten, which may
cause all kinds of failures such as segfaults or wrong results.

Alternatively, `__cpuid` can be used, which avoids this and related issues,
so do that as suggested in the GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513
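
For reference, a minimal sketch (assuming GCC or clang on x86) of the two safe alternatives: declaring all four registers as outputs of the inline asm, or using the `__cpuid` helper macro from `<cpuid.h>`.

```cpp
#include <cpuid.h>
#include <cstdint>
#include <cstdio>

static void cpuidAsm(uint32_t leaf, uint32_t* a, uint32_t* b, uint32_t* c, uint32_t* d) {
  // EAX, EBX, ECX and EDX are all listed as outputs, so the compiler knows
  // they get clobbered and will not keep live values in them across the asm.
  __asm__ volatile("cpuid"
                   : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                   : "a"(leaf), "c"(0u));
}

int main() {
  uint32_t a, b, c, d;
  cpuidAsm(0, &a, &b, &c, &d);
  __cpuid(0, a, b, c, d);  // same query through the compiler-provided macro
  printf("max basic CPUID leaf: %u\n", a);
  return 0;
}
```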

* resolve some msccl compatibility issues

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: Alexander Grund <[email protected]>
Andyli1007 added a commit to Azure/msccl-executor-nccl that referenced this pull request Feb 18, 2024
* Msccl v2.19 integrate (#40)

* Use semantic versioning

* Build SM 5.0 code

* Don't link tests with NVML

* Added Debian packaging files

* Update deb packaging scripts

* fix a typo in README.md

* Fixed deadlock in back-to-back reduce_scatters.

Change-Id: I92d32b15e516a39710b676aee692ae9b70638937
Reviewed-on: http://git-master/r/935458
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added support for more than 8 GPUs.

Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad
Reviewed-on: http://git-master/r/935459
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Moved tests to separate dir and improved MPI test

test sources moved to test/ directory.
MPI test displays PASS/FAIL and returns code accordingly.

Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b
Reviewed-on: http://git-master/r/936086
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Enabled support for char type to be unsigned.

GCC on POWER arch defines char type as unsigned.

Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae
Reviewed-on: http://git-master/r/999614
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Added NCCL error checking to tests.

Also cleaned up makefile so that tests and lib are not built unnecessarily.

Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262
Reviewed-on: http://git-master/r/999627
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Libwrap checks for LIB.so.1 if LIB.so not found

Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5
Reviewed-on: http://git-master/r/1000162
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed buffer overflow in ReduceOrCopy

Bug caused AllGathers and ReduceScatters of less than
8 bytes to fail in certain cases.

Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b
Reviewed-on: http://git-master/r/1011505
Reviewed-by: Przemek Tredak <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Fixed useRemoteRecv consistency issue.

Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2
Reviewed-on: http://git-master/r/1013543
Reviewed-by: Cliff Woolley <[email protected]>
Tested-by: Przemek Tredak <[email protected]>

* Updated package version, added manpage

* Moved release files to proper area

Bumping a version; building for 7.5

* Moved to pbuilder

* Preparing for pbuild

* Added compute 5.3

* Added files via upload

* Delete libnccl-dev_1.1.1+cuda75_amd64.deb

* Delete libnccl1_1.1.1+cuda75_amd64.deb

* Use arch=5.3 as well

* Version with . 7.5

* fixed version format

* Removing Tegra

* Enable compilation with old g++ when the default g++ is not supported (+5.0)

* Add --no-as-needed to make sure that cudart library gets linked

* Fix MPI test usage

Only display usage from rank 0 and exit instead of continuing (and segfaulting).

* Fix random deadlock during ncclCommInitRank.

* Fix readme to reflect the new test paths

* Moved no-as-needed flag to link rule.

Avoids link errors for tests linked with nvcc.

* Fixed install location, new .deb version

* Fixed version in ChangeLog

* Makefile improvements

 - Use standard CXX env var
 - Permit redefinition of more env
 - Separate lib from tests

* Removing unneeded includes

* Better name for GENCODE

* Bump to 1.2.2

* Gencodes changed to NV recommended

* Changed CURAND generator to work on a wider set of platforms.

* Make NCCL collectives work on communicators with only one rank

* Only call the CUDA runtime. That may fix #27.

* Updating for .deb rebuild

* Include link to blog post in README.md

* Rework debian packaging

* Fix make install to use BUILDDIR

* Move deb to build directory

* Packaging : Generate shlibs.local

* Increased version to 1.2.3

* Add a debug level to NCCL and CUDA versions at init

* Fix version number

* Improved Deb generation

* Fixed redundant contexts in multi-process apps

Change-Id: If787014450fd281304f0c7baf01d25963e40905d

* Remove unneeded deb build script

* link library with -lrt; otherwise there is undefined reference to shm_open

* pass devlist as const int* rather than int* in ncclCommInitAll

* Updated LICENCE.txt

* Update LICENSE.txt

* Fix MPI test path

* Add profiling API

* Heavy code refactoring to remove a lot of code in collectives (~1000 lines).

Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.

* Make tests check for deltas and report bandwidth

* Add scan tests

* Improved allreduce segmentation for small sizes

* NVML (libwrap) : import the needed definitions

* Fix primitives function prototype

* Bump to 1.3.1

* Add Fortran bindings

* Add Copyright header to Fortran bindings source files

* Remove irrelevant output from ncclReduce Fortran tests

* Add a static library target "staticlib" to the Makefile.

Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.

* Replace min BW by average BW in tests

* 1.3.2 release

Broadcast tuning
Better checking of inputs
Copy/reduce code simplification

* Adding missing file

* Fix 1.3.2 compilation

* Qualify nullptr_t with std::.

* Fix crash in Reduce when non-root ranks have invalid recvbuff

* Fix copy/paste typo in error message

* Only enable peer access for ring neighbors.

This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.

* Bumping version to 1.3.3

* Fix compilation error when compiling with 'clang -x cuda'.

Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.
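
For illustration, a reduced sketch of the lookup rule in question (names mirror the commit, but the code is simplified): with a plain type such as float there are no associated namespaces, so argument-dependent lookup finds nothing and the overload must be visible before the template that calls it.

```cpp
// Declaration placed before the template that uses it, so clang's two-phase
// lookup can resolve the dependent call at the template definition.
template<typename T> T vFetch(const volatile T* ptr);

template<typename T>
void ReduceCopy(volatile T* dst, const volatile T* src) {
  *dst = vFetch(src);  // found via ordinary lookup thanks to the declaration above
}

// The definition can follow later; older GCC tolerated having only this,
// clang does not.
template<typename T> T vFetch(const volatile T* ptr) { return *ptr; }
```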

* Added Pascal nvcc flags, bumped version

* Add support for CUDA9 half semantics

* Update README to link to NCCL2

* Update README to link to NCCL2   #2

* Update README to link to NCCL2 part 3

* Update README to link to NCCL2

* fix tests on maxwell

* 2.3.5-5

Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.

* Fix nccl-tests all_reduce_perf path

It's `all_reduce_perf` not `allreduce_perf`

* 2.3.7-1

Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.

* Add install target

Fix issue #145

* Add instructions to install packaging toolchain

Address #143 and #150 : debuild not installed.

* Add official builds download link

* Generate nccl.h in build instead of src

Generating nccl.h in src makes source directories dirty after builds.

* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)

* Add support for external network.

Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network

* Make network isend/irecv non blocking

* Improve net API description

* Rework SYSCHECK macros to better handle retries.

SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.

Also rework the socket connection code and improve error reporting.
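
As an illustration of the retry behavior described above (a sketch only; the real SYSCHECK macros differ), a helper that retries the call itself when it is interrupted:

```cpp
#include <cerrno>
#include <unistd.h>

// Retry the expression while it fails with EINTR so an interrupted syscall is
// never reported as an error; EAGAIN on non-blocking sockets is left to the
// caller, which typically waits (e.g. poll) before retrying.
#define RETRY_EINTR(res, call) \
  do { (res) = (call); } while ((res) == -1 && errno == EINTR)

static ssize_t writeAll(int fd, const char* buf, size_t len) {
  size_t off = 0;
  while (off < len) {
    ssize_t n;
    RETRY_EINTR(n, write(fd, buf + off, len - off));
    if (n <= 0) return -1;  // a real error: surface it to the caller
    off += (size_t)n;
  }
  return (ssize_t)off;
}
```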

* Rework shared memory code to use SYSCHECK macros.

This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.

Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
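
A minimal sketch of such wrappers (illustrative names, not the actual NCCL code): posix_fallocate() reports errors through its return value without touching errno, and mmap() signals failure with MAP_FAILED, so neither fits a plain "-1 and errno" check.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <sys/mman.h>

static int fallocateWrap(int fd, off_t offset, off_t len) {
  int err = posix_fallocate(fd, offset, len);  // returns the error code directly
  if (err != 0) { errno = err; return -1; }    // translate to the -1/errno convention
  return 0;
}

static int mmapWrap(void** ptr, size_t size, int prot, int flags, int fd) {
  void* p = mmap(nullptr, size, prot, flags, fd, 0);
  if (p == MAP_FAILED) return -1;              // errno is already set by mmap
  *ptr = p;
  return 0;
}
```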

* Fixed some compilation errors when TRACE=1 set

* Improve INFO message when external network is not found.

Fix #162

* Add NCCL_NET flag to many debug lines.

* Fix GPU Direct RDMA detection.

Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.

* Remove error logging from a normal path

When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)

* Fix dummy plugin

* Fix #163 : remove warnings

* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.

* Two temporary workarounds for cuda-clang issues.

* Qualify nullptr_t with std::

* Replace CUDA_VERSION by CUDART_VERSION

* Fix memory leak in bootstrapRoot()

* 2.4.2-1

Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.

* Fix crash during shared memory creation (#185)

The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <[email protected]>

* Fix share memory collision in multi-communicator case.

The current SHM object name only used pidHash and ranks as
identification, which could collide when a program runs with multiple
communicators. We now add commId info into pidHash, so the pidHashes of
different communicators kept in the same process are distinct from each
other.

* NCCL 2.4.6-1

    Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc

* Add pkgconfig file (#190)

* Allow CUDA runtime library selection (#220)

Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.

* NCCL 2.4.7-1

    Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection

* Update debian dependencies in README (#228)

'fakeroot' is needed for building deb packages

* Fix out-of-bounds read in ncclStrToCpuset (#233)

The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <[email protected]>
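
For context, a minimal sketch of this bug class (names illustrative, not the actual ncclStrToCpuset code): a fixed-size buffer filled without a terminating NUL makes strlen() read past the end of the array.

```cpp
#include <cstdio>
#include <cstring>

static void copyCpusetString(char* out, size_t outLen, const char* in) {
  // Buggy: strncpy() leaves the destination unterminated whenever the source
  // is at least outLen bytes long, so a later strlen(out) can run off the end.
  //   strncpy(out, in, outLen);
  //   size_t n = strlen(out);   // potential out-of-bounds read

  // Fixed: always write the terminator.
  snprintf(out, outLen, "%s", in);  // snprintf NUL-terminates (for outLen > 0)
  size_t n = strlen(out);           // now safe
  (void)n;
}
```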

* 2.4.8-1

Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly

* Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)

Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.

* Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization

* Fix #224: prevent number of IB devices from going out of bound

* Fix NIC distances for 11+ NICs

* Refine RPM package building spec file.

Add /sbin/ldconfig into RPM package install operations.

* Make use of SO_REUSEPORT conditional

Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.

* Updated PR#196 to use a common hash function

* 2.5.6-1 (#255)

Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP

* Fix clang build (#271)

Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.

* Fix clang compilation

* 2.5.6-2

Fix PPC64 Debian packaging

* Fix clang build (#274)

The attribute is called `optnone`, not `noopt`.
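
Taken together, the two clang-build fixes above amount to a compiler-dispatch macro along these lines (the macro name here is illustrative): GCC understands `__attribute__((optimize("O0")))` while clang uses `__attribute__((optnone))`.

```cpp
#if defined(__clang__)
#define NO_OPTIMIZE __attribute__((optnone))
#elif defined(__GNUC__)
#define NO_OPTIMIZE __attribute__((optimize("O0")))
#else
#define NO_OPTIMIZE
#endif

// Example: keep a debug helper unoptimized regardless of the build flags.
NO_OPTIMIZE static int debugProbe(volatile int* p) { return *p; }
```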

* [build] Allow setting CXXFLAGS on the command line

* [topology] remove NET links when trimming system

This fixes a memory leak.

* 2.5.7-1

* Fix Allgather operations above 4G with multiple GPUs per process.

Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.

* Check return code for Flush operation

Current NCCL code does not abort for failed Flush operations by the
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.6.4-1

Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.

* Fix bug #307 : wrong NIC selection on the reduction tree.

The reduction tree (tree up) was inverting the NICs to use,
causing performance issues in cases where we are using different
NICs on a given channel.

* Fix wrong variable name "slice" to "chunk"

NVIDIA/nccl#287

* Improve robustness of PCI detection

Fallback to default values when class/speed is unknown.

* Fix crash when only a subset of GPUs are visible within a container.

Fixes #326.

* 2.7.3-1

Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).

* 2.7.5-1

Minor fixes for A100 platforms.
Add a WARN for invalid GroupEnd call.

* 2.7.6-1

Fix crash when NVswitch is not visible inside a VM.

* Fix build action order

Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB).
As there were no dep files during the first build, Make may kick off source
compilation before nccl.h got generated, which leads to occasional build
failures on systems with high core count. The build failure could be
reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule.

* 2.7.8-1

Fix collective mismatch error when using ncclSend/ncclRecv

* Don't require NIC devices to have specific PCI class

If a PCI node is the parent of a NIC, treat it as such, regardless of
the PCI class code for the device. This allows non-traditional devices
to act as NICs via the net plugin mechanism.

For consistency, treat GPUs similarly.

* Setting type when gpu sub node is discovered

* Make sure proxy threads inherit the CPU affinity.

* Fix affinity move

* fix proxyArgs for trace log

* 2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.

* x86: Add CPU detection for Zhaoxin processors

Signed-off-by: Jonas Zhou <[email protected]>

* 2.8.4-1

Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.

* 2.9.6-1

Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.

* 2.9.8-1

Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.

* 2.9.9-1

Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)

* 2.10.3-1

Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for tree.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.

* Fix to NVIDIA/nccl#560

ncclGroups containing operations of mixed datatype, element, or collective
would induce a crash.

* 2.11.4-1

Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum; a usage sketch follows this entry).
Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration.
Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so".
Fix memory leak of NVB connections.
Fix topology detection of IB Virtual Functions (SR-IOV).
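
As a hedged usage sketch of ncclRedOpCreatePreMulSum mentioned in the entry above (the helper function below and its error handling are illustrative; comm, stream and buffers are assumed to exist):

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

ncclResult_t premulAllreduce(const float* sendbuf, float* recvbuf, size_t count,
                             float scale, ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t op;
  // Each rank passes its own scalar; ncclScalarHostImmediate captures the
  // host-side value at call time.
  ncclResult_t res = ncclRedOpCreatePreMulSum(&op, &scale, ncclFloat,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm);  // custom ops are per-communicator and must be freed
  return res;
}
```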

* Fix Collnet when GDR is disabled

* Fix compilation failure in "src/enqueue.cc" on older GCC because of
missing `#include <cstring>`.

* Perform `busIdToInt64` on the stack.

I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is
called thousands of times, making the log output unintelligible.
Fortunately, this function can be implemented without heap allocations.
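
For illustration, a heap-free version of that kind of conversion might look like the sketch below (the real NCCL helper may differ in detail); it parses a PCI busId such as "0000:c1:00.0" into an int64 using only stack variables.

```cpp
#include <cstdint>

static int64_t busIdToInt64Sketch(const char* busId) {
  int64_t id = 0;
  for (const char* p = busId; *p; ++p) {
    char c = *p;
    if (c == ':' || c == '.') continue;      // separators carry no value
    int64_t digit;
    if (c >= '0' && c <= '9') digit = c - '0';
    else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
    else break;                              // stop at anything unexpected
    id = (id << 4) | digit;                  // accumulate one hex nibble
  }
  return id;
}
```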

* Improve warning message about truncated messages

Display hints about the cause so that it is easier for the user to debug.
Also change the error type from InternalError to InvalidUsage, as most of
the time this is caused by a mismatch in collective size or env settings.

* Add env NCCL_NET_DISABLE_INTRA

Disable the NET transport for intra-node communication by setting the env to 1.
It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.

* Build fastsocket plugin from ext-net

* remove unused basePath

* Revert "remove unused basePath"

This reverts commit 445bc19.

* Fix ext-net/google-fastsocket build

* Split IB parameter sanity check into two parts

First part on collective mismatch, second part on internal errors

* 2.12.7-1

Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.

* Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes NVIDIA/nccl#649
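
A minimal sketch of the pattern (illustrative names): a fire-and-forget service thread detached right after creation, so its resources are reclaimed automatically and ThreadSanitizer does not report a missing pthread_join().

```cpp
#include <pthread.h>

static void* serviceLoop(void* arg) {
  (void)arg;
  // ... poll sockets / make progress until told to stop ...
  return nullptr;
}

static int startDetachedService() {
  pthread_t t;
  if (pthread_create(&t, nullptr, serviceLoop, nullptr) != 0) return -1;
  pthread_detach(t);  // never joined, so detach to avoid leaking thread state
  return 0;
}
```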

* Remove unnecessary newline in plugin logging

Signed-off-by: Felix Abecassis <[email protected]>

* Fix typo in net_ib.cc

* Display host name instead of numeric IP when referring to a peer

For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"

* Fix merging error

* 2.12.10-1

Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.

* Update Makefile to install static library.

Make sure make install also installs the static library. 
Fixes #662

* 2.12.12-1

Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.

* 2.13.4-1

Optimize CUDA graph launch; avoid launching a CPU callback for
intra-node operations.
Simplify kernel common code to improve the latency of send/recv
operations.
Strengthen CUDA streams semantics.
Change NET API to v6, to add dmabuf support.
Add ncclGetLastError() function (usage sketch after this entry).
Add ncclRemoteError code and use it for remote network errors.
Support the use of a different NCCL_NET parameter per communicator.
Add support for SHM and P2P transfers using cudaMemcpy.
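
As a hedged sketch of how the ncclGetLastError() addition above can be used (the NCCLCHECK macro and its error policy are illustrative, not part of NCCL):

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

#define NCCLCHECK(comm, call)                                           \
  do {                                                                  \
    ncclResult_t r = (call);                                            \
    if (r != ncclSuccess) {                                             \
      fprintf(stderr, "NCCL failure %s: %s\n", ncclGetErrorString(r),   \
              ncclGetLastError(comm));  /* human-readable detail */     \
      return r;                                                         \
    }                                                                   \
  } while (0)

ncclResult_t doAllReduce(const void* sendbuf, void* recvbuf, size_t count,
                         ncclComm_t comm, cudaStream_t stream) {
  NCCLCHECK(comm, ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum,
                                comm, stream));
  return ncclSuccess;
}
```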

* fix NCCL_DEBUG_FILE

Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (NVIDIA/nccl#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves parsing `tempNcclDebugLevel` before processing `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior)

Differential Revision: D38415208

fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30

* 2.14.3-1

Add support for improved fault tolerance: non-blocking mode, new
init function with config, and ncclCommFinalize function (sketch after this entry).
Reintroduce collnet+chain algorithm, alongside collnet+direct.
Add LL protocol for intra-node P2P (on by default) and network
communication (off by default).
Use network instead of shared memory when performance is better.
Fix: wait for CUDA graph destroy before destroying comm with linked
graph resources.
Remove aggressive polling during enqueue.
Fix DMABUF fallback on MOFED 5.4 and earlier.
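
A hedged sketch of the non-blocking init path listed in this entry (the polling loop is simplified and illustrative; a real application would bound it and call ncclCommAbort() on timeout):

```cpp
#include <nccl.h>

ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;  // return immediately, then poll for completion
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  // Poll until the communicator is ready.
  do {
    ncclCommGetAsyncError(*comm, &res);
  } while (res == ncclInProgress);
  return res;
}
```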

* Fix intermittent 11.6 builds: generate unique .cu file for each object file

* address review comments

* Fix potential deadlock during init in multi-thread mode.

Make sure all calls that invoke cudaMalloc (including devCommSetup) are
made before the last bootstrapBarrier. That way, we avoid calls to
cudaMalloc being blocked by an NCCL kernel launched on another GPU by
another thread which completed init faster.

Resolve #623.

* Use compatibility shim only with static cudart

Closes issue 658

* 2.15.1-1

Add support for H100 (sm90).
Make sure NCCL kernel honor user stream priorities.

* Fixes a double-free in the error path of ncclCommInitAll.

Fixes NVIDIA/nccl#726

* 2.15.5-1

Fix crash with CollnetChain on some node topologies
Fix hang when interleaving the capture of different graphs
Fix hang during init in multi-threaded mode
Fix potential data corruption with LL128 protocol on unaligned buffers.
Fix CPU usage during preconnect
Fixes double-free in the error path for ncclCommInitAll
Workaround hang on H100 with Ring/LL128 on 2 GPUs.

* Add documentation for NCCL NET plugins

Also repurpose dummy plugin as example, including headers and
compat layers from v6 to v2.

* Fix google-fastsocket plugin build

* 2.16.2-1

Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.

* Fix maximum handle size for NCCL Net v4 API

NCCL Net v4 supports a maximum handle size of 64 bytes whereas the
ext-net example header files set it for NCCL Net v3. Since the
`aws-ofi-nccl` plugin plans to follow the example header files, fix it
here.

Signed-off-by: Rashika Kheria <[email protected]>

* 2.16.5-1

Add support for 400Gbit NDR network adapters (CX7)
Handle EINTR in socket poll() function
Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead
Resource cleanup fixes
Fix double free in case of init failure
Fix crash in ncclCommAbort
Revert AMD speed commit

* 2.17.1-1

Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only).
Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName.
Enable LL128 when we use PXN to close rings.
NVTX3 includes update.
Fix crash when one CollNet (SHARP) rail fails to initialize.

* Shutdown socket before close in ncclSocketClose()

* Add a comment to shutdown() in ncclSocketClose

* 2.18.1-1

Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.

* 2.18.3-1

Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC.
Fix hang with Collnet on bfloat16 on systems with less than one NIC
per GPU.
Fix long initialization time.
Fix data corruption with Collnet when mixing multi-process and
multi-GPU per process.
Fix crash when shared memory creation fails.
Fix Avg operation with Collnet/Chain.
Fix performance of alltoall at scale with more than one NIC per GPU.
Fix performance for DGX H800.
Fix race condition in connection progress causing a crash.
Fix network flush with Collnet.
Fix performance of aggregated allGather/reduceScatter operations.
Fix PXN operation when CUDA_VISIBLE_DEVICES is set.
Fix NVTX3 compilation issues on Debian 10.

* Prevent WR index truncation in the InfiniBand transport plugin

* Fix inter-node NVLS graph search

We were passing a net ID instead of a gpu index, which could cause
crashes if those were unrelated (and they usually are).

Issue #931

* 2.18.5-1

Fix NVLS search (issue #931).
Increase max IB NICs to 32.
Fix inconsistent device ordering (issue #820).
Try to use different devices for different GPUs in systems with
more than one NIC per GPU.

* Fix cudaMemcpyAsync bug

The copy result of the first cudaMemcpyAsync was used as an argument to
the second cudaMemcpyAsync without a synchronization in between. This
patch fixes it by allocating a CPU-side array that caches the device-side
addresses, so the back-to-back CUDA memory copies are no longer needed.

Fixes #957

* 2.19.1-1

Add local user buffer registration for NVLink SHARP.
Add tuning plugin support.
Increase net API to v7 to allow for device-side packet reordering;
remove support for v4 plugins.
Add support for RoCE ECE.
Add support for C2C links.
Better detect SHM allocation failures to avoid crash with Bus Error.
Fix missing thread unlocks in bootstrap (Fixes #936).
Disable network flush by default on H100.
Move device code from src/collectives/device to src/device.

* 2.19.3-1

H800/H100 fixes and tuning.
Re-enable intra-process direct pointer buffer access when CUMEM is
enabled.

* 2.18.6-1

* 2.19.4-1

Split transport connect phase into multiple steps to avoid port
exhaustion when connecting alltoall at large scale. Defaults to 128
peers per round.
Fix memory leaks on CUDA graph capture.
Fix alltoallv crash on self-sendrecv.
Make topology detection more deterministic when PCI speeds are not
available (fix issue #1020).
Properly close shared memory in NVLS resources.
Revert proxy detach after 5 seconds.
Add option to print progress during transport connect.
Add option to set NCCL_DEBUG to INFO on first WARN.

* Fix use of CPUID overwriting registers in use.

CPUID writes to EAX, EBX, ECX, and EDX, so the inline asm must state that.
Otherwise a register currently in use might get overwritten, which may
cause all kinds of failures such as segfaults or wrong results.

Alternatively, `__cpuid` can be used, which avoids this and related issues,
so do that as suggested in the GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112513

* resolve some msccl compatibility issues

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: Alexander Grund <[email protected]>

* fix some integration bugs

* fix the sync flag release issue for msccl

* fix the msccl resource issue

* fix the correctness issue of op max

* remove indent before #if

* add alltoall support

* remove fp8 from nvls scenario

---------

Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Kaiyu Yang <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: Kaiming Ouyang <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Dmitrii Gabor <[email protected]>
Co-authored-by: Alexander Grund <[email protected]>