Create msccl-algorithms folder by default #11
Merged
Conversation
Andyli1007 approved these changes on Jul 17, 2023
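The change under review is small: make sure the folder MSCCL loads algorithm files from exists on a fresh install. A minimal sketch of the idea, assuming a default directory name of `msccl-algorithms` and a hypothetical `MSCCL_ALGO_DIR` override (neither is necessarily the PR's exact code):

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Sketch only: create the algorithms folder if it is missing.
 * MSCCL_ALGO_DIR is a hypothetical override, not a documented variable. */
static int ensureAlgoDir(void) {
  const char* dir = getenv("MSCCL_ALGO_DIR");
  if (dir == NULL) dir = "msccl-algorithms"; /* assumed default location */
  if (mkdir(dir, 0755) != 0 && errno != EEXIST) {
    fprintf(stderr, "failed to create %s: %s\n", dir, strerror(errno));
    return -1;
  }
  return 0; /* directory exists or was just created */
}
```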
Andyli1007 added a commit that referenced this pull request on Sep 11, 2023
* Fixed deadlock in back-to-back reduce_scatters. Change-Id: I92d32b15e516a39710b676aee692ae9b70638937 Reviewed-on: http://git-master/r/935458 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]>
* Added support for more than 8 GPUs. Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad Reviewed-on: http://git-master/r/935459 Reviewed-by: Cliff Woolley <[email protected]> Tested-by: Przemek Tredak <[email protected]>
* Moved tests to a separate dir and improved the MPI test. Test sources moved to the test/ directory. The MPI test displays PASS/FAIL and returns a code accordingly. Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b Reviewed-on: http://git-master/r/936086 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]>
* Enabled support for the char type to be unsigned. GCC on the POWER arch defines the char type as unsigned. Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae Reviewed-on: http://git-master/r/999614 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]>
* Added NCCL error checking to tests. Also cleaned up the makefile so that tests and lib are not built unnecessarily. Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262 Reviewed-on: http://git-master/r/999627 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]>
* Libwrap checks for LIB.so.1 if LIB.so is not found (sketch below). Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5 Reviewed-on: http://git-master/r/1000162 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]>
* Fixed buffer overflow in ReduceOrCopy. The bug caused AllGathers and ReduceScatters of less than 8 bytes to fail in certain cases. Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b Reviewed-on: http://git-master/r/1011505 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]>
* Fixed useRemoteRecv consistency issue. Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2 Reviewed-on: http://git-master/r/1013543 Reviewed-by: Cliff Woolley <[email protected]> Tested-by: Przemek Tredak <[email protected]>
* Updated package version, added manpage
* Moved release files to the proper area. Bumping a version; building for 7.5.
* Moved to pbuilder
* Preparing for pbuild
* Added compute 5.3
* Added files via upload
* Delete libnccl-dev_1.1.1+cuda75_amd64.deb
* Delete libnccl1_1.1.1+cuda75_amd64.deb
* Use arch=5.3 as well
* Version with . 7.5
* fixed version format
* Removing Tegra
* Enable compilation with old g++ when the default g++ is not supported (+5.0)
* Add --no-as-needed to make sure that the cudart library gets linked
* Fix MPI test usage: only display usage from rank 0 and exit instead of continuing (and seg-faulting).
* Fix random deadlock during ncclCommInitRank.
* Fix readme to reflect the new test paths
* Moved the no-as-needed flag to the link rule. Avoids link errors for tests linked with nvcc.
* Fixed install location, new .deb version
* Fixed version in ChangeLog
* Makefile improvements: use the standard CXX env var, permit redefinition of more env vars, separate lib from tests.
* Removing unneeded includes
* Better name for GENCODE
* Bump to 1.2.2
* Gencodes changed to NV recommended
* Changed the CURAND generator to work on a wider set of platforms.
* Make NCCL collectives work on communicators with only one rank
* Only call the CUDA runtime. That may fix #27.
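The libwrap entry above is a classic dynamic-loading fallback: distribution packages often ship only the versioned soname (`LIB.so.1`), while the unversioned `LIB.so` symlink comes with the -dev package. A sketch of the pattern, with an illustrative wrapper name:

```c
#include <dlfcn.h>
#include <stdio.h>

/* Try the unversioned name first; fall back to the versioned soname,
 * which is present even without the -dev package installed. */
static void* openLibWithFallback(const char* base, const char* versioned) {
  void* handle = dlopen(base, RTLD_NOW);
  if (handle == NULL) {
    fprintf(stderr, "dlopen(%s) failed: %s; trying %s\n", base, dlerror(), versioned);
    handle = dlopen(versioned, RTLD_NOW);
  }
  return handle; /* NULL if both attempts failed */
}
/* e.g. openLibWithFallback("libnvidia-ml.so", "libnvidia-ml.so.1") */
```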
* Updating for .deb rebuild
* Include link to blog post in README.md
* Rework debian packaging
* Fix make install to use BUILDDIR
* Move deb to build directory
* Packaging: Generate shlibs.local
* Increased version to 1.2.3
* Add a debug level to NCCL and CUDA versions at init
* Fix version number
* Improved Deb generation
* Fixed redundant contexts in multi-process apps. Change-Id: If787014450fd281304f0c7baf01d25963e40905d
* Remove unneeded deb build script
* link library with -lrt; otherwise there is an undefined reference to shm_open
* pass devlist as const int* rather than int* in ncclCommInitAll
* Updated LICENCE.txt
* Update LICENSE.txt
* Fix MPI test path
* Add profiling API
* Heavy code refactoring to remove a lot of code in collectives (~1000 lines). Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.
* Make tests check for deltas and report bandwidth
* Add scan tests
* Improved allreduce segmentation for small sizes
* NVML (libwrap): import the needed definitions
* Fix primitives function prototype
* Bump to 1.3.1
* Add Fortran bindings
* Add Copyright header to Fortran bindings source files
* Remove irrelevant output from ncclReduce Fortran tests
* Add a static library target "staticlib" to the Makefile. Rename the static library "libnccl_static.a" to disambiguate from the dynamic libraries.
* Replace min BW by average BW in tests
* 1.3.2 release: Broadcast tuning. Better checking of inputs. Copy/reduce code simplification.
* Adding missing file
* Fix 1.3.2 compilation
* Qualify nullptr_t with std::.
* Fix crash in Reduce when non-root ranks have invalid recvbuff
* Fix copy/paste typo in error message
* Only enable peer access for ring neighbors. This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.
* Bumping version to 1.3.3
* Fix compilation error when compiling with 'clang -x cuda'. Functions vFetch and vStore are not found by ADL with clang, so they need to be declared before usage in ReduceCopy.
* Added Pascal nvcc flags, bumped version
* Add support for CUDA9 half semantics
* Update README to link to NCCL2
* Update README to link to NCCL2 #2
* Update README to link to NCCL2 part 3
* Update README to link to NCCL2
* fix tests on maxwell
* 2.3.5-5: Add support for inter-node communication using sockets and InfiniBand/RoCE. Improve latency. Add support for aggregation. Improve LL/regular tuning. Remove tests as those are now at github.com/nvidia/nccl-tests.
* Fix nccl-tests all_reduce_perf path. It's `all_reduce_perf`, not `allreduce_perf`.
* 2.3.7-1: Improved LL tuning for multi-node jobs. Improved bootstrap for large job scaling. Fixed a hang during bootstrap due to socket reuse. Added operation name to the COLL INFO logging.
* Add install target. Fix issue #145.
* Add instructions to install packaging toolchain. Address #143 and #150: debuild not installed.
* Add official builds download link
* Generate nccl.h in build instead of src. Generating nccl.h in src makes source directories dirty after builds.
* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)
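The host-hash entry just above distinguishes hosts by their uts and mnt namespace links rather than by hostname alone, so two containers sharing a hostname are not mistaken for the same host. A sketch of reading those links, with djb2 standing in for whatever hash NCCL actually uses:

```c
#include <stddef.h>
#include <unistd.h>

/* Concatenate the targets of /proc/self/ns/{uts,mnt} and hash them.
 * Containers sharing a hostname still differ in at least one link. */
static unsigned long hostNamespaceHash(void) {
  const char* links[] = { "/proc/self/ns/uts", "/proc/self/ns/mnt" };
  char buf[512];
  size_t off = 0;
  for (int i = 0; i < 2; i++) {
    ssize_t n = readlink(links[i], buf + off, sizeof(buf) - off - 1);
    if (n > 0) off += (size_t)n;
  }
  unsigned long h = 5381; /* djb2, for illustration only */
  for (size_t i = 0; i < off; i++) h = h * 33 + (unsigned char)buf[i];
  return h;
}
```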
* Add support for external network. Dynamically load the external network from libnccl-net.so. Add an init function in networks. Move PCI scoring to net.cu; only ask transport to provide a path. Simplify CUDA PCI path detection. Add dummy external network.
* Make network isend/irecv non-blocking
* Improve net API description
* Rework SYSCHECK macros to better handle retries (sketch below). SYSCHECKVAL was not retrying when a retry was needed. Since not all calls are inside a loop, that means we could silently miss an EINTR/EAGAIN return code. Also rework the socket connection code and improve error reporting.
* Rework shared memory code to use SYSCHECK macros. This is to handle EINTR/EAGAIN properly (issue #137), and also make the code consistent with the rest. Unfortunately posix_fallocate and mmap do not follow the classic return code/errno pattern, so we need to write wrappers around those functions.
* Fixed some compilation errors when TRACE=1 is set
* Improve INFO message when the external network is not found. Fix #162.
* Add NCCL_NET flag to many debug lines.
* Fix GPU Direct RDMA detection. Whether the network supported GPU Direct RDMA or not was ignored, causing sockets to break when cards were local enough that NCCL tried to use it.
* Remove error logging from a normal path. When initNet fails, we should not print the backtrace, as it is supposed to be normal operation (falling back to sockets).
* Fix dummy plugin
* Fix #163: remove warnings
* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.
* Two temporary workarounds for cuda-clang issues.
* Qualify nullptr_t with std::
* Replace CUDA_VERSION by CUDART_VERSION
* Fix memory leak in bootstrapRoot()
* 2.4.2-1: Add tree algorithms for allreduce to improve performance at scale. Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle network errors and permit recovery. Detect initial CPU affinity and no longer escape it.
* Fix crash during shared memory creation (#185). The shared memory filename was only based on the destination. While this was OK for rings, since only one rank would send data to a given rank, it would crash with trees because they communicate in both directions. Co-authored-by: Rong Ou <[email protected]>
* Fix shared memory collision in the multi-communicator case. The current SHM object name used only pidHash and ranks as identification, which would collide when a program runs with multiple communicators. Adding commId info into pidHash keeps the pidHashes of different communicators in the same process distinct from each other.
* NCCL 2.4.6-1: Added detection of IBM/Power NVLink bridge device. Add NUMA support to PCI distance calculations. Added NCCL_IGNORE_CPU_AFFINITY env var. Fix memory leaks (GitHub issue #180). Compiler warning fix (GitHub issue #178). Replace non-standard variable length arrays (GitHub issue #171). Fix Tree+Shared Memory crash (GitHub PR #185). Fix LL cleanup hang during long-running DL jobs. Fix NCCL_RINGS environment variable handling. Added extra checks to catch repeat calls to ncclCommDestroy() (GitHub issue #191). Improve bootstrap socket connection reliability at scale. Fix hostname hashing issue (GitHub issue #187). Code cleanup to rename all non-device files from *.cu to *.cc.
* Add pkgconfig file (#190)
* Allow CUDA runtime library selection (#220). Makes a change to allow the user to select between the static CUDA runtime library (default) and the dynamic CUDA runtime library. Does this by allowing `CUDARTLIB` to be overridden.
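The SYSCHECK rework described above centralizes retry-on-EINTR/EAGAIN instead of relying on every call site sitting in a loop. A minimal sketch of such a macro; the name is illustrative, not NCCL's exact SYSCHECK/SYSCHECKVAL definition:

```c
#include <errno.h>

/* Re-issue a syscall while it fails with EINTR or EAGAIN; any other
 * failure is left for the caller to handle. Illustrative only. */
#define RETRY_SYSCALL(call, ret) do { \
    (ret) = (call);                   \
  } while ((ret) == -1 && (errno == EINTR || errno == EAGAIN))

/* Usage:
 *   ssize_t n;
 *   RETRY_SYSCALL(read(fd, buf, len), n);
 *   if (n == -1) { perror("read"); }
 */
```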
* NCCL 2.4.7-1: Performance tweaks for PowerPC builds only. Set default NCCL_MIN_NRINGS to 4. Disable PCI-E NUMA distance detection.
* Update debian dependencies in README (#228). 'fakeroot' is needed for building deb packages.
* Fix out-of-bounds read in ncclStrToCpuset (#233). The affinityStr string was not null-terminated but was passed to strlen(3). Signed-off-by: Felix Abecassis <[email protected]>
* 2.4.8-1: Fix #209: improve socket transport performance. Split transfers over multiple sockets. Launch multiple threads to drive sockets. Detect AWS NICs and set nsockets/nthreads accordingly.
* Add the exact matching modifier "=" to the NCCL_IB_HCA variable (#236). Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable, to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.
* Size up IPC buffers to multiples of 2MB. Avoid potential CUDA error in concurrent communicator initialization.
* Fix #224: prevent number of IB devices from going out of bounds
* Fix NIC distances for 11+ NICs
* Refine RPM package building spec file. Add /sbin/ldconfig into RPM package install operations.
* Make use of SO_REUSEPORT conditional (sketch below). Fixes #244. SO_REUSEPORT was introduced in Linux 3.9 and later; this change allows NCCL to compile against older releases. The functionality is only required if the user is specifying a NCCL bootstrap address via an environment variable.
* Updated PR#196 to use a common hash function
* 2.5.6-1 (#255): Add LL128 protocol. Rewrite the topology detection and tree/ring creation (#179). Improve tree performance by sending/receiving from different GPUs. Add model-based tuning to switch between the different algorithms and protocols. Rework P2P/SHM detection in containers (#155, #248). Detect duplicated devices and return an error (#231). Add tuning for GCP.
* Fix clang build (#271). Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.
* Fix clang compilation
* 2.5.6-2: Fix PPC64 Debian packaging
* Fix clang build (#274). The attribute is called `optnone`, not `noopt`.
* [build] Allow setting CXXFLAGS on the command line
* [topology] Remove NET links when trimming system. This fixes a memory leak.
* 2.5.7-1
* Fix Allgather operations above 4G with multiple GPUs per process. Fixes nccl-tests#37. Direct offsets were still on 32 bits in the low-level primitives.
* Check return code for Flush operation. Current NCCL code does not abort for failed Flush operations by the underlying network. This may compromise data integrity. Signed-off-by: Rashika Kheria <[email protected]>
* 2.6.4-1: Add support for network collectives. Add support for XML topology dump/injection. Add text values for GDR and P2P levels, including "NVL". Add speed detection for PCI, InfiniBand and Ethernet cards. Add CPU detection for ARM and AMD CPUs. Add support for adaptive routing on InfiniBand. Change NET plugin API to v3: merge PCI path and GPU pointer capability into a single structure and add other properties.
* Fix bug #307: wrong NIC selection on the reduction tree. The reduction tree (tree up) was inverting the NICs to use, causing a performance issue in cases where we are using different NICs on a given channel.
* Fix wrong variable name "slice" to "chunk". https://github.com/NVIDIA/nccl/issues/287
* Improve robustness of PCI detection. Fall back to default values when class/speed is unknown.
* Fix crash when only a subset of GPUs are visible within a container. Fixes #326.
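The SO_REUSEPORT entry is a pure conditional-compilation fix: the option appeared in Linux 3.9, so older headers simply do not define it. A sketch of the guard:

```c
#include <sys/socket.h>

/* Request port reuse only when the platform defines the option;
 * on pre-3.9 kernels/headers this compiles to a no-op. */
static int enableReusePort(int fd) {
#ifdef SO_REUSEPORT
  int one = 1;
  return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
#else
  (void)fd;
  return 0; /* feature unavailable; only needed for fixed bootstrap ports */
#endif
}
```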
* 2.7.3-1: Add support for A100 GPU and related platforms. Add support for CUDA 11. Add support for send/receive operations (beta).
* 2.7.5-1: Minor fixes for A100 platforms. Add a WARN for invalid GroupEnd call.
* 2.7.6-1: Fix crash when NVswitch is not visible inside a VM.
* Fix build action order. Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB). As there were no dep files during the first build, Make may kick off source compilation before nccl.h gets generated, which leads to occasional build failures on systems with high core count. The build failure could be reproduced reliably with a `sleep 5` in the $(INCDIR)/nccl.h rule.
* 2.7.8-1: Fix collective mismatch error when using ncclSend/ncclRecv
* Don't require NIC devices to have a specific PCI class. If a PCI node is the parent of a NIC, treat it as such, regardless of the PCI class code for the device. This allows non-traditional devices to act as NICs via the net plugin mechanism. For consistency, treat GPUs similarly.
* Set type when GPU sub-node is discovered
* Make sure proxy threads inherit the CPU affinity.
* Fix affinity move
* fix proxyArgs for trace log
* 2.8.3-1: Optimization for Tree allreduce on A100. Improve aggregation performance. Use shared buffers for inter-node send/recv. Add NVTX profiling hooks. Accelerate alltoall connections by merging communication for all channels. Add support for one-hop communication through NVLink, for faster send/recv communication on cubemesh topologies like DGX-1. Improve alltoall scheduling to better balance intra/inter-node communication. Increase send/recv parallelism by 8x, each warp sending or receiving to a different peer. Net: move to v4. Net: make flush operation asynchronous to accelerate alltoall. Net: define maximum number of requests. Fix hang when using LL128 protocol after 2^31 steps. Fix #379: topology injection failing when using fewer GPUs than described in the XML. Fix #394: protocol mismatch causing hangs or crashes when using one GPU per node.
* x86: Add CPU detection for Zhaoxin processors. Signed-off-by: Jonas Zhou <[email protected]>
* 2.8.4-1: Fix hang in corner cases of alltoallv using point-to-point send/recv. Harmonize error messages. Fix missing NVTX section in the license. Update README.
* 2.9.6-1: Add support for CUDA graphs. Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms (issue #439). Fix bootstrap issue caused by connection reordering. Fix CPU locking block. Improve CollNet algorithm. Improve performance on DGX A100 for communicators with only one GPU per node.
* 2.9.8-1: Fix memory leaks. Fix crash in bootstrap error case. Fix CollNet clean-up issue. Make PCI switch vendor/device optional for XML injection. Add support for nvidia-peermem module.
* 2.9.9-1: Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels. Fix hang during sendrecv dynamic NVB connection establishment on cubemesh topologies. Add environment variable to only use SHARP on communicators beyond a given number of ranks. Add debug subsystem to trace memory allocations. Fix compilation with TRACE=1 (issue #505).
* 2.10.3-1: Add support for bfloat16. Add ncclAvg reduction operation. Improve performance for aggregated operations. Improve performance for tree. Improve network error reporting. Add NCCL_NET parameter to force a specific network. Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs. Fix topology detection error in WSL2. Fix proxy memory elements affinity (improves alltoall performance). Fix graph search on cubemesh topologies. Fix hang in cubemesh during NVB connections.
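The 2.10.3-1 entry just above adds the ncclAvg reduction; it is used like any other built-in op. A hedged usage sketch, assuming an already-initialized communicator and stream:

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Average 'count' floats across all ranks in one call: NCCL sums the
 * inputs and divides by the number of ranks (ncclAvg, NCCL >= 2.10). */
ncclResult_t allreduceAvg(const float* sendbuf, float* recvbuf, size_t count,
                          ncclComm_t comm, cudaStream_t stream) {
  return ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclAvg, comm, stream);
}
```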
* Fix https://github.com/NVIDIA/nccl/issues/560: ncclGroups containing operations of mixed datatype, element count, or collective would induce a crash.
* 2.11.4-1: Add a new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see ncclRedOpCreatePreMulSum; sketch below). Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration. Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow the user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so". Fix memory leak of NVB connections. Fix topology detection of IB Virtual Functions (SR-IOV).
* Fix CollNet when GDR is disabled
* Fix compilation failure in "src/enqueue.cc" on older GCC because of missing `#include <cstring>`.
* Perform `busIdToInt64` on the stack. I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is called thousands of times, making the log output unintelligible. Fortunately, this function can be implemented without heap allocations.
* Improve warning message about truncated messages. Display hints about the cause so that it is easier for the user to debug. Also change the error type from InternalError to InvalidUsage, as most of the time this is caused by a mismatch in collective size or env settings.
* Add env NCCL_NET_DISABLE_INTRA. Setting it to 1 disables the NET transport for intra-node communication. It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable.
* Build fastsocket plugin from ext-net
* remove unused basePath
* Revert "remove unused basePath". This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda.
* Fix ext-net/google-fastsocket build
* Split IB parameter sanity check into two parts: first part on collective mismatch, second part on internal errors.
* 2.12.7-1: Add network communication through another GPU connected with NVLink (PXN). Add aggregation of messages coming from different local GPUs through PXN and going to the same destination. Add new v5 plugin API with grouped receives and tags. Add compat for v4 plugins. Add naming of NCCL threads to help debugging. Fix NVLink detection and avoid data corruption when some NVLinks are down. Add support for Relaxed Ordering for IB. Add profiling and timing infrastructure.
* Add pthread_detach()'s for threads we never pthread_join(). Helps reduce diagnostic noise for ThreadSanitizer. Fixes https://github.com/NVIDIA/nccl/issues/649.
* Remove unnecessary newline in plugin logging. Signed-off-by: Felix Abecassis <[email protected]>
* Fix typo in net_ib.cc
* Display host name instead of numeric IP when referring to a peer, for easier interpretation of debug messages like "connection closed by peer", "peer message truncated" and "peer collective mismatch".
* Fix merging error
* 2.12.10-1: Fix bug with CollNet. Fix bug with zero-byte send/recv operations. Fix NCCL_PARAM implementation to avoid taking a lock on every call. Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one. Improve error reporting for network errors.
* Update Makefile to install static library. Make sure make install also installs the static library. Fixes #662.
* 2.12.12-1: Improve allreduce performance when we have more than one network interface per GPU and we need to use PXN to close rings. Add support for PCI Gen5 on 5.4 kernels. Fix crash when setting NCCL_SET_THREAD_NAME. Fix random crash in init due to uninitialized struct. Fix hang on cubemesh topologies. Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a process.
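The 2.11.4-1 entry above introduces ncclRedOpCreatePreMulSum, which scales each rank's contribution by a scalar before the inter-rank sum. A usage sketch with a host-resident scalar; error handling is trimmed:

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Allreduce that computes sum_over_ranks(scale * input): create the
 * custom op, enqueue the collective, then release the op. */
ncclResult_t scaledAllreduce(const float* sendbuf, float* recvbuf, size_t count,
                             float scale, ncclComm_t comm, cudaStream_t stream) {
  ncclRedOp_t op;
  /* ncclScalarHostImmediate: the scalar is read from host memory at call time. */
  ncclResult_t res = ncclRedOpCreatePreMulSum(&op, &scale, ncclFloat,
                                              ncclScalarHostImmediate, comm);
  if (res != ncclSuccess) return res;
  res = ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, op, comm, stream);
  ncclRedOpDestroy(op, comm); /* safe once the operation is enqueued */
  return res;
}
```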
* 2.13.4-1: Optimize CUDA graph launch; avoid launching a CPU callback for intra-node operations. Simplify kernel common code to improve the latency of send/recv operations. Strengthen CUDA streams semantics. Change NET API to v6, to add dmabuf support. Add ncclGetLastError() function. Add ncclRemoteError code and use it for remote network errors. Support the use of a different NCCL_NET parameter per communicator. Add support for SHM and P2P transfers using cudaMemcpy.
* fix NCCL_DEBUG_FILE. Summary: NCCL_DEBUG_FILE does not work properly since the v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves parsing `tempNcclDebugLevel` before processing `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior). Differential Revision: D38415208 fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
* 2.14.3-1: Add support for improved fault tolerance: non-blocking mode, new init function with config, and ncclCommFinalize function (sketch below). Reintroduce collnet+chain algorithm, alongside collnet+direct. Add LL protocol for intra-node P2P (on by default) and network communication (off by default). Use network instead of shared memory when performance is better. Fix: wait for CUDA graph destroy before destroying comm with linked graph resources. Remove aggressive polling during enqueue. Fix DMABUF fallback on MOFED 5.4 and earlier.
* Fix intermittent 11.6 builds: generate unique .cu file for each object file
* address review comments
* Fix potential deadlock during init in multi-thread mode. Make sure all calls to cudaMalloc (including devCommSetup) happen before the last bootstrapBarrier. That way, we avoid calls to cudaMalloc being blocked by a NCCL kernel launched on another GPU by another thread which completed init faster. Resolves #623.
* Use compatibility shim only with static cudart. Closes issue 658.
* 2.15.1-1: Add support for H100 (sm90). Make sure NCCL kernels honor user stream priorities.
* Fix a double-free in the error path of ncclCommInitAll. Fixes https://github.com/NVIDIA/nccl/issues/726.
* 2.15.5-1: Fix crash with CollnetChain on some node topologies. Fix hang when interleaving the capture of different graphs. Fix hang during init in multi-threaded mode. Fix potential data corruption with LL128 protocol on unaligned buffers. Fix CPU usage during preconnect. Fix double-free in the error path for ncclCommInitAll. Work around hang on H100 with Ring/LL128 on 2 GPUs.
* Add documentation for NCCL NET plugins. Also repurpose the dummy plugin as an example, including headers and compat layers from v6 to v2.
* Fix google-fastsocket plugin build
* 2.16.2-1: Add support for CUDA 12.0, drop Kepler (sm_35). Support for H100 features. Make socket code more robust and protected (solves #555). Improve performance on large CUDA graphs, reducing dependencies. Reduce inter-socket bandwidth on AMD CPUs to favor better paths. Various fixes to ncclCommAbort. Make service thread polling resistant to EINTR. Compile with profiling API by default. Extend NVTX instrumentation with call arguments.
* Fix maximum handle size for NCCL Net v4 API. NCCL Net v4 supports a maximum handle size of 64 bytes, whereas the ext-net example header files set it for NCCL Net v3. Since the `aws-ofi-nccl` plugin plans to follow the example header files, fix it here. Signed-off-by: Rashika Kheria <[email protected]>
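The 2.14.3-1 fault-tolerance entry pairs a non-blocking config with ncclCommGetAsyncError() polling. A sketch of the pattern; the busy-wait is kept deliberately simple:

```c
#include <nccl.h>

/* Initialize a communicator without blocking the calling thread, then
 * poll until background init completes or reports an error. */
ncclResult_t initNonBlocking(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0; /* calls may return ncclInProgress instead of blocking */
  ncclResult_t res = ncclCommInitRankConfig(comm, nranks, id, rank, &config);
  if (res != ncclSuccess && res != ncclInProgress) return res;
  do {
    ncclCommGetAsyncError(*comm, &res);
  } while (res == ncclInProgress); /* a real caller might also time out and abort */
  return res;
}
```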
* 2.16.5-1: Add support for 400Gbit NDR network adapters (CX7). Handle EINTR in socket poll() function. Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead. Resource cleanup fixes. Fix double free in case of init failure. Fix crash in ncclCommAbort. Revert AMD speed commit.
* 2.17.1-1: Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only). Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName. Enable LL128 when we use PXN to close rings. NVTX3 includes update. Fix crash when one CollNet (SHARP) rail fails to initialize.
* Shutdown socket before close in ncclSocketClose()
* Add a comment to shutdown() in ncclSocketClose
* 2.18.1-1: Add support for IB SHARP to NVLS (NVLink SHARP algorithm). Add NVLS+Tree algorithm. Add support for memory management using cuMem* functions. Use all NICs for Send/Receive operations on systems with more than one NIC per GPU (#804). Add ncclCommSplit primitive, with a resource sharing option in config (sketch below). Fix alltoallv hang (#788). Increase number of channels on H100 when we're not limited by NVLink. Improve error reporting in case of IB failure, printing local and remote ID (#779). Add build option to allow compilation against RDMA includes instead of dynamically loading IB verbs symbols (#802). Fix context creation for progress thread (#803). NET/IB: add option to use multiple QPs in round-robin mode. Fix tree performance issue when NVB is disabled on HCM topologies.
* initial checkin
* fix the build issue when the cuda version is larger than 12
* enable ncv4 test scenarios
* 2.18.3-1: Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC. Fix hang with CollNet on bfloat16 on systems with less than one NIC per GPU. Fix long initialization time. Fix data corruption with CollNet when mixing multi-process and multi-GPU per process. Fix crash when shared memory creation fails. Fix Avg operation with CollNet/Chain. Fix performance of alltoall at scale with more than one NIC per GPU. Fix performance for DGX H800. Fix race condition in connection progress causing a crash. Fix network flush with CollNet. Fix performance of aggregated allGather/reduceScatter operations. Fix PXN operation when CUDA_VISIBLE_DEVICES is set. Fix NVTX3 compilation issues on Debian 10.
* modify the test script to support training and inference test scenarios
* Prevent WR index truncation in the InfiniBand transport plugin
* fix build break of previous FI
* remove test related assert
* Create msccl-algorithms folder by default
* Create msccl-algorithms folder by default (#11)
* enable make install & deb package for msccl (#13)
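The 2.18.1-1 entry above adds ncclCommSplit: ranks passing the same color end up in the same new communicator, ordered by key. A sketch that splits a communicator into per-node groups, assuming the caller already knows its nodeId:

```c
#include <nccl.h>

/* Give every node its own communicator: ranks sharing 'nodeId' are
 * grouped together, ordered within the group by their original rank. */
ncclResult_t splitByNode(ncclComm_t comm, int nodeId, int rank, ncclComm_t* nodeComm) {
  /* Passing NULL for the config inherits the parent's settings. */
  return ncclCommSplit(comm, /*color=*/nodeId, /*key=*/rank, nodeComm, NULL);
}
```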
* Enable msccl capability (#1)
* porting from rccl: Add MSCCL Support #658
* continue porting for #658, add misc/msccl
* porting from rccl: Add MSCCL Support #694
* remove unnecessary change-history comments
* resolve the build related issue
* enable msccl when generating the topo info during initialization
* fix nccl BF check issue
* fix the memory alignment issue for the LL proto
* add fp8 support
* add alltoall interface
* add multi-memory support for fp8
* fix the test script bug which failed to generate the algo file in certain conditions
* fix the simple protocol thread count issue and improve the test tool with operator support
* fix test script issue
* fix the memory conflict issue in the simple protocol
* fix the LL128 shared memory issue and the 1-process-with-multiple-GPUs issue
* remove the unnecessary code to enable cuda graph and add perf test scenarios
* fix fp8 issue
* fix max/min mismatch issue and LL128 shared memory issue
* turn off the ops when the op index is equal to or larger than avg
* optimize the test script to accommodate the GPU numbers
* support perf test env initialization
* fix environment prepare bug
* fix environment prepare bug
* enable auto build if nccl was not built before
* add customized topo and graph file support during test
* enable ndv5 test scenarios
* enable cuda graph for multiple nodes
* initiate the multi-node test
* enable ncv4 compatibility
* fix multi-node test issue
* enable multi-node for ndv4 test scenario
* fix IB bandwidth test case issue
* fix the fence, proxy, and sync flags setting related issues
* unified the topo file name for different SKUs
* add vmss creation script
* fix test case issue for multi-node scenario
* change the algo for multi-node scenario
* change the maxbyte to a smaller value like 65565 for the multi-node scenario
---------
Signed-off-by: Felix Abecassis <[email protected]>
Signed-off-by: Rashika Kheria <[email protected]>
Signed-off-by: Jonas Zhou <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Boris Fomitchev <[email protected]>
Co-authored-by: Pau Farré <[email protected]>
Co-authored-by: Adam Paszke <[email protected]>
Co-authored-by: Sylvain Jeaugey <[email protected]>
Co-authored-by: Nathan Luehr <[email protected]>
Co-authored-by: jiakai <[email protected]>
Co-authored-by: Cliff Woolley <[email protected]>
Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]>
Co-authored-by: Peter Jin <[email protected]>
Co-authored-by: Chad Whipkey <[email protected]>
Co-authored-by: Ilya Biryukov <[email protected]>
Co-authored-by: sclarkson <[email protected]>
Co-authored-by: Obihörnchen <[email protected]>
Co-authored-by: David Addison <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Alex Sergeev <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Christian Sigg <[email protected]>
Co-authored-by: Rong Ou <[email protected]>
Co-authored-by: Cao Zongyan <[email protected]>
Co-authored-by: Gustavo Alvarez <[email protected]>
Co-authored-by: jakirkham <[email protected]>
Co-authored-by: Rajat Chopra <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Hirochika Asai <[email protected]>
Co-authored-by: Luke Yeager <[email protected]>
Co-authored-by: Rashika Kheria <[email protected]>
Co-authored-by: aokomoriuta <[email protected]>
Co-authored-by: Riatre Foo <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Jack Snyder <[email protected]>
Co-authored-by: xietingwew <[email protected]>
Co-authored-by: Jonas Zhou <[email protected]>
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>
* Enable msccl capability (#5)
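The "memory alignment issue for the LL protocol" entry is easier to follow with the LL wire format in mind: each 8-byte store carries a 4-byte data word plus a 4-byte flag, so the receiver can poll the flag and trust the data arrived with it, and that only holds when buffers are 8-byte aligned. A minimal host-side sketch of the idea, assuming a little-endian layout (the union mirrors NCCL's LL line concept, not its exact definition):

```c
#include <stdint.h>
#include <stdio.h>

/* Conceptual LL (low-latency) line: data words interleaved with flags,
 * published with single 8-byte stores. The real device code uses
 * volatile 8-byte loads/stores; the 8-byte atomicity of those stores
 * is exactly why buffer alignment matters. */
typedef union {
  struct { uint32_t data1, flag1, data2, flag2; } w;
  uint64_t v[2];
} LLLine;

int main(void) {
  LLLine line;
  uint32_t data = 0xdeadbeefu, flag = 7;
  line.v[0] = ((uint64_t)flag << 32) | data; /* data+flag in one store */
  printf("data=0x%08x flag=%u\n", line.w.data1, line.w.flag1);
  return 0;
}
```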
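Several of the test entries above revolve around capturing collectives in CUDA graphs. For reference, the standard capture-and-replay pattern looks like the following hedged sketch (single-GPU communicator for brevity, error checks elided, cudaGraphInstantiate shown with its CUDA 12 signature):

```c
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  int dev = 0;
  ncclComm_t comm;
  ncclCommInitAll(&comm, 1, &dev);        /* trivial 1-GPU communicator */
  cudaSetDevice(dev);

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  float* buf;
  cudaMalloc(&buf, 1024 * sizeof(float));

  /* Capture the collective once into a graph... */
  cudaGraph_t graph;
  cudaGraphExec_t exec;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  ncclAllReduce(buf, buf, 1024, ncclFloat, ncclSum, comm, stream);
  cudaStreamEndCapture(stream, &graph);
  cudaGraphInstantiate(&exec, graph, 0);  /* CUDA 12 signature */

  /* ...then replay it with near-zero launch overhead, which is what
   * the perf test scenarios measure. */
  for (int i = 0; i < 100; i++) cudaGraphLaunch(exec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaFree(buf);
  cudaStreamDestroy(stream);
  ncclCommDestroy(comm);
  return 0;
}
```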
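And since the pull request itself makes the msccl-algorithms folder exist by default, a rough illustration of that kind of change is below; the path and helper name are placeholders, not the library's actual ones:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: ensure the algorithm directory exists before
 * MSCCL tries to load algorithm files from it, instead of failing
 * when the directory is absent. The path is illustrative only. */
static int ensureAlgoDir(const char* path) {
  struct stat st;
  if (stat(path, &st) == 0)
    return S_ISDIR(st.st_mode) ? 0 : -1;  /* exists but not a directory */
  if (mkdir(path, 0755) != 0 && errno != EEXIST) {
    fprintf(stderr, "mkdir(%s) failed: %s\n", path, strerror(errno));
    return -1;
  }
  return 0;
}

int main(void) {
  return ensureAlgoDir("./msccl-algorithms") == 0 ? 0 : 1;
}
```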
Andyli1007 added a commit that referenced this pull request Sep 11, 2023
* Fixed deadlock in back-to-back reduce_scatters. Change-Id: I92d32b15e516a39710b676aee692ae9b70638937 Reviewed-on: http://git-master/r/935458 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]> * Added support for more than 8 GPUs. Change-Id: Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad Reviewed-on: http://git-master/r/935459 Reviewed-by: Cliff Woolley <[email protected]> Tested-by: Przemek Tredak <[email protected]> * Moved tests to separate dir and improved MPI test test sources moved to test/ directory. MPI test displays PASS/FAIL and returns code accordingly. Change-Id: I058ebd1bd5202d8f38cc9787898b2480100c102b Reviewed-on: http://git-master/r/936086 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]> * Enabled support for char type to be unsigned. GCC on POWER arch defines char type as unsigned. Change-Id: Ic143cb058fe42414b1f6f1f45b02132c837726ae Reviewed-on: http://git-master/r/999614 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]> * Added NCCL error checking to tests. Also cleaned up makefile so that tests and lib are not built unnecessarily. Change-Id: Ia0c596cc2213628de2f066be97615c09bb1bb262 Reviewed-on: http://git-master/r/999627 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]> * Libwrap checks for LIB.so.1 if LIB.so not found Change-Id: I6f07f887f828cb2259dcfd496a2ad707db898cf5 Reviewed-on: http://git-master/r/1000162 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]> * Fixed buffer overflow in ReduceOrCopy Bug caused AllGathers and ReduceScatters of less than 8 bytes to fail in certain cases. Change-Id: I33e1beb50805bfdb457ae16a90e3f91c1b283b9b Reviewed-on: http://git-master/r/1011505 Reviewed-by: Przemek Tredak <[email protected]> Tested-by: Przemek Tredak <[email protected]> * Fixed useRemoteRecv consistency issue. Change-Id: Ib093a8dc3bb093eddc89dad81d3fffa53c03a6a2 Reviewed-on: http://git-master/r/1013543 Reviewed-by: Cliff Woolley <[email protected]> Tested-by: Przemek Tredak <[email protected]> * Updated package version, added manpage * Moved release files to proper area Bumping a version; building for 7.5 * Moved to pbuilder * Preparing for pbuild * Added compute 5.3 * Added files via upload * Delete libnccl-dev_1.1.1+cuda75_amd64.deb * Delete libnccl1_1.1.1+cuda75_amd64.deb * Use arch=5.3 as well * Version with . 7.5 * fixed version format * Removing Tegra * Enable compilation with old g++ when the default g++ is not supported (+5.0) * Add --no-as-needed to make sure that cudart library gets liked * Fix MPI test usage Only display usage from rank 0 and exit instead of continuing (and seg fault). * Fix random deadlock during ncclCommInitRank. * Fix readme to reflect the new test paths * Moved no-as-needed flag to link rule. Avoids link errors for tests linked with nvcc. * Fixed install location, new .deb version * Fixed version in ChangeLog * Makefile improvements - Use standard CXX env var - Permit redefinition of more env - Separate lib from tests * Removing unneeded includes * Better name for GENCODE * Bump to 1.2.2 * Gencodes changed to NV recommended * Changed CURAND generator to work on a wider set of platforms. * Make NCCL collectives work on communicators with only one rank * Only call the CUDA runtime. That may fix #27. 
* Updating for .deb rebuild * Include link to blog post in README.md * Rework debian packaging * Fix make install to use BUILDDIR * Move deb to build directory * Packaging : Generate shlibs.local * Increased version to 1.2.3 * Add a debug level to NCCL and CUDA versions at init * Fix version number * Improved Deb generation * Fixed redundant contexts in multi-process apps Change-Id: If787014450fd281304f0c7baf01d25963e40905d * Remove unneeded deb build script * link library with -lrt; otherwise there is undefined reference to shm_open * pass devlist as const int* rather than int* in ncclCommInitAll * Updated LICENCE.txt * Update LICENSE.txt * Fix MPI test path * Add profiling API * Heavy code refactoring to remove a lot of code in collectives (~1000 lines). Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern. * Make tests check for deltas and report bandwidth * Add scan tests * Improved allreduce segmentation for small sizes * NVML (libwrap) : import the needed definitions * Fix primitives function prototype * Bump to 1.3.1 * Add Fortran bindings * Add Copyright header to Fortran bindings source files * Remove irrelevant output from ncclReduce Fortran tests * Add a static library target "staticlib" to the Makefile. Rename the static library "libnccl_static.a" to disambiguate from the dynamic libraries. * Replace min BW by average BW in tests * 1.3.2 release Broadcast tuning Better checking of inputs Copy/reduce code simplification * Adding missing file * Fix 1.3.2 compilation * Qualify nullptr_t with std::. * Fix crash in Reduce when non-root ranks have invalid recvbuff * Fix copy/paste typo in error message * Only enable peer access for ring neighbors. This enables support for systems with more than 9 GPUs attached to a single PCIe root complex. * Bumping version to 1.3.3 * Fix compilation error when compiling with 'clang -x cuda'. Functions vFetch and vStore are not found by ADL with clang, so they need to be declared before usage in ReduceCopy. * Added Pascal nvcc flags, bumped version * Add support for CUDA9 half semantics * Update README to link to NCCL2 * Update README to link to NCCL2 #2 * Update README to link to NCCL2 part 3 * Update README to link to NCCL2 * fix tests on maxwell * 2.3.5-5 Add support for inter-node communication using sockets and InfiniBand/RoCE. Improve latency. Add support for aggregation. Improve LL/regular tuning. Remove tests as those are now at github.com/nvidia/nccl-tests . * Fix nccl-tests all_reduce_perf path It's `all_reduce_perf` not `allreduce_perf` * 2.3.7-1 Improved LL tuning for multi-node jobs. Improved bootstrap for large job scaling. Fixed a hang during bootstrap due to socket reuse. Added operation name to the COLL INFO logging. * Add install target Fix issue #145 * Add instructions to install packaging toolchain Address #143 and #150 : debuild not installed. * Add official builds download link * Generate nccl.h in build instead of src Generating nccl.h in src makes source directories dirty after builds. * Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156) * Add support for external network. Dynamically load external network from libnccl-net.so. Add init function in networks. Move PCI scoring to net.cu, only ask transport to provide a path. Simplify CUDA PCI path detection. 
Add dummy external network * Make network isend/irecv non blocking * Improve net API description * Rework SYSCHECK macros to better handle retries. SYSCHECKVAL was not retrying when a retry was needed. Since not all calls are inside a loop, that means we could silently miss an EINTR/EAGAIN return code. Also rework the socket connection code and improve error reporting. * Rework shared memory code to use SYSCHECK macros. This is to handle EINTR/EGAIN properly (issue #137), and also make the code consistent with the rest. Unfortunately posix_fallocate and mmap do not follow the classic return code/errno pattern, so we need to write wrappers around those functions. * Fixed some compilation errors when TRACE=1 set * Improve INFO message when external network is not found. Fix #162 * Add NCCL_NET flag to many debug lines. * Fix GPU Direct RDMA detection. Whether the network supported GPU Direct RDMA or not was ignored, causing sockets to break when cards were local enough that NCCL tried to use it. * Remove error logging from a normal path When initNet fails, we should not print the backtrace as it is supposed to be normal operation (falling back to sockets) * Fix dummy plugin * Fix #163 : remove warnings * Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former. * Two temporary workarounds for cuda-clang issues. * Qualify nullptr_t with std:: * Replace CUDA_VERSION by CUDART_VERSION * Fix memory leak in bootstrapRoot() * 2.4.2-1 Add tree algorithms for allreduce to improve performance at scale. Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle network errors and be permit recover. Detect initial CPU affinity and no longer escape it. * Fix crash during shared memory creation (#185) The shared memory filename was only based on the destination. While this was OK for rings since only one rank would send data to a given rank, it would crash with trees because they communicate in both directions. Co-authored-by: Rong Ou <[email protected]> * Fix share memory collision in multi-communicator case. Current SHM object name would only use pidHash and ranks as identification, which would collide each other when program runs with multiple communicators. Here we added commId info into pidHash, it makes 'pidHash'es of different communicators keeping in same process will be distincted with each other. * NCCL 2.4.6-1 Added detection of IBM/Power NVLink bridge device. Add NUMA support to PCI distance calculations. Added NCCL_IGNORE_CPU_AFFINITY env var. Fix memory leaks; GithubIssue#180 Compiler warning fix; GithubIssue#178 Replace non-standard variable length arrays. GithubIssue#171 Fix Tree+Shared Memory crash. GithubPR#185 Fix LL cleanup hang during long running DL jobs. Fix NCCL_RINGS environment variable handling. Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191 Improve bootstrap socket connection reliability at scale. Fix hostname hashing issue. GithubIssue#187 Code cleanup to rename all non device files from *.cu to *.cc * Add pkgconfig file (#190) * Allow CUDA runtime library selection (#220) Makes a change to allow the user to select between the static CUDA runtime library (default) and the dynamic CUDA runtime library. Does this by allowing `CUDARTLIB` to be overridden. 
* NCCL 2.4.7-1 Performance tweaks for PowerPC builds only; Set default NCCL_MIN_NRINGS to 4 Disable PCI-E NUMA distance detection * Update debian dependencies in README (#228) 'fakeroot' is needed for building deb packages * Fix out-of-bounds read in ncclStrToCpuset (#233) The affinityStr string was not null-terminated but was passed to strlen(3). Signed-off-by: Felix Abecassis <[email protected]> * 2.4.8-1 Fix #209: improve socket transport performance Split transfers over multiple sockets Launch multiple threads to drive sockets Detect AWS NICs and set nsockets/nthreads accordingly * Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236) Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified. * Size up IPC buffers to multiples of 2MB Avoid potential CUDA error in concurrent communicator initialization * Fix #224: prevent number of IB devices from going out of bound * Fix NIC distances for 11+ NICs * Refine RPM package building spec file. Add /sbin/ldconfig into RPM package install operations. * Make use of SO_REUSEPORT conditional Fixes: #244 SO_RESUEPORT was introduced in Linux 3.9 and later. This change allows NCCL to compile against older releases. The functionality is only required if the user is specifying a NCCL bootstrap address via an environment variable. * Updated PR#196 to use a common hash function * 2.5.6-1 (#255) Add LL128 Protocol. Rewrite the topology detection and tree/ring creation (#179). Improve tree performance by sending/receiving from different GPUs. Add model-based tuning to switch between the different algorithms and protocols. Rework P2P/SHM detection in containers (#155, #248). Detect duplicated devices and return an error (#231). Add tuning for GCP * Fix clang build (#271) Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro. * Fix clang compilation * 2.5.6-2 Fix PPC64 Debian packaging * Fix clang build (#274) The attribute is called `optnone`, not `noopt`. * [build] Allow setting CXXFLAGS on the command line * [topology] remove NET links when trimming system This fixes a memory leak. * 2.5.7-1 * Fix Allgather operations above 4G with multiple GPUs per process. Fixes nccl-tests#37. Direct offsets were still on 32 bits in the low-level primitives. * Check return code for Flush operation Current NCCL code does not abort for failed Flush operations by underlying network. This may compromise data integrity. Signed-off-by: Rashika Kheria <[email protected]> * 2.6.4-1 Add support for network collectives. Add support for XML topology dump/injection. Add text values for GDR and P2P Levels, including "NVL". Add speed detection for PCI, Infiniband and Ethernet cards. Add CPU detection for ARM and AMD CPUs. Add support for adaptive routing on Infiniband. Change NET plugin API to v3 : merge PCI path and GPU pointer capability into a single structure and add other properties. * Fix bug #307 : wrong NIC selection on the reduction tree. The reduction tree (tree up) was inverting the NICs to use, causing performance issue in cases where we are using different NICs on a given channel. * Fix wrong variable name "slice" to "chunk" https://github.com/NVIDIA/nccl/issues/287 * Improve robustness of PCI detection Fallback to default values when class/speed is unknown. * Fix crash when only a subset of GPUs are visible within a container. Fixes #326. * 2.7.3-1 Add support for A100 GPU and related platforms. 
Add support for CUDA 11. Add support for send/receive operations (beta). * 2.7.5-1 Minor fixes for A100 platforms. Add a WARN for invalid GroupEnd call. * 2.7.6-1 Fix crash when NVswitch is not visible inside a VM. * Fix build action order Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB). As there were no dep files during the first build, Make may kick off source compilation before nccl.h got generated, which leads to occasional build failures on systems with high core count. The build failure could be reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule. * 2.7.8-1 Fix collective mismatch error when using ncclSend/ncclRecv * Don't require NIC devices to have specific PCI class If a PCI node is the parent of a NIC, treat it as such, regardless of the PCI class code for the device. This allows non-traditional devices to act as NICs via the net plugin mechanism. For consistency, treat GPUs similarly. * Setting type when gpu sub node is discovered * Make sure proxy threads inherit the CPU affinity. * Fix affinity move * fix proxyArgs for trace log * 2.8.3-1 Optimization for Tree allreduce on A100. Improve aggregation performance. Use shared buffers for inter-node send/recv. Add NVTX profiling hooks. Accelerate alltoall connections by merging communication for all channels. Add support for one hop communication through NVLink, for faster send/recv communication on cubemesh topologies like DGX-1. Improve alltoall scheduling to better balance intra/inter node communication. Increase send/recv parallelism by 8x, each warp sending or receiving to a different peer. Net: move to v4. Net: make flush operation asynchronous to accelerate alltoall. Net: define maximum number of requests. Fix hang when using LL128 protocol after 2^31 steps. Fix #379 : topology injection failing when using less GPUs than described in the XML. Fix #394 : protocol mismatch causing hangs or crashes when using one GPU per node. * x86: Add CPU detection for Zhaoxin processors Signed-off-by: Jonas Zhou <[email protected]> * 2.8.4-1 Fix hang in corner cases of alltoallv using point to point send/recv. Harmonize error messages. Fix missing NVTX section in the license. Update README. * 2.9.6-1 Add support for CUDA graphs. Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439. Fix bootstrap issue caused by connection reordering. Fix CPU locking block. Improve CollNet algorithm. Improve performance on DGX A100 for communicators with only one GPU per node. * 2.9.8-1 Fix memory leaks. Fix crash in bootstrap error case. Fix Collnet clean-up issue. Make PCI switch vendor/device optional for XML injection. Add support for nvidia-peermem module. * 2.9.9-1 Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels. Fix hang during sendrecv dynamic NVB connection establishment on cubemesh topologies. Add environment variable to only use SHARP on communicators beyond a given number of ranks. Add debug subsystem to trace memory allocations. Fix compilation with TRACE=1. (Issue #505) * 2.10.3-1 Add support for bfloat16. Add ncclAvg reduction operation. Improve performance for aggregated operations. Improve performance for tree. Improve network error reporting. Add NCCL_NET parameter to force a specific network. Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs. Fix topology detection error in WSL2. Fix proxy memory elements affinity (improve alltoall performance). Fix graph search on cubemesh topologies. Fix hang in cubemesh during NVB connections. 
* Fix to https://github.com/NVIDIA/nccl/issues/560 ncclGroup's containing operations of mixed datatype, element, or collective would induce crash. * 2.11.4-1 Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum). Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration. Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so". Fix memory leak of NVB connections. Fix topology detection of IB Virtual Functions (SR-IOV). * Fix Collnet when GDR is disabled * Fix compilation failure in "src/enqueue.cc" on older GCC because of missing `#include <cstring>`. * Perform `busIdToInt64` on the stack. I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is called thousands of times, making the log output unintelligible. Fortunately, this function can be implemented without heap allocations. * Improve warning message about truncated messages Display hints of cause so that it would be easier for user to debug. Also change the error type from InternalError to InvalidUsage as most of time this is caused by a mismatch in collective size or env settings. * Add env NCCL_NET_DISABLE_INTRA Disable NET transport for intra-node communication by setting the env to 1 It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable * Build fastsocket plugin from ext-net * remove unused basePath * Revert "remove unused basePath" This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda. * Fix ext-net/google-fastsocket build * Split IB parameter sanity check into two parts First part on collective mismatch, second part on internal errors * 2.12.7-1 Add network communication through another GPU connected with NVLink (PXN). Add aggregation of messages coming from different local GPUs through PXN and going to the same destination. Add new v5 plugin API with grouped receives and tags. Add compat for v4 plugins. Add naming of NCCL threads to help debugging. Fix NVLink detection and avoid data corruption when some NVLinks are down. Add support for Relaxed Ordering for IB. Add profiling and timing infrastructure. * Add pthread_detach()'s for threads we never pthread_join(). Helps reduce diagnostic noise for ThreadSanitizer. Fixes https://github.com/NVIDIA/nccl/issues/649 * Remove unnecessary newline in plugin logging Signed-off-by: Felix Abecassis <[email protected]> * Fix typo in net_ib.cc * Display host name instead of numeric IP when referring to a peer For easier interpretation of debug messages like "connection closed by peer", "peer message truncated" and "peer collective mismatch" * Fix merging error * 2.12.10-1 Fix bug with CollNet Fix bug with zero-bytes send/recv operations Fix NCCL_PARAM implementation to avoid taking a lock on every call Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one. Improve error reporting for network errors. * Update Makefile to install static library. Make sure make install also installs the static library. Fixes #662 * 2.12.12-1 Improve allreduce performance when we have more than one network interface per GPU and we need to use PXN to close rings. Add support for PCI Gen5 on 5.4 kernels. Fix crash when setting NCCL_SET_THREAD_NAME. Fix random crash in init due to uninitialized struct. Fix hang on cubemesh topologies. 
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a process. * 2.13.4-1 Optimize CUDA graph launch; avoid launching a CPU callback for intra-node operations. Simplify kernel common code to improve the latency of send/recv operations. Strengthen CUDA streams semantics. Change NET API to v6, to add dmabuf support. Add ncclGetLastError() function. Add ncclRemoteError code and use it for remote network errors. Support the use of a different NCCL_NET parameter per communicator. Add support for SHM and P2P transfers using cudaMemcpy. * fix NCCL_DEBUG_FILE Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it nows sets `ncclDebugLevel` after parse `NCCL_DEBUG_FILE`. This patch move parsing `tempNcclDebugLevel` before processing `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior) Differential Revision: D38415208 fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30 * 2.14.3-1 Add support for improved fault tolerance: non-blocking mode, new init function with config, and ncclCommFinalize function. Reintroduce collnet+chain algorithm, alongside collnet+direct. Add LL protocol for intra-node P2P (on by default) and network communication (off by default). Use network instead of shared memory when performance is better. Fix: wait for CUDA graph destroy before destroying comm with linked graph resources. Remove aggressive polling during enqueue. Fix DMABUF fallback on MOFED 5.4 and earlier. * Fix intermittent 11.6 builds: generate unique .cu file for each object file * address review comments * Fix potential deadlock during init in multi-thread mode. Make sure all calls calling cudaMalloc (including devCommSetup) are called before the last bootstrapBarrier. That way, we avoid calls to cudaMalloc be blocked by a NCCL kernel launched on another GPU by another thread which completed init faster. Resolve #623. * Use compatibility shim only with static cudart Closes issue 658 * 2.15.1-1 Add support for H100 (sm90). Make sure NCCL kernel honor user stream priorities. * Fixes a double-free in the error path of ncclCommInitAll. Fixes https://github.com/NVIDIA/nccl/issues/726 * 2.15.5-1 Fix crash with CollnetChain on some node topologies Fix hang when interleaving the capture of different graphs Fix hang during init in multi-threaded mode Fix potential data corruption with LL128 protocol on unaligned buffers. Fix CPU usage during preconnect Fixes double-free in the error path for ncclCommInitAll Workaround hang on H100 with Ring/LL128 on 2 GPUs. * Add documentation for NCCL NET plugins Also repurpose dummy plugin as example, including headers and compat layers from v6 to v2. * Fix google-fastsocket plugin build * 2.16.2-1 Add support for CUDA 12.0, drop Kepler (sm_35). Support for H100 features. Make socket code more robust and protected. Solves #555. Improve performance on large CUDA graphs, reducing dependencies. Reduce inter-socket bandwidth on AMD CPUs to favor better paths. Various fixes to ncclCommAbort. Make service thread polling resistant to EINTR. Compile with profiling API by default. Extend NVTX instrumentation with call arguments. * Fix maximum handle size for NCCL Net v4 API NCCL Net v4 supports a maximum handle size of 64 bytes whereas the ext-net example header files set it for NCCL Net v3. Since, `aws-ofi-nccl` plugin plans to follow the example header files, fix it here. 
Signed-off-by: Rashika Kheria <[email protected]> * 2.16.5-1 Add support for 400Gbit NDR network adapters (CX7) Handle EINTR in socket poll() function Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead Resource cleanup fixes Fix double free in case of init failure Fix crash in ncclCommAbort Revert AMD speed commit * 2.17.1-1 Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only). Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName. Enable LL128 when we use PXN to close rings. NVTX3 includes update. Fix crash when one CollNet (SHARP) rail fails to initialize. * Shutdown socket before close in ncclSocketClose() * Add a comment to shutdown() in ncclSocketClose * 2.18.1-1 Add support for IB SHARP to NVLS (NVLink SHARP algorithm). Add NVLS+Tree algorithm. Add support for memory management using cuMem* functions. Use all NICs for Send/Receive operations on systems with more than one NIC per GPU (#804). Add ncclCommSplit primitive, with resource sharing option in config. Fix alltoallv hang (#788) Increase number of channels on H100 when we're not limited by NVLink. Improve error reporting in case of IB failure, printing local and remote ID (#779). Add build option to allow compilation against RDMA includes instead of dynamically loading IB verbs symbols (#802). Fix context creation for progress thread (#803). NET/IB: add option to use multiple QPs in round-robin mode. Fix tree performance issue when NVB is disabled on HCM topologies. * initial checkin * fix the build issue when cuda version larger than 12 * enable ncv4 test scenarios * 2.18.3-1 Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC. Fix hang with Collnet on bfloat16 on systems with less than one NIC per GPU. Fix long initialization time. Fix data corruption with Collnet when mixing multi-process and multi-GPU per process. Fix crash when shared memory creation fails. Fix Avg operation with Collnet/Chain. Fix performance of alltoall at scale with more than one NIC per GPU. Fix performance for DGX H800. Fix race condition in connection progress causing a crash. Fix network flush with Collnet. Fix performance of aggregated allGather/reduceScatter operations. Fix PXN operation when CUDA_VISIBLE_DEVICES is set. Fix NVTX3 compilation issues on Debian 10. * modify the test script to support training and inference test scenarios * Prevent WR index truncation in the InfiniBand transport plugin * fix build break of previous FI * remove test related assert * Create msccl-algorithms folder by default * Create msccl-algorithms folder by default (#11) * enable make install & deb package for msccl (#13) * Enable msccl capability (#1) * Moved to pbuilder * Preparing for pbuild * Added compute 5.3 * Added files via upload * Delete libnccl-dev_1.1.1+cuda75_amd64.deb * Delete libnccl1_1.1.1+cuda75_amd64.deb * Use arch=5.3 as well * Version with . 7.5 * fixed version format * Removing Tegra * Enable compilation with old g++ when the default g++ is not supported (+5.0) * Add --no-as-needed to make sure that cudart library gets liked * Fix MPI test usage Only display usage from rank 0 and exit instead of continuing (and seg fault). * Fix random deadlock during ncclCommInitRank. * Fix readme to reflect the new test paths * Moved no-as-needed flag to link rule. Avoids link errors for tests linked with nvcc. 
* Fixed install location, new .deb version * Fixed version in ChangeLog * Makefile improvements - Use standard CXX env var - Permit redefinition of more env - Separate lib from tests * Removing unneeded includes * Better name for GENCODE * Bump to 1.2.2 * Gencodes changed to NV recommended * Changed CURAND generator to work on a wider set of platforms. * Make NCCL collectives work on communicators with only one rank * Only call the CUDA runtime. That may fix #27. * Updating for .deb rebuild * Include link to blog post in README.md * Rework debian packaging * Fix make install to use BUILDDIR * Move deb to build directory * Packaging : Generate shlibs.local * Increased version to 1.2.3 * Add a debug level to NCCL and CUDA versions at init * Fix version number * Improved Deb generation * Fixed redundant contexts in multi-process apps Change-Id: If787014450fd281304f0c7baf01d25963e40905d * Remove unneeded deb build script * link library with -lrt; otherwise there is undefined reference to shm_open * pass devlist as const int* rather than int* in ncclCommInitAll * Updated LICENCE.txt * Update LICENSE.txt * Fix MPI test path * Add profiling API * Heavy code refactoring to remove a lot of code in collectives (~1000 lines). Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern. * Make tests check for deltas and report bandwidth * Add scan tests * Improved allreduce segmentation for small sizes * NVML (libwrap) : import the needed definitions * Fix primitives function prototype * Bump to 1.3.1 * Add Fortran bindings * Add Copyright header to Fortran bindings source files * Remove irrelevant output from ncclReduce Fortran tests * Add a static library target "staticlib" to the Makefile. Rename the static library "libnccl_static.a" to disambiguate from the dynamic libraries. * Replace min BW by average BW in tests * 1.3.2 release Broadcast tuning Better checking of inputs Copy/reduce code simplification * Adding missing file * Fix 1.3.2 compilation * Qualify nullptr_t with std::. * Fix crash in Reduce when non-root ranks have invalid recvbuff * Fix copy/paste typo in error message * Only enable peer access for ring neighbors. This enables support for systems with more than 9 GPUs attached to a single PCIe root complex. * Bumping version to 1.3.3 * Fix compilation error when compiling with 'clang -x cuda'. Functions vFetch and vStore are not found by ADL with clang, so they need to be declared before usage in ReduceCopy. * Added Pascal nvcc flags, bumped version * Add support for CUDA9 half semantics * Update README to link to NCCL2 * Update README to link to NCCL2 #2 * Update README to link to NCCL2 part 3 * Update README to link to NCCL2 * fix tests on maxwell * 2.3.5-5 Add support for inter-node communication using sockets and InfiniBand/RoCE. Improve latency. Add support for aggregation. Improve LL/regular tuning. Remove tests as those are now at github.com/nvidia/nccl-tests . * Fix nccl-tests all_reduce_perf path It's `all_reduce_perf` not `allreduce_perf` * 2.3.7-1 Improved LL tuning for multi-node jobs. Improved bootstrap for large job scaling. Fixed a hang during bootstrap due to socket reuse. Added operation name to the COLL INFO logging. * Add install target Fix issue #145 * Add instructions to install packaging toolchain Address #143 and #150 : debuild not installed. 
* Add official builds download link * Generate nccl.h in build instead of src Generating nccl.h in src makes source directories dirty after builds. * Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156) * Add support for external network. Dynamically load external network from libnccl-net.so. Add init function in networks. Move PCI scoring to net.cu, only ask transport to provide a path. Simplify CUDA PCI path detection. Add dummy external network * Make network isend/irecv non blocking * Improve net API description * Rework SYSCHECK macros to better handle retries. SYSCHECKVAL was not retrying when a retry was needed. Since not all calls are inside a loop, that means we could silently miss an EINTR/EAGAIN return code. Also rework the socket connection code and improve error reporting. * Rework shared memory code to use SYSCHECK macros. This is to handle EINTR/EGAIN properly (issue #137), and also make the code consistent with the rest. Unfortunately posix_fallocate and mmap do not follow the classic return code/errno pattern, so we need to write wrappers around those functions. * Fixed some compilation errors when TRACE=1 set * Improve INFO message when external network is not found. Fix #162 * Add NCCL_NET flag to many debug lines. * Fix GPU Direct RDMA detection. Whether the network supported GPU Direct RDMA or not was ignored, causing sockets to break when cards were local enough that NCCL tried to use it. * Remove error logging from a normal path When initNet fails, we should not print the backtrace as it is supposed to be normal operation (falling back to sockets) * Fix dummy plugin * Fix #163 : remove warnings * Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former. * Two temporary workarounds for cuda-clang issues. * Qualify nullptr_t with std:: * Replace CUDA_VERSION by CUDART_VERSION * Fix memory leak in bootstrapRoot() * 2.4.2-1 Add tree algorithms for allreduce to improve performance at scale. Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle network errors and be permit recover. Detect initial CPU affinity and no longer escape it. * Fix crash during shared memory creation (#185) The shared memory filename was only based on the destination. While this was OK for rings since only one rank would send data to a given rank, it would crash with trees because they communicate in both directions. Co-authored-by: Rong Ou <[email protected]> * Fix share memory collision in multi-communicator case. Current SHM object name would only use pidHash and ranks as identification, which would collide each other when program runs with multiple communicators. Here we added commId info into pidHash, it makes 'pidHash'es of different communicators keeping in same process will be distincted with each other. * NCCL 2.4.6-1 Added detection of IBM/Power NVLink bridge device. Add NUMA support to PCI distance calculations. Added NCCL_IGNORE_CPU_AFFINITY env var. Fix memory leaks; GithubIssue#180 Compiler warning fix; GithubIssue#178 Replace non-standard variable length arrays. GithubIssue#171 Fix Tree+Shared Memory crash. GithubPR#185 Fix LL cleanup hang during long running DL jobs. Fix NCCL_RINGS environment variable handling. Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191 Improve bootstrap socket connection reliability at scale. Fix hostname hashing issue. 
GithubIssue#187 Code cleanup to rename all non device files from *.cu to *.cc * Add pkgconfig file (#190) * Allow CUDA runtime library selection (#220) Makes a change to allow the user to select between the static CUDA runtime library (default) and the dynamic CUDA runtime library. Does this by allowing `CUDARTLIB` to be overridden. * NCCL 2.4.7-1 Performance tweaks for PowerPC builds only; Set default NCCL_MIN_NRINGS to 4 Disable PCI-E NUMA distance detection * Update debian dependencies in README (#228) 'fakeroot' is needed for building deb packages * Fix out-of-bounds read in ncclStrToCpuset (#233) The affinityStr string was not null-terminated but was passed to strlen(3). Signed-off-by: Felix Abecassis <[email protected]> * 2.4.8-1 Fix #209: improve socket transport performance Split transfers over multiple sockets Launch multiple threads to drive sockets Detect AWS NICs and set nsockets/nthreads accordingly * Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236) Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified. * Size up IPC buffers to multiples of 2MB Avoid potential CUDA error in concurrent communicator initialization * Fix #224: prevent number of IB devices from going out of bound * Fix NIC distances for 11+ NICs * Refine RPM package building spec file. Add /sbin/ldconfig into RPM package install operations. * Make use of SO_REUSEPORT conditional Fixes: #244 SO_RESUEPORT was introduced in Linux 3.9 and later. This change allows NCCL to compile against older releases. The functionality is only required if the user is specifying a NCCL bootstrap address via an environment variable. * Updated PR#196 to use a common hash function * 2.5.6-1 (#255) Add LL128 Protocol. Rewrite the topology detection and tree/ring creation (#179). Improve tree performance by sending/receiving from different GPUs. Add model-based tuning to switch between the different algorithms and protocols. Rework P2P/SHM detection in containers (#155, #248). Detect duplicated devices and return an error (#231). Add tuning for GCP * Fix clang build (#271) Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro. * Fix clang compilation * 2.5.6-2 Fix PPC64 Debian packaging * Fix clang build (#274) The attribute is called `optnone`, not `noopt`. * [build] Allow setting CXXFLAGS on the command line * [topology] remove NET links when trimming system This fixes a memory leak. * 2.5.7-1 * Fix Allgather operations above 4G with multiple GPUs per process. Fixes nccl-tests#37. Direct offsets were still on 32 bits in the low-level primitives. * Check return code for Flush operation Current NCCL code does not abort for failed Flush operations by underlying network. This may compromise data integrity. Signed-off-by: Rashika Kheria <[email protected]> * 2.6.4-1 Add support for network collectives. Add support for XML topology dump/injection. Add text values for GDR and P2P Levels, including "NVL". Add speed detection for PCI, Infiniband and Ethernet cards. Add CPU detection for ARM and AMD CPUs. Add support for adaptive routing on Infiniband. Change NET plugin API to v3 : merge PCI path and GPU pointer capability into a single structure and add other properties. * Fix bug #307 : wrong NIC selection on the reduction tree. 
The reduction tree (tree up) was inverting the NICs to use, causing performance issue in cases where we are using different NICs on a given channel. * Fix wrong variable name "slice" to "chunk" https://github.com/NVIDIA/nccl/issues/287 * Improve robustness of PCI detection Fallback to default values when class/speed is unknown. * Fix crash when only a subset of GPUs are visible within a container. Fixes #326. * 2.7.3-1 Add support for A100 GPU and related platforms. Add support for CUDA 11. Add support for send/receive operations (beta). * 2.7.5-1 Minor fixes for A100 platforms. Add a WARN for invalid GroupEnd call. * 2.7.6-1 Fix crash when NVswitch is not visible inside a VM. * Fix build action order Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB). As there were no dep files during the first build, Make may kick off source compilation before nccl.h got generated, which leads to occasional build failures on systems with high core count. The build failure could be reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule. * 2.7.8-1 Fix collective mismatch error when using ncclSend/ncclRecv * Don't require NIC devices to have specific PCI class If a PCI node is the parent of a NIC, treat it as such, regardless of the PCI class code for the device. This allows non-traditional devices to act as NICs via the net plugin mechanism. For consistency, treat GPUs similarly. * Setting type when gpu sub node is discovered * Make sure proxy threads inherit the CPU affinity. * Fix affinity move * fix proxyArgs for trace log * 2.8.3-1 Optimization for Tree allreduce on A100. Improve aggregation performance. Use shared buffers for inter-node send/recv. Add NVTX profiling hooks. Accelerate alltoall connections by merging communication for all channels. Add support for one hop communication through NVLink, for faster send/recv communication on cubemesh topologies like DGX-1. Improve alltoall scheduling to better balance intra/inter node communication. Increase send/recv parallelism by 8x, each warp sending or receiving to a different peer. Net: move to v4. Net: make flush operation asynchronous to accelerate alltoall. Net: define maximum number of requests. Fix hang when using LL128 protocol after 2^31 steps. Fix #379 : topology injection failing when using less GPUs than described in the XML. Fix #394 : protocol mismatch causing hangs or crashes when using one GPU per node. * x86: Add CPU detection for Zhaoxin processors Signed-off-by: Jonas Zhou <[email protected]> * 2.8.4-1 Fix hang in corner cases of alltoallv using point to point send/recv. Harmonize error messages. Fix missing NVTX section in the license. Update README. * 2.9.6-1 Add support for CUDA graphs. Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439. Fix bootstrap issue caused by connection reordering. Fix CPU locking block. Improve CollNet algorithm. Improve performance on DGX A100 for communicators with only one GPU per node. * 2.9.8-1 Fix memory leaks. Fix crash in bootstrap error case. Fix Collnet clean-up issue. Make PCI switch vendor/device optional for XML injection. Add support for nvidia-peermem module. * 2.9.9-1 Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels. Fix hang during sendrecv dynamic NVB connection establishment on cubemesh topologies. Add environment variable to only use SHARP on communicators beyond a given number of ranks. Add debug subsystem to trace memory allocations. Fix compilation with TRACE=1. (Issue #505) * 2.10.3-1 Add support for bfloat16. 
Add ncclAvg reduction operation. Improve performance for aggregated operations. Improve performance for tree. Improve network error reporting. Add NCCL_NET parameter to force a specific network. Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs. Fix topology detection error in WSL2. Fix proxy memory elements affinity (improve alltoall performance). Fix graph search on cubemesh topologies. Fix hang in cubemesh during NVB connections. * Fix to https://github.com/NVIDIA/nccl/issues/560 ncclGroup's containing operations of mixed datatype, element, or collective would induce crash. * 2.11.4-1 Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum). Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration. Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so". Fix memory leak of NVB connections. Fix topology detection of IB Virtual Functions (SR-IOV). * Fix Collnet when GDR is disabled * Fix compilation failure in "src/enqueue.cc" on older GCC because of missing `#include <cstring>`. * Perform `busIdToInt64` on the stack. I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is called thousands of times, making the log output unintelligible. Fortunately, this function can be implemented without heap allocations. * Improve warning message about truncated messages Display hints of cause so that it would be easier for user to debug. Also change the error type from InternalError to InvalidUsage as most of time this is caused by a mismatch in collective size or env settings. * Add env NCCL_NET_DISABLE_INTRA Disable NET transport for intra-node communication by setting the env to 1 It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable * Build fastsocket plugin from ext-net * remove unused basePath * Revert "remove unused basePath" This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda. * Fix ext-net/google-fastsocket build * Split IB parameter sanity check into two parts First part on collective mismatch, second part on internal errors * 2.12.7-1 Add network communication through another GPU connected with NVLink (PXN). Add aggregation of messages coming from different local GPUs through PXN and going to the same destination. Add new v5 plugin API with grouped receives and tags. Add compat for v4 plugins. Add naming of NCCL threads to help debugging. Fix NVLink detection and avoid data corruption when some NVLinks are down. Add support for Relaxed Ordering for IB. Add profiling and timing infrastructure. * Add pthread_detach()'s for threads we never pthread_join(). Helps reduce diagnostic noise for ThreadSanitizer. 
Fixes https://github.com/NVIDIA/nccl/issues/649 * Remove unnecessary newline in plugin logging Signed-off-by: Felix Abecassis <[email protected]> * Fix typo in net_ib.cc * Display host name instead of numeric IP when referring to a peer For easier interpretation of debug messages like "connection closed by peer", "peer message truncated" and "peer collective mismatch" * Fix merging error * 2.12.10-1 Fix bug with CollNet Fix bug with zero-bytes send/recv operations Fix NCCL_PARAM implementation to avoid taking a lock on every call Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one. Improve error reporting for network errors. * Update Makefile to install static library. Make sure make install also installs the static library. Fixes #662 * 2.12.12-1 Improve allreduce performance when we have more than one network interface per GPU and we need to use PXN to close rings. Add support for PCI Gen5 on 5.4 kernels. Fix crash when setting NCCL_SET_THREAD_NAME. Fix random crash in init due to uninitialized struct. Fix hang on cubemesh topologies. Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a process. * 2.13.4-1 Optimize CUDA graph launch; avoid launching a CPU callback for intra-node operations. Simplify kernel common code to improve the latency of send/recv operations. Strengthen CUDA streams semantics. Change NET API to v6, to add dmabuf support. Add ncclGetLastError() function. Add ncclRemoteError code and use it for remote network errors. Support the use of a different NCCL_NET parameter per communicator. Add support for SHM and P2P transfers using cudaMemcpy. * fix NCCL_DEBUG_FILE Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it nows sets `ncclDebugLevel` after parse `NCCL_DEBUG_FILE`. This patch move parsing `tempNcclDebugLevel` before processing `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior) Differential Revision: D38415208 fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30 * 2.14.3-1 Add support for improved fault tolerance: non-blocking mode, new init function with config, and ncclCommFinalize function. Reintroduce collnet+chain algorithm, alongside collnet+direct. Add LL protocol for intra-node P2P (on by default) and network communication (off by default). Use network instead of shared memory when performance is better. Fix: wait for CUDA graph destroy before destroying comm with linked graph resources. Remove aggressive polling during enqueue. Fix DMABUF fallback on MOFED 5.4 and earlier. * Fix intermittent 11.6 builds: generate unique .cu file for each object file * address review comments * Fix potential deadlock during init in multi-thread mode. Make sure all calls calling cudaMalloc (including devCommSetup) are called before the last bootstrapBarrier. That way, we avoid calls to cudaMalloc be blocked by a NCCL kernel launched on another GPU by another thread which completed init faster. Resolve #623. * Use compatibility shim only with static cudart Closes issue 658 * 2.15.1-1 Add support for H100 (sm90). Make sure NCCL kernel honor user stream priorities. * Fixes a double-free in the error path of ncclCommInitAll. 
Fixes https://github.com/NVIDIA/nccl/issues/726 * 2.15.5-1 Fix crash with CollnetChain on some node topologies Fix hang when interleaving the capture of different graphs Fix hang during init in multi-threaded mode Fix potential data corruption with LL128 protocol on unaligned buffers. Fix CPU usage during preconnect Fixes double-free in the error path for ncclCommInitAll Workaround hang on H100 with Ring/LL128 on 2 GPUs. * Add documentation for NCCL NET plugins Also repurpose dummy plugin as example, including headers and compat layers from v6 to v2. * Fix google-fastsocket plugin build * 2.16.2-1 Add support for CUDA 12.0, drop Kepler (sm_35). Support for H100 features. Make socket code more robust and protected. Solves #555. Improve performance on large CUDA graphs, reducing dependencies. Reduce inter-socket bandwidth on AMD CPUs to favor better paths. Various fixes to ncclCommAbort. Make service thread polling resistant to EINTR. Compile with profiling API by default. Extend NVTX instrumentation with call arguments. * Fix maximum handle size for NCCL Net v4 API NCCL Net v4 supports a maximum handle size of 64 bytes whereas the ext-net example header files set it for NCCL Net v3. Since, `aws-ofi-nccl` plugin plans to follow the example header files, fix it here. Signed-off-by: Rashika Kheria <[email protected]> * 2.16.5-1 Add support for 400Gbit NDR network adapters (CX7) Handle EINTR in socket poll() function Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead Resource cleanup fixes Fix double free in case of init failure Fix crash in ncclCommAbort Revert AMD speed commit * 2.17.1-1 Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only). Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName. Enable LL128 when we use PXN to close rings. NVTX3 includes update. Fix crash when one CollNet (SHARP) rail fails to initialize. 
* porting from rccl Add MSCCL Support #658 * continue porting for #658, add misc/msccl * porting from rccl Add MSCCL Support #694 * remove unnessary changing history comments * resolve the build related issue * enable msccl when generate the top info during initialization * fix nccl BF check issue * fix the memory alignmentissue for LL prooto * add fp8 support * add alltoall interface * add mullt-imemory supportfor fp8 * fix the test script bug which failed to generate the algo file in certain conditions * fix the protocol simple thread count issue aand impprove the test tool with operator support * fix test script issue * fix the memory confliict issue in simple protocol * fix the ll128 share memory issue and 1 process with multiple GPU issue * remove the unessary code to enable cuda graph and add perf test scenarios * fix fp8 issue * fix max,min mismatch issue and ll128 shared memory issue * turn off the ops when ops index equal or larger than avg * optimize the test script to accomodate the gpu numbers * support perf test env initialization * fix environment prepare bug * fix environment prepare bug * enable auto build if the nccl does not build before * add custimozed topo and graph file support during test * enable ndv5 test scenarios * enable cuda graph for multiple node * initiate the multi node test * enable ncv4 compability * fix multi-node test issue * enable multi-node for ndv4 test scenario * ix ib bandwidth test case issue * fix the fence, proxy和sync flags setting related issues * unified the topo file name for different sku * add vmss creatation script * fix test case issue for multi-node scenario * change the algo for multi node scenario * change the maxbyte to smaller like 65565 for multi node scenario --------- Signed-off-by: Felix Abecassis <[email protected]> Signed-off-by: Rashika Kheria <[email protected]> Signed-off-by: Jonas Zhou <[email protected]> Co-authored-by: Boris Fomitchev <[email protected]> Co-authored-by: Boris Fomitchev <[email protected]> Co-authored-by: Pau Farré <[email protected]> Co-authored-by: Adam Paszke <[email protected]> Co-authored-by: Sylvain Jeaugey <[email protected]> Co-authored-by: Nathan Luehr <[email protected]> Co-authored-by: jiakai <[email protected]> Co-authored-by: Cliff Woolley <[email protected]> Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]> Co-authored-by: Peter Jin <[email protected]> Co-authored-by: Chad Whipkey <[email protected]> Co-authored-by: Ilya Biryukov <[email protected]> Co-authored-by: sclarkson <[email protected]> Co-authored-by: Obihörnchen <[email protected]> Co-authored-by: David Addison <[email protected]> Co-authored-by: Ke Wen <[email protected]> Co-authored-by: Alex Sergeev <[email protected]> Co-authored-by: Christian Sigg <[email protected]> Co-authored-by: Christian Sigg <[email protected]> Co-authored-by: Rong Ou <[email protected]> Co-authored-by: Cao Zongyan <[email protected]> Co-authored-by: Gustavo Alvarez <[email protected]> Co-authored-by: jakirkham <[email protected]> Co-authored-by: Rajat Chopra <[email protected]> Co-authored-by: Felix Abecassis <[email protected]> Co-authored-by: Hirochika Asai <[email protected]> Co-authored-by: Luke Yeager <[email protected]> Co-authored-by: Rashika Kheria <[email protected]> Co-authored-by: aokomoriuta <[email protected]> Co-authored-by: Riatre Foo <[email protected]> Co-authored-by: Eric Badger <[email protected]> Co-authored-by: Jack Snyder <[email protected]> Co-authored-by: xietingwew <[email protected]> Co-authored-by: Jonas Zhou 
Co-authored-by: John Bachan <[email protected]>
Co-authored-by: Chris Jones <[email protected]>
Co-authored-by: Ke Wen <[email protected]>
Co-authored-by: Chang Lan <[email protected]>
Co-authored-by: void-main <[email protected]>
Co-authored-by: Felix Abecassis <[email protected]>
Co-authored-by: Christopher Hesse <[email protected]>
Co-authored-by: Ching-Hsiang Chu <[email protected]>
Co-authored-by: Jane Xu <[email protected]>
Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net>
Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net>
* Enable msccl capability (#5)
* Enable msccl capability (#1)
* Updating for .deb rebuild
* Include link to blog post in README.md
* Rework debian packaging
* Fix make install to use BUILDDIR
* Move deb to build directory
* Packaging: Generate shlibs.local
* Increased version to 1.2.3
* Add a debug level to NCCL and CUDA versions at init
* Fix version number
* Improved Deb generation
* Fixed redundant contexts in multi-process apps
  Change-Id: If787014450fd281304f0c7baf01d25963e40905d
* Remove unneeded deb build script
* Link library with -lrt; otherwise there is an undefined reference to shm_open.
* Pass devlist as const int* rather than int* in ncclCommInitAll.
* Updated LICENCE.txt
* Update LICENSE.txt
* Fix MPI test path
* Add profiling API
* Heavy code refactoring to remove a lot of code in collectives (~1000 lines). Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.
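For context on the MSCCL port in the commit list above, here is a hypothetical sketch of enabling a custom MSCCL algorithm before communicator init. The `MSCCL_XML_FILES` and `NCCL_ALGO=MSCCL,...` conventions follow the upstream microsoft/msccl project and are assumptions here, as is the algorithm file path; none of them come from this changelog.

```c
// Hypothetical sketch: point the MSCCL runtime at an algorithm XML file.
// Env vars must be set before ncclCommInitRank* so init can pick them up.
#include <stdlib.h>

void enableMsccl(void) {
  // Path is made up; real deployments would point at their msccl-algorithms dir.
  setenv("MSCCL_XML_FILES", "/opt/msccl-algorithms/allreduce_ring.xml", 1);
  // Prefer MSCCL, falling back to NCCL's built-in Ring/Tree algorithms.
  setenv("NCCL_ALGO", "MSCCL,RING,TREE", 1);
}
```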
* Make tests check for deltas and report bandwidth
* Add scan tests
* Improved allreduce segmentation for small sizes
* NVML (libwrap): import the needed definitions
* Fix primitives function prototype
* Bump to 1.3.1
* Add Fortran bindings
* Add Copyright header to Fortran bindings source files
* Remove irrelevant output from ncclReduce Fortran tests
* Add a static library target "staticlib" to the Makefile. Rename the static library "libnccl_static.a" to disambiguate from the dynamic libraries.
* Replace min BW by average BW in tests
* 1.3.2 release
  Broadcast tuning.
  Better checking of inputs.
  Copy/reduce code simplification.
* Adding missing file
* Fix 1.3.2 compilation
* Qualify nullptr_t with std::
* Fix crash in Reduce when non-root ranks have invalid recvbuff
* Fix copy/paste typo in error message
* Only enable peer access for ring neighbors. This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.
* Bumping version to 1.3.3
* Fix compilation error when compiling with 'clang -x cuda'. Functions vFetch and vStore are not found by ADL with clang, so they need to be declared before usage in ReduceCopy.
* Added Pascal nvcc flags, bumped version
* Add support for CUDA9 half semantics
* Update README to link to NCCL2
* Update README to link to NCCL2 #2
* Update README to link to NCCL2 part 3
* Update README to link to NCCL2
* Fix tests on Maxwell
* 2.3.5-5
  Add support for inter-node communication using sockets and InfiniBand/RoCE.
  Improve latency.
  Add support for aggregation.
  Improve LL/regular tuning.
  Remove tests, as those are now at github.com/nvidia/nccl-tests.
* Fix nccl-tests all_reduce_perf path. It's `all_reduce_perf`, not `allreduce_perf`.
* 2.3.7-1
  Improved LL tuning for multi-node jobs.
  Improved bootstrap for large job scaling.
  Fixed a hang during bootstrap due to socket reuse.
  Added operation name to the COLL INFO logging.
* Add install target. Fix issue #145.
* Add instructions to install packaging toolchain. Address #143 and #150: debuild not installed.
* Add official builds download link
* Generate nccl.h in build instead of src. Generating nccl.h in src makes source directories dirty after builds.
* Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)
* Add support for external network. Dynamically load the external network from libnccl-net.so. Add an init function in networks. Move PCI scoring to net.cu; only ask the transport to provide a path. Simplify CUDA PCI path detection. Add a dummy external network.
* Make network isend/irecv non-blocking
* Improve net API description
* Rework SYSCHECK macros to better handle retries (a retry sketch follows this list). SYSCHECKVAL was not retrying when a retry was needed. Since not all calls are inside a loop, that means we could silently miss an EINTR/EAGAIN return code. Also rework the socket connection code and improve error reporting.
* Rework shared memory code to use SYSCHECK macros. This is to handle EINTR/EAGAIN properly (issue #137), and also make the code consistent with the rest. Unfortunately posix_fallocate and mmap do not follow the classic return code/errno pattern, so we need to write wrappers around those functions.
* Fixed some compilation errors when TRACE=1 is set
* Improve INFO message when the external network is not found. Fix #162.
* Add NCCL_NET flag to many debug lines.
* Fix GPU Direct RDMA detection. Whether the network supported GPU Direct RDMA or not was ignored, causing sockets to break when cards were local enough that NCCL tried to use it.
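To illustrate what the SYSCHECK rework above is guarding against, here is a rough retry wrapper for syscalls interrupted by signals. `RETRY_ON_INTR` and `robustRead` are made-up names for illustration, not NCCL's actual macros, and a real implementation would poll before retrying on EAGAIN rather than spin.

```c
// Illustrative retry-on-EINTR/EAGAIN wrapper in the spirit of the rework above.
#include <errno.h>
#include <unistd.h>

#define RETRY_ON_INTR(ret, call) do { \
  (ret) = (call); \
} while ((ret) == -1 && (errno == EINTR || errno == EAGAIN))

// Usage: keep re-issuing a read() that was interrupted by a signal.
static ssize_t robustRead(int fd, void* buf, size_t len) {
  ssize_t n;
  RETRY_ON_INTR(n, read(fd, buf, len));
  return n;  // still -1 on real errors other than EINTR/EAGAIN
}
```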
* Remove error logging from a normal path. When initNet fails, we should not print the backtrace, as it is normal operation (falling back to sockets).
* Fix dummy plugin
* Fix #163: remove warnings
* Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION, because clang doesn't define the former.
* Two temporary workarounds for cuda-clang issues.
* Qualify nullptr_t with std::
* Replace CUDA_VERSION by CUDART_VERSION
* Fix memory leak in bootstrapRoot()
* 2.4.2-1
  Add tree algorithms for allreduce to improve performance at scale.
  Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle network errors and permit recovery (see the sketch after this list).
  Detect initial CPU affinity and no longer escape it.
* Fix crash during shared memory creation (#185). The shared memory filename was only based on the destination. While this was OK for rings, since only one rank would send data to a given rank, it would crash with trees because they communicate in both directions.
  Co-authored-by: Rong Ou <[email protected]>
* Fix shared memory collision in the multi-communicator case. The current SHM object name used only pidHash and ranks as identification, which could collide when a program runs with multiple communicators. We added commId info into pidHash, so the pidHashes of different communicators kept in the same process are distinct from each other.
* NCCL 2.4.6-1
  Added detection of IBM/Power NVLink bridge device.
  Add NUMA support to PCI distance calculations.
  Added NCCL_IGNORE_CPU_AFFINITY env var.
  Fix memory leaks; GithubIssue#180.
  Compiler warning fix; GithubIssue#178.
  Replace non-standard variable length arrays. GithubIssue#171.
  Fix Tree+Shared Memory crash. GithubPR#185.
  Fix LL cleanup hang during long running DL jobs.
  Fix NCCL_RINGS environment variable handling.
  Added extra checks to catch repeat calls to ncclCommDestroy(). GithubIssue#191.
  Improve bootstrap socket connection reliability at scale.
  Fix hostname hashing issue. GithubIssue#187.
  Code cleanup to rename all non-device files …
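The 2.4.2-1 entry above adds the asynchronous-error API. Below is a minimal sketch of the polling pattern it enables; the helper name `checkComm` is ours, but `ncclCommGetAsyncError()` and `ncclCommAbort()` are the APIs named in the changelog.

```c
// Minimal sketch: poll for asynchronous network errors and tear the
// communicator down instead of hanging on a dead connection.
#include <nccl.h>

ncclResult_t checkComm(ncclComm_t comm) {
  ncclResult_t asyncErr;
  ncclResult_t ret = ncclCommGetAsyncError(comm, &asyncErr);
  if (ret != ncclSuccess) return ret;
  if (asyncErr != ncclSuccess) {
    // Abort releases resources without waiting for pending operations.
    ncclCommAbort(comm);
    return asyncErr;
  }
  return ncclSuccess;
}
```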