
Commit cc07513

Andyli1007, borisfom, Pau Farré, apaszke, and sjeaugey authored
Enable msccl capability (#5)
* Enable msccl capability (#1) * Moved to pbuilder * Preparing for pbuild * Added compute 5.3 * Added files via upload * Delete libnccl-dev_1.1.1+cuda75_amd64.deb * Delete libnccl1_1.1.1+cuda75_amd64.deb * Use arch=5.3 as well * Version with . 7.5 * fixed version format * Removing Tegra * Enable compilation with old g++ when the default g++ is not supported (+5.0) * Add --no-as-needed to make sure that cudart library gets liked * Fix MPI test usage Only display usage from rank 0 and exit instead of continuing (and seg fault). * Fix random deadlock during ncclCommInitRank. * Fix readme to reflect the new test paths * Moved no-as-needed flag to link rule. Avoids link errors for tests linked with nvcc. * Fixed install location, new .deb version * Fixed version in ChangeLog * Makefile improvements - Use standard CXX env var - Permit redefinition of more env - Separate lib from tests * Removing unneeded includes * Better name for GENCODE * Bump to 1.2.2 * Gencodes changed to NV recommended * Changed CURAND generator to work on a wider set of platforms. * Make NCCL collectives work on communicators with only one rank * Only call the CUDA runtime. That may fix #27. * Updating for .deb rebuild * Include link to blog post in README.md * Rework debian packaging * Fix make install to use BUILDDIR * Move deb to build directory * Packaging : Generate shlibs.local * Increased version to 1.2.3 * Add a debug level to NCCL and CUDA versions at init * Fix version number * Improved Deb generation * Fixed redundant contexts in multi-process apps Change-Id: If787014450fd281304f0c7baf01d25963e40905d * Remove unneeded deb build script * link library with -lrt; otherwise there is undefined reference to shm_open * pass devlist as const int* rather than int* in ncclCommInitAll * Updated LICENCE.txt * Update LICENSE.txt * Fix MPI test path * Add profiling API * Heavy code refactoring to remove a lot of code in collectives (~1000 lines). Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern. * Make tests check for deltas and report bandwidth * Add scan tests * Improved allreduce segmentation for small sizes * NVML (libwrap) : import the needed definitions * Fix primitives function prototype * Bump to 1.3.1 * Add Fortran bindings * Add Copyright header to Fortran bindings source files * Remove irrelevant output from ncclReduce Fortran tests * Add a static library target "staticlib" to the Makefile. Rename the static library "libnccl_static.a" to disambiguate from the dynamic libraries. * Replace min BW by average BW in tests * 1.3.2 release Broadcast tuning Better checking of inputs Copy/reduce code simplification * Adding missing file * Fix 1.3.2 compilation * Qualify nullptr_t with std::. * Fix crash in Reduce when non-root ranks have invalid recvbuff * Fix copy/paste typo in error message * Only enable peer access for ring neighbors. This enables support for systems with more than 9 GPUs attached to a single PCIe root complex. * Bumping version to 1.3.3 * Fix compilation error when compiling with 'clang -x cuda'. Functions vFetch and vStore are not found by ADL with clang, so they need to be declared before usage in ReduceCopy. 
* Added Pascal nvcc flags, bumped version * Add support for CUDA9 half semantics * Update README to link to NCCL2 * Update README to link to NCCL2 #2 * Update README to link to NCCL2 part 3 * Update README to link to NCCL2 * fix tests on maxwell * 2.3.5-5 Add support for inter-node communication using sockets and InfiniBand/RoCE. Improve latency. Add support for aggregation. Improve LL/regular tuning. Remove tests as those are now at github.com/nvidia/nccl-tests . * Fix nccl-tests all_reduce_perf path It's `all_reduce_perf` not `allreduce_perf` * 2.3.7-1 Improved LL tuning for multi-node jobs. Improved bootstrap for large job scaling. Fixed a hang during bootstrap due to socket reuse. Added operation name to the COLL INFO logging. * Add install target Fix issue #145 * Add instructions to install packaging toolchain Address #143 and #150 : debuild not installed. * Add official builds download link * Generate nccl.h in build instead of src Generating nccl.h in src makes source directories dirty after builds. * Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156) * Add support for external network. Dynamically load external network from libnccl-net.so. Add init function in networks. Move PCI scoring to net.cu, only ask transport to provide a path. Simplify CUDA PCI path detection. Add dummy external network * Make network isend/irecv non blocking * Improve net API description * Rework SYSCHECK macros to better handle retries. SYSCHECKVAL was not retrying when a retry was needed. Since not all calls are inside a loop, that means we could silently miss an EINTR/EAGAIN return code. Also rework the socket connection code and improve error reporting. * Rework shared memory code to use SYSCHECK macros. This is to handle EINTR/EGAIN properly (issue #137), and also make the code consistent with the rest. Unfortunately posix_fallocate and mmap do not follow the classic return code/errno pattern, so we need to write wrappers around those functions. * Fixed some compilation errors when TRACE=1 set * Improve INFO message when external network is not found. Fix #162 * Add NCCL_NET flag to many debug lines. * Fix GPU Direct RDMA detection. Whether the network supported GPU Direct RDMA or not was ignored, causing sockets to break when cards were local enough that NCCL tried to use it. * Remove error logging from a normal path When initNet fails, we should not print the backtrace as it is supposed to be normal operation (falling back to sockets) * Fix dummy plugin * Fix #163 : remove warnings * Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former. * Two temporary workarounds for cuda-clang issues. * Qualify nullptr_t with std:: * Replace CUDA_VERSION by CUDART_VERSION * Fix memory leak in bootstrapRoot() * 2.4.2-1 Add tree algorithms for allreduce to improve performance at scale. Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle network errors and be permit recover. Detect initial CPU affinity and no longer escape it. * Fix crash during shared memory creation (#185) The shared memory filename was only based on the destination. While this was OK for rings since only one rank would send data to a given rank, it would crash with trees because they communicate in both directions. Co-authored-by: Rong Ou <[email protected]> * Fix share memory collision in multi-communicator case. 
Current SHM object name would only use pidHash and ranks as identification, which would collide each other when program runs with multiple communicators. Here we added commId info into pidHash, it makes 'pidHash'es of different communicators keeping in same process will be distincted with each other. * NCCL 2.4.6-1 Added detection of IBM/Power NVLink bridge device. Add NUMA support to PCI distance calculations. Added NCCL_IGNORE_CPU_AFFINITY env var. Fix memory leaks; GithubIssue#180 Compiler warning fix; GithubIssue#178 Replace non-standard variable length arrays. GithubIssue#171 Fix Tree+Shared Memory crash. GithubPR#185 Fix LL cleanup hang during long running DL jobs. Fix NCCL_RINGS environment variable handling. Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191 Improve bootstrap socket connection reliability at scale. Fix hostname hashing issue. GithubIssue#187 Code cleanup to rename all non device files from *.cu to *.cc * Add pkgconfig file (#190) * Allow CUDA runtime library selection (#220) Makes a change to allow the user to select between the static CUDA runtime library (default) and the dynamic CUDA runtime library. Does this by allowing `CUDARTLIB` to be overridden. * NCCL 2.4.7-1 Performance tweaks for PowerPC builds only; Set default NCCL_MIN_NRINGS to 4 Disable PCI-E NUMA distance detection * Update debian dependencies in README (#228) 'fakeroot' is needed for building deb packages * Fix out-of-bounds read in ncclStrToCpuset (#233) The affinityStr string was not null-terminated but was passed to strlen(3). Signed-off-by: Felix Abecassis <[email protected]> * 2.4.8-1 Fix #209: improve socket transport performance Split transfers over multiple sockets Launch multiple threads to drive sockets Detect AWS NICs and set nsockets/nthreads accordingly * Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236) Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified. * Size up IPC buffers to multiples of 2MB Avoid potential CUDA error in concurrent communicator initialization * Fix #224: prevent number of IB devices from going out of bound * Fix NIC distances for 11+ NICs * Refine RPM package building spec file. Add /sbin/ldconfig into RPM package install operations. * Make use of SO_REUSEPORT conditional Fixes: #244 SO_RESUEPORT was introduced in Linux 3.9 and later. This change allows NCCL to compile against older releases. The functionality is only required if the user is specifying a NCCL bootstrap address via an environment variable. * Updated PR#196 to use a common hash function * 2.5.6-1 (#255) Add LL128 Protocol. Rewrite the topology detection and tree/ring creation (#179). Improve tree performance by sending/receiving from different GPUs. Add model-based tuning to switch between the different algorithms and protocols. Rework P2P/SHM detection in containers (#155, #248). Detect duplicated devices and return an error (#231). Add tuning for GCP * Fix clang build (#271) Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro. * Fix clang compilation * 2.5.6-2 Fix PPC64 Debian packaging * Fix clang build (#274) The attribute is called `optnone`, not `noopt`. * [build] Allow setting CXXFLAGS on the command line * [topology] remove NET links when trimming system This fixes a memory leak. * 2.5.7-1 * Fix Allgather operations above 4G with multiple GPUs per process. Fixes nccl-tests#37. 
Direct offsets were still on 32 bits in the low-level primitives. * Check return code for Flush operation Current NCCL code does not abort for failed Flush operations by underlying network. This may compromise data integrity. Signed-off-by: Rashika Kheria <[email protected]> * 2.6.4-1 Add support for network collectives. Add support for XML topology dump/injection. Add text values for GDR and P2P Levels, including "NVL". Add speed detection for PCI, Infiniband and Ethernet cards. Add CPU detection for ARM and AMD CPUs. Add support for adaptive routing on Infiniband. Change NET plugin API to v3 : merge PCI path and GPU pointer capability into a single structure and add other properties. * Fix bug #307 : wrong NIC selection on the reduction tree. The reduction tree (tree up) was inverting the NICs to use, causing performance issue in cases where we are using different NICs on a given channel. * Fix wrong variable name "slice" to "chunk" https://github.com/NVIDIA/nccl/issues/287 * Improve robustness of PCI detection Fallback to default values when class/speed is unknown. * Fix crash when only a subset of GPUs are visible within a container. Fixes #326. * 2.7.3-1 Add support for A100 GPU and related platforms. Add support for CUDA 11. Add support for send/receive operations (beta). * 2.7.5-1 Minor fixes for A100 platforms. Add a WARN for invalid GroupEnd call. * 2.7.6-1 Fix crash when NVswitch is not visible inside a VM. * Fix build action order Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB). As there were no dep files during the first build, Make may kick off source compilation before nccl.h got generated, which leads to occasional build failures on systems with high core count. The build failure could be reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule. * 2.7.8-1 Fix collective mismatch error when using ncclSend/ncclRecv * Don't require NIC devices to have specific PCI class If a PCI node is the parent of a NIC, treat it as such, regardless of the PCI class code for the device. This allows non-traditional devices to act as NICs via the net plugin mechanism. For consistency, treat GPUs similarly. * Setting type when gpu sub node is discovered * Make sure proxy threads inherit the CPU affinity. * Fix affinity move * fix proxyArgs for trace log * 2.8.3-1 Optimization for Tree allreduce on A100. Improve aggregation performance. Use shared buffers for inter-node send/recv. Add NVTX profiling hooks. Accelerate alltoall connections by merging communication for all channels. Add support for one hop communication through NVLink, for faster send/recv communication on cubemesh topologies like DGX-1. Improve alltoall scheduling to better balance intra/inter node communication. Increase send/recv parallelism by 8x, each warp sending or receiving to a different peer. Net: move to v4. Net: make flush operation asynchronous to accelerate alltoall. Net: define maximum number of requests. Fix hang when using LL128 protocol after 2^31 steps. Fix #379 : topology injection failing when using less GPUs than described in the XML. Fix #394 : protocol mismatch causing hangs or crashes when using one GPU per node. * x86: Add CPU detection for Zhaoxin processors Signed-off-by: Jonas Zhou <[email protected]> * 2.8.4-1 Fix hang in corner cases of alltoallv using point to point send/recv. Harmonize error messages. Fix missing NVTX section in the license. Update README. * 2.9.6-1 Add support for CUDA graphs. Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. 
Issue #439. Fix bootstrap issue caused by connection reordering. Fix CPU locking block. Improve CollNet algorithm. Improve performance on DGX A100 for communicators with only one GPU per node. * 2.9.8-1 Fix memory leaks. Fix crash in bootstrap error case. Fix Collnet clean-up issue. Make PCI switch vendor/device optional for XML injection. Add support for nvidia-peermem module. * 2.9.9-1 Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels. Fix hang during sendrecv dynamic NVB connection establishment on cubemesh topologies. Add environment variable to only use SHARP on communicators beyond a given number of ranks. Add debug subsystem to trace memory allocations. Fix compilation with TRACE=1. (Issue #505) * 2.10.3-1 Add support for bfloat16. Add ncclAvg reduction operation. Improve performance for aggregated operations. Improve performance for tree. Improve network error reporting. Add NCCL_NET parameter to force a specific network. Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs. Fix topology detection error in WSL2. Fix proxy memory elements affinity (improve alltoall performance). Fix graph search on cubemesh topologies. Fix hang in cubemesh during NVB connections. * Fix to https://github.com/NVIDIA/nccl/issues/560 ncclGroup's containing operations of mixed datatype, element, or collective would induce crash. * 2.11.4-1 Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum). Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration. Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so". Fix memory leak of NVB connections. Fix topology detection of IB Virtual Functions (SR-IOV). * Fix Collnet when GDR is disabled * Fix compilation failure in "src/enqueue.cc" on older GCC because of missing `#include <cstring>`. * Perform `busIdToInt64` on the stack. I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is called thousands of times, making the log output unintelligible. Fortunately, this function can be implemented without heap allocations. * Improve warning message about truncated messages Display hints of cause so that it would be easier for user to debug. Also change the error type from InternalError to InvalidUsage as most of time this is caused by a mismatch in collective size or env settings. * Add env NCCL_NET_DISABLE_INTRA Disable NET transport for intra-node communication by setting the env to 1 It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable * Build fastsocket plugin from ext-net * remove unused basePath * Revert "remove unused basePath" This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda. * Fix ext-net/google-fastsocket build * Split IB parameter sanity check into two parts First part on collective mismatch, second part on internal errors * 2.12.7-1 Add network communication through another GPU connected with NVLink (PXN). Add aggregation of messages coming from different local GPUs through PXN and going to the same destination. Add new v5 plugin API with grouped receives and tags. Add compat for v4 plugins. Add naming of NCCL threads to help debugging. Fix NVLink detection and avoid data corruption when some NVLinks are down. 
Add support for Relaxed Ordering for IB. Add profiling and timing infrastructure. * Add pthread_detach()'s for threads we never pthread_join(). Helps reduce diagnostic noise for ThreadSanitizer. Fixes https://github.com/NVIDIA/nccl/issues/649 * Remove unnecessary newline in plugin logging Signed-off-by: Felix Abecassis <[email protected]> * Fix typo in net_ib.cc * Display host name instead of numeric IP when referring to a peer For easier interpretation of debug messages like "connection closed by peer", "peer message truncated" and "peer collective mismatch" * Fix merging error * 2.12.10-1 Fix bug with CollNet Fix bug with zero-bytes send/recv operations Fix NCCL_PARAM implementation to avoid taking a lock on every call Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one. Improve error reporting for network errors. * Update Makefile to install static library. Make sure make install also installs the static library. Fixes #662 * 2.12.12-1 Improve allreduce performance when we have more than one network interface per GPU and we need to use PXN to close rings. Add support for PCI Gen5 on 5.4 kernels. Fix crash when setting NCCL_SET_THREAD_NAME. Fix random crash in init due to uninitialized struct. Fix hang on cubemesh topologies. Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a process. * 2.13.4-1 Optimize CUDA graph launch; avoid launching a CPU callback for intra-node operations. Simplify kernel common code to improve the latency of send/recv operations. Strengthen CUDA streams semantics. Change NET API to v6, to add dmabuf support. Add ncclGetLastError() function. Add ncclRemoteError code and use it for remote network errors. Support the use of a different NCCL_NET parameter per communicator. Add support for SHM and P2P transfers using cudaMemcpy. * fix NCCL_DEBUG_FILE Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it nows sets `ncclDebugLevel` after parse `NCCL_DEBUG_FILE`. This patch move parsing `tempNcclDebugLevel` before processing `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior) Differential Revision: D38415208 fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30 * 2.14.3-1 Add support for improved fault tolerance: non-blocking mode, new init function with config, and ncclCommFinalize function. Reintroduce collnet+chain algorithm, alongside collnet+direct. Add LL protocol for intra-node P2P (on by default) and network communication (off by default). Use network instead of shared memory when performance is better. Fix: wait for CUDA graph destroy before destroying comm with linked graph resources. Remove aggressive polling during enqueue. Fix DMABUF fallback on MOFED 5.4 and earlier. * Fix intermittent 11.6 builds: generate unique .cu file for each object file * address review comments * Fix potential deadlock during init in multi-thread mode. Make sure all calls calling cudaMalloc (including devCommSetup) are called before the last bootstrapBarrier. That way, we avoid calls to cudaMalloc be blocked by a NCCL kernel launched on another GPU by another thread which completed init faster. Resolve #623. * Use compatibility shim only with static cudart Closes issue 658 * 2.15.1-1 Add support for H100 (sm90). Make sure NCCL kernel honor user stream priorities. * Fixes a double-free in the error path of ncclCommInitAll. 
Fixes https://github.com/NVIDIA/nccl/issues/726 * 2.15.5-1 Fix crash with CollnetChain on some node topologies Fix hang when interleaving the capture of different graphs Fix hang during init in multi-threaded mode Fix potential data corruption with LL128 protocol on unaligned buffers. Fix CPU usage during preconnect Fixes double-free in the error path for ncclCommInitAll Workaround hang on H100 with Ring/LL128 on 2 GPUs. * Add documentation for NCCL NET plugins Also repurpose dummy plugin as example, including headers and compat layers from v6 to v2. * Fix google-fastsocket plugin build * 2.16.2-1 Add support for CUDA 12.0, drop Kepler (sm_35). Support for H100 features. Make socket code more robust and protected. Solves #555. Improve performance on large CUDA graphs, reducing dependencies. Reduce inter-socket bandwidth on AMD CPUs to favor better paths. Various fixes to ncclCommAbort. Make service thread polling resistant to EINTR. Compile with profiling API by default. Extend NVTX instrumentation with call arguments. * Fix maximum handle size for NCCL Net v4 API NCCL Net v4 supports a maximum handle size of 64 bytes whereas the ext-net example header files set it for NCCL Net v3. Since, `aws-ofi-nccl` plugin plans to follow the example header files, fix it here. Signed-off-by: Rashika Kheria <[email protected]> * 2.16.5-1 Add support for 400Gbit NDR network adapters (CX7) Handle EINTR in socket poll() function Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead Resource cleanup fixes Fix double free in case of init failure Fix crash in ncclCommAbort Revert AMD speed commit * 2.17.1-1 Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only). Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName. Enable LL128 when we use PXN to close rings. NVTX3 includes update. Fix crash when one CollNet (SHARP) rail fails to initialize. 
* porting from rccl Add MSCCL Support #658 * continue porting for #658, add misc/msccl * porting from rccl Add MSCCL Support #694 * remove unnecessary change history comments * resolve the build-related issue * enable msccl when generating the topo info during initialization * fix nccl BF check issue * fix the memory alignment issue for the LL protocol * add fp8 support * add alltoall interface * add multi-memory support for fp8 * fix the test script bug that failed to generate the algo file in certain conditions * fix the simple protocol thread count issue and improve the test tool with operator support * fix test script issue * fix the memory conflict issue in the simple protocol * fix the ll128 shared memory issue and the 1-process-with-multiple-GPUs issue * remove the unnecessary code to enable cuda graph and add perf test scenarios * fix fp8 issue * fix max/min mismatch issue and ll128 shared memory issue * turn off the ops when the op index is equal to or larger than avg * optimize the test script to accommodate the number of gpus * support perf test env initialization * fix environment prepare bug * fix environment prepare bug * enable auto build if nccl was not built before * add customized topo and graph file support during test * enable ndv5 test scenarios * enable cuda graph for multiple nodes * initiate the multi-node test * enable ncv4 compatibility * fix multi-node test issue * enable multi-node for ndv4 test scenario * fix ib bandwidth test case issue * fix the fence, proxy, and sync flag setting related issues * unified the topo file name for different skus * add vmss creation script * fix test case issue for multi-node scenario * change the algo for multi-node scenario * change the maxbyte to a smaller value like 65565 for the multi-node scenario --------- Signed-off-by: Felix Abecassis <[email protected]> Signed-off-by: Rashika Kheria <[email protected]> Signed-off-by: Jonas Zhou <[email protected]> Co-authored-by: Boris Fomitchev <[email protected]> Co-authored-by: Boris Fomitchev <[email protected]> Co-authored-by: Pau Farré <[email protected]> Co-authored-by: Adam Paszke <[email protected]> Co-authored-by: Sylvain Jeaugey <[email protected]> Co-authored-by: Nathan Luehr <[email protected]> Co-authored-by: jiakai <[email protected]> Co-authored-by: Cliff Woolley <[email protected]> Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]> Co-authored-by: Peter Jin <[email protected]> Co-authored-by: Chad Whipkey <[email protected]> Co-authored-by: Ilya Biryukov <[email protected]> Co-authored-by: sclarkson <[email protected]> Co-authored-by: Obihörnchen <[email protected]> Co-authored-by: David Addison <[email protected]> Co-authored-by: Ke Wen <[email protected]> Co-authored-by: Alex Sergeev <[email protected]> Co-authored-by: Christian Sigg <[email protected]> Co-authored-by: Christian Sigg <[email protected]> Co-authored-by: Rong Ou <[email protected]> Co-authored-by: Cao Zongyan <[email protected]> Co-authored-by: Gustavo Alvarez <[email protected]> Co-authored-by: jakirkham <[email protected]> Co-authored-by: Rajat Chopra <[email protected]> Co-authored-by: Felix Abecassis <[email protected]> Co-authored-by: Hirochika Asai <[email protected]> Co-authored-by: Luke Yeager <[email protected]> Co-authored-by: Rashika Kheria <[email protected]> Co-authored-by: aokomoriuta <[email protected]> Co-authored-by: Riatre Foo <[email protected]> Co-authored-by: Eric Badger <[email protected]> Co-authored-by: Jack Snyder <[email protected]> Co-authored-by: xietingwew <[email protected]> Co-authored-by: Jonas Zhou 
<[email protected]> Co-authored-by: John Bachan <[email protected]> Co-authored-by: Chris Jones <[email protected]> Co-authored-by: Ke Wen <[email protected]> Co-authored-by: Chang Lan <[email protected]> Co-authored-by: void-main <[email protected]> Co-authored-by: Felix Abecassis <[email protected]> Co-authored-by: Christopher Hesse <[email protected]> Co-authored-by: Ching-Hsiang Chu <[email protected]> Co-authored-by: Jane Xu <[email protected]> Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net> Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net> Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net> Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net> Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net> Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net> Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net> Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net> * Enable msccl capability (#2) * Preparing for pbuild * Added compute 5.3 * Added files via upload * Delete libnccl-dev_1.1.1+cuda75_amd64.deb * Delete libnccl1_1.1.1+cuda75_amd64.deb * Use arch=5.3 as well * Version with . 7.5 * fixed version format * Removing Tegra * Enable compilation with old g++ when the default g++ is not supported (+5.0) * Add --no-as-needed to make sure that cudart library gets liked * Fix MPI test usage Only display usage from rank 0 and exit instead of continuing (and seg fault). * Fix random deadlock during ncclCommInitRank. * Fix readme to reflect the new test paths * Moved no-as-needed flag to link rule. Avoids link errors for tests linked with nvcc. * Fixed install location, new .deb version * Fixed version in ChangeLog * Makefile improvements - Use standard CXX env var - Permit redefinition of more env - Separate lib from tests * Removing unneeded includes * Better name for GENCODE * Bump to 1.2.2 * Gencodes changed to NV recommended * Changed CURAND generator to work on a wider set of platforms. * Make NCCL collectives work on communicators with only one rank * Only call the CUDA runtime. That may fix #27. * Updating for .deb rebuild * Include link to blog post in README.md * Rework debian packaging * Fix make install to use BUILDDIR * Move deb to build directory * Packaging : Generate shlibs.local * Increased version to 1.2.3 * Add a debug level to NCCL and CUDA versions at init * Fix version number * Improved Deb generation * Fixed redundant contexts in multi-process apps Change-Id: If787014450fd281304f0c7baf01d25963e40905d * Remove unneeded deb build script * link library with -lrt; otherwise there is undefined reference to shm_open * pass devlist as const int* rather than int* in ncclCommInitAll * Updated LICENCE.txt * Update LICENSE.txt * Fix MPI test path * Add profiling API * Heavy code refactoring to remove a lot of code in collectives (~1000 lines). Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern. 
* Make tests check for deltas and report bandwidth * Add scan tests * Improved allreduce segmentation for small sizes * NVML (libwrap) : import the needed definitions * Fix primitives function prototype * Bump to 1.3.1 * Add Fortran bindings * Add Copyright header to Fortran bindings source files * Remove irrelevant output from ncclReduce Fortran tests * Add a static library target "staticlib" to the Makefile. Rename the static library "libnccl_static.a" to disambiguate from the dynamic libraries. * Replace min BW by average BW in tests * 1.3.2 release Broadcast tuning Better checking of inputs Copy/reduce code simplification * Adding missing file * Fix 1.3.2 compilation * Qualify nullptr_t with std::. * Fix crash in Reduce when non-root ranks have invalid recvbuff * Fix copy/paste typo in error message * Only enable peer access for ring neighbors. This enables support for systems with more than 9 GPUs attached to a single PCIe root complex. * Bumping version to 1.3.3 * Fix compilation error when compiling with 'clang -x cuda'. Functions vFetch and vStore are not found by ADL with clang, so they need to be declared before usage in ReduceCopy. * Added Pascal nvcc flags, bumped version * Add support for CUDA9 half semantics * Update README to link to NCCL2 * Update README to link to NCCL2 #2 * Update README to link to NCCL2 part 3 * Update README to link to NCCL2 * fix tests on maxwell * 2.3.5-5 Add support for inter-node communication using sockets and InfiniBand/RoCE. Improve latency. Add support for aggregation. Improve LL/regular tuning. Remove tests as those are now at github.com/nvidia/nccl-tests . * Fix nccl-tests all_reduce_perf path It's `all_reduce_perf` not `allreduce_perf` * 2.3.7-1 Improved LL tuning for multi-node jobs. Improved bootstrap for large job scaling. Fixed a hang during bootstrap due to socket reuse. Added operation name to the COLL INFO logging. * Add install target Fix issue #145 * Add instructions to install packaging toolchain Address #143 and #150 : debuild not installed. * Add official builds download link * Generate nccl.h in build instead of src Generating nccl.h in src makes source directories dirty after builds. * Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156) * Add support for external network. Dynamically load external network from libnccl-net.so. Add init function in networks. Move PCI scoring to net.cu, only ask transport to provide a path. Simplify CUDA PCI path detection. Add dummy external network * Make network isend/irecv non blocking * Improve net API description * Rework SYSCHECK macros to better handle retries. SYSCHECKVAL was not retrying when a retry was needed. Since not all calls are inside a loop, that means we could silently miss an EINTR/EAGAIN return code. Also rework the socket connection code and improve error reporting. * Rework shared memory code to use SYSCHECK macros. This is to handle EINTR/EGAIN properly (issue #137), and also make the code consistent with the rest. Unfortunately posix_fallocate and mmap do not follow the classic return code/errno pattern, so we need to write wrappers around those functions. * Fixed some compilation errors when TRACE=1 set * Improve INFO message when external network is not found. Fix #162 * Add NCCL_NET flag to many debug lines. * Fix GPU Direct RDMA detection. Whether the network supported GPU Direct RDMA or not was ignored, causing sockets to break when cards were local enough that NCCL tried to use it. 
* Remove error logging from a normal path When initNet fails, we should not print the backtrace as it is supposed to be normal operation (falling back to sockets) * Fix dummy plugin * Fix #163 : remove warnings * Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former. * Two temporary workarounds for cuda-clang issues. * Qualify nullptr_t with std:: * Replace CUDA_VERSION by CUDART_VERSION * Fix memory leak in bootstrapRoot() * 2.4.2-1 Add tree algorithms for allreduce to improve performance at scale. Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle network errors and be permit recover. Detect initial CPU affinity and no longer escape it. * Fix crash during shared memory creation (#185) The shared memory filename was only based on the destination. While this was OK for rings since only one rank would send data to a given rank, it would crash with trees because they communicate in both directions. Co-authored-by: Rong Ou <[email protected]> * Fix share memory collision in multi-communicator case. Current SHM object name would only use pidHash and ranks as identification, which would collide each other when program runs with multiple communicators. Here we added commId info into pidHash, it makes 'pidHash'es of different communicators keeping in same process will be distincted with each other. * NCCL 2.4.6-1 Added detection of IBM/Power NVLink bridge device. Add NUMA support to PCI distance calculations. Added NCCL_IGNORE_CPU_AFFINITY env var. Fix memory leaks; GithubIssue#180 Compiler warning fix; GithubIssue#178 Replace non-standard variable length arrays. GithubIssue#171 Fix Tree+Shared Memory crash. GithubPR#185 Fix LL cleanup hang during long running DL jobs. Fix NCCL_RINGS environment variable handling. Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191 Improve bootstrap socket connection reliability at scale. Fix hostname hashing issue. GithubIssue#187 Code cleanup to rename all non device files from *.cu to *.cc * Add pkgconfig file (#190) * Allow CUDA runtime library selection (#220) Makes a change to allow the user to select between the static CUDA runtime library (default) and the dynamic CUDA runtime library. Does this by allowing `CUDARTLIB` to be overridden. * NCCL 2.4.7-1 Performance tweaks for PowerPC builds only; Set default NCCL_MIN_NRINGS to 4 Disable PCI-E NUMA distance detection * Update debian dependencies in README (#228) 'fakeroot' is needed for building deb packages * Fix out-of-bounds read in ncclStrToCpuset (#233) The affinityStr string was not null-terminated but was passed to strlen(3). Signed-off-by: Felix Abecassis <[email protected]> * 2.4.8-1 Fix #209: improve socket transport performance Split transfers over multiple sockets Launch multiple threads to drive sockets Detect AWS NICs and set nsockets/nthreads accordingly * Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236) Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified. * Size up IPC buffers to multiples of 2MB Avoid potential CUDA error in concurrent communicator initialization * Fix #224: prevent number of IB devices from going out of bound * Fix NIC distances for 11+ NICs * Refine RPM package building spec file. Add /sbin/ldconfig into RPM package install operations. * Make use of SO_REUSEPORT conditional Fixes: #244 SO_RESUEPORT was introduced in Linux 3.9 and later. 
This change allows NCCL to compile against older releases. The functionality is only required if the user is specifying a NCCL bootstrap address via an environment variable. * Updated PR#196 to use a common hash function * 2.5.6-1 (#255) Add LL128 Protocol. Rewrite the topology detection and tree/ring creation (#179). Improve tree performance by sending/receiving from different GPUs. Add model-based tuning to switch between the different algorithms and protocols. Rework P2P/SHM detection in containers (#155, #248). Detect duplicated devices and return an error (#231). Add tuning for GCP * Fix clang build (#271) Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro. * Fix clang compilation * 2.5.6-2 Fix PPC64 Debian packaging * Fix clang build (#274) The attribute is called `optnone`, not `noopt`. * [build] Allow setting CXXFLAGS on the command line * [topology] remove NET links when trimming system This fixes a memory leak. * 2.5.7-1 * Fix Allgather operations above 4G with multiple GPUs per process. Fixes nccl-tests#37. Direct offsets were still on 32 bits in the low-level primitives. * Check return code for Flush operation Current NCCL code does not abort for failed Flush operations by underlying network. This may compromise data integrity. Signed-off-by: Rashika Kheria <[email protected]> * 2.6.4-1 Add support for network collectives. Add support for XML topology dump/injection. Add text values for GDR and P2P Levels, including "NVL". Add speed detection for PCI, Infiniband and Ethernet cards. Add CPU detection for ARM and AMD CPUs. Add support for adaptive routing on Infiniband. Change NET plugin API to v3 : merge PCI path and GPU pointer capability into a single structure and add other properties. * Fix bug #307 : wrong NIC selection on the reduction tree. The reduction tree (tree up) was inverting the NICs to use, causing performance issue in cases where we are using different NICs on a given channel. * Fix wrong variable name "slice" to "chunk" https://github.com/NVIDIA/nccl/issues/287 * Improve robustness of PCI detection Fallback to default values when class/speed is unknown. * Fix crash when only a subset of GPUs are visible within a container. Fixes #326. * 2.7.3-1 Add support for A100 GPU and related platforms. Add support for CUDA 11. Add support for send/receive operations (beta). * 2.7.5-1 Minor fixes for A100 platforms. Add a WARN for invalid GroupEnd call. * 2.7.6-1 Fix crash when NVswitch is not visible inside a VM. * Fix build action order Add $(INCTARGETS) to build dependencies of %.o and $(DEVICELIB). As there were no dep files during the first build, Make may kick off source compilation before nccl.h got generated, which leads to occasional build failures on systems with high core count. The build failure could be reproduced reliably with a `sleep 5` in $(INCDIR)/nccl.h rule. * 2.7.8-1 Fix collective mismatch error when using ncclSend/ncclRecv * Don't require NIC devices to have specific PCI class If a PCI node is the parent of a NIC, treat it as such, regardless of the PCI class code for the device. This allows non-traditional devices to act as NICs via the net plugin mechanism. For consistency, treat GPUs similarly. * Setting type when gpu sub node is discovered * Make sure proxy threads inherit the CPU affinity. * Fix affinity move * fix proxyArgs for trace log * 2.8.3-1 Optimization for Tree allreduce on A100. Improve aggregation performance. Use shared buffers for inter-node send/recv. 
Add NVTX profiling hooks. Accelerate alltoall connections by merging communication for all channels. Add support for one hop communication through NVLink, for faster send/recv communication on cubemesh topologies like DGX-1. Improve alltoall scheduling to better balance intra/inter node communication. Increase send/recv parallelism by 8x, each warp sending or receiving to a different peer. Net: move to v4. Net: make flush operation asynchronous to accelerate alltoall. Net: define maximum number of requests. Fix hang when using LL128 protocol after 2^31 steps. Fix #379 : topology injection failing when using less GPUs than described in the XML. Fix #394 : protocol mismatch causing hangs or crashes when using one GPU per node. * x86: Add CPU detection for Zhaoxin processors Signed-off-by: Jonas Zhou <[email protected]> * 2.8.4-1 Fix hang in corner cases of alltoallv using point to point send/recv. Harmonize error messages. Fix missing NVTX section in the license. Update README. * 2.9.6-1 Add support for CUDA graphs. Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439. Fix bootstrap issue caused by connection reordering. Fix CPU locking block. Improve CollNet algorithm. Improve performance on DGX A100 for communicators with only one GPU per node. * 2.9.8-1 Fix memory leaks. Fix crash in bootstrap error case. Fix Collnet clean-up issue. Make PCI switch vendor/device optional for XML injection. Add support for nvidia-peermem module. * 2.9.9-1 Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels. Fix hang during sendrecv dynamic NVB connection establishment on cubemesh topologies. Add environment variable to only use SHARP on communicators beyond a given number of ranks. Add debug subsystem to trace memory allocations. Fix compilation with TRACE=1. (Issue #505) * 2.10.3-1 Add support for bfloat16. Add ncclAvg reduction operation. Improve performance for aggregated operations. Improve performance for tree. Improve network error reporting. Add NCCL_NET parameter to force a specific network. Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs. Fix topology detection error in WSL2. Fix proxy memory elements affinity (improve alltoall performance). Fix graph search on cubemesh topologies. Fix hang in cubemesh during NVB connections. * Fix to https://github.com/NVIDIA/nccl/issues/560 ncclGroup's containing operations of mixed datatype, element, or collective would induce crash. * 2.11.4-1 Add new API for creating a reduction operation which multiplies the input by a rank-specific scalar before doing an inter-rank summation (see: ncclRedOpCreatePreMulSum). Improve CollNet (SHARP) performance of ncclAllReduce when captured in a CUDA Graph via user buffer registration. Add environment variable NCCL_NET_PLUGIN="<suffix>" to allow user to choose among multiple NCCL net plugins by substituting into "libnccl-net-<suffix>.so". Fix memory leak of NVB connections. Fix topology detection of IB Virtual Functions (SR-IOV). * Fix Collnet when GDR is disabled * Fix compilation failure in "src/enqueue.cc" on older GCC because of missing `#include <cstring>`. * Perform `busIdToInt64` on the stack. I noticed when I enabled `NCCL_DEBUG_SUBSYS=ALLOC` that this function is called thousands of times, making the log output unintelligible. Fortunately, this function can be implemented without heap allocations. * Improve warning message about truncated messages Display hints of cause so that it would be easier for user to debug. 
Also change the error type from InternalError to InvalidUsage as most of time this is caused by a mismatch in collective size or env settings. * Add env NCCL_NET_DISABLE_INTRA Disable NET transport for intra-node communication by setting the env to 1 It provides an option to error out instead of falling back to NET when superior intra-node transports (P2P and SHM) are unavailable * Build fastsocket plugin from ext-net * remove unused basePath * Revert "remove unused basePath" This reverts commit 445bc1965720787aa19c8fc1c0bf62db43db2dda. * Fix ext-net/google-fastsocket build * Split IB parameter sanity check into two parts First part on collective mismatch, second part on internal errors * 2.12.7-1 Add network communication through another GPU connected with NVLink (PXN). Add aggregation of messages coming from different local GPUs through PXN and going to the same destination. Add new v5 plugin API with grouped receives and tags. Add compat for v4 plugins. Add naming of NCCL threads to help debugging. Fix NVLink detection and avoid data corruption when some NVLinks are down. Add support for Relaxed Ordering for IB. Add profiling and timing infrastructure. * Add pthread_detach()'s for threads we never pthread_join(). Helps reduce diagnostic noise for ThreadSanitizer. Fixes https://github.com/NVIDIA/nccl/issues/649 * Remove unnecessary newline in plugin logging Signed-off-by: Felix Abecassis <[email protected]> * Fix typo in net_ib.cc * Display host name instead of numeric IP when referring to a peer For easier interpretation of debug messages like "connection closed by peer", "peer message truncated" and "peer collective mismatch" * Fix merging error * 2.12.10-1 Fix bug with CollNet Fix bug with zero-bytes send/recv operations Fix NCCL_PARAM implementation to avoid taking a lock on every call Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one. Improve error reporting for network errors. * Update Makefile to install static library. Make sure make install also installs the static library. Fixes #662 * 2.12.12-1 Improve allreduce performance when we have more than one network interface per GPU and we need to use PXN to close rings. Add support for PCI Gen5 on 5.4 kernels. Fix crash when setting NCCL_SET_THREAD_NAME. Fix random crash in init due to uninitialized struct. Fix hang on cubemesh topologies. Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a process. * 2.13.4-1 Optimize CUDA graph launch; avoid launching a CPU callback for intra-node operations. Simplify kernel common code to improve the latency of send/recv operations. Strengthen CUDA streams semantics. Change NET API to v6, to add dmabuf support. Add ncclGetLastError() function. Add ncclRemoteError code and use it for remote network errors. Support the use of a different NCCL_NET parameter per communicator. Add support for SHM and P2P transfers using cudaMemcpy. * fix NCCL_DEBUG_FILE Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (https://github.com/NVIDIA/nccl/pull/682) because it nows sets `ncclDebugLevel` after parse `NCCL_DEBUG_FILE`. This patch move parsing `tempNcclDebugLevel` before processing `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as previous behavior) Differential Revision: D38415208 fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30 * 2.14.3-1 Add support for improved fault tolerance: non-blocking mode, new init function with config, and ncclCommFinalize function. 
Reintroduce collnet+chain algorithm, alongside collnet+direct. Add LL protocol for intra-node P2P (on by default) and network communication (off by default). Use network instead of shared memory when performance is better. Fix: wait for CUDA graph destroy before destroying comm with linked graph resources. Remove aggressive polling during enqueue. Fix DMABUF fallback on MOFED 5.4 and earlier. * Fix intermittent 11.6 builds: generate unique .cu file for each object file * address review comments * Fix potential deadlock during init in multi-thread mode. Make sure all calls calling cudaMalloc (including devCommSetup) are called before the last bootstrapBarrier. That way, we avoid calls to cudaMalloc be blocked by a NCCL kernel launched on another GPU by another thread which completed init faster. Resolve #623. * Use compatibility shim only with static cudart Closes issue 658 * 2.15.1-1 Add support for H100 (sm90). Make sure NCCL kernel honor user stream priorities. * Fixes a double-free in the error path of ncclCommInitAll. Fixes https://github.com/NVIDIA/nccl/issues/726 * 2.15.5-1 Fix crash with CollnetChain on some node topologies Fix hang when interleaving the capture of different graphs Fix hang during init in multi-threaded mode Fix potential data corruption with LL128 protocol on unaligned buffers. Fix CPU usage during preconnect Fixes double-free in the error path for ncclCommInitAll Workaround hang on H100 with Ring/LL128 on 2 GPUs. * Add documentation for NCCL NET plugins Also repurpose dummy plugin as example, including headers and compat layers from v6 to v2. * Fix google-fastsocket plugin build * 2.16.2-1 Add support for CUDA 12.0, drop Kepler (sm_35). Support for H100 features. Make socket code more robust and protected. Solves #555. Improve performance on large CUDA graphs, reducing dependencies. Reduce inter-socket bandwidth on AMD CPUs to favor better paths. Various fixes to ncclCommAbort. Make service thread polling resistant to EINTR. Compile with profiling API by default. Extend NVTX instrumentation with call arguments. * Fix maximum handle size for NCCL Net v4 API NCCL Net v4 supports a maximum handle size of 64 bytes whereas the ext-net example header files set it for NCCL Net v3. Since, `aws-ofi-nccl` plugin plans to follow the example header files, fix it here. Signed-off-by: Rashika Kheria <[email protected]> * 2.16.5-1 Add support for 400Gbit NDR network adapters (CX7) Handle EINTR in socket poll() function Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead Resource cleanup fixes Fix double free in case of init failure Fix crash in ncclCommAbort Revert AMD speed commit * 2.17.1-1 Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only). Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName. Enable LL128 when we use PXN to close rings. NVTX3 includes update. Fix crash when one CollNet (SHARP) rail fails to initialize. 
* porting from rccl Add MSCCL Support #658 * continue porting for #658, add misc/msccl * porting from rccl Add MSCCL Support #694 * remove unnecessary change history comments * resolve the build-related issue * enable msccl when generating the topo info during initialization * fix nccl BF check issue * fix the memory alignment issue for the LL protocol * add fp8 support * add alltoall interface * add multi-memory support for fp8 * fix the test script bug that failed to generate the algo file in certain conditions * fix the simple protocol thread count issue and improve the test tool with operator support * fix test script issue * fix the memory conflict issue in the simple protocol * fix the ll128 shared memory issue and the 1-process-with-multiple-GPUs issue * remove the unnecessary code to enable cuda graph and add perf test scenarios * fix fp8 issue * fix max/min mismatch issue and ll128 shared memory issue * turn off the ops when the op index is equal to or larger than avg * optimize the test script to accommodate the number of gpus * support perf test env initialization * fix environment prepare bug * fix environment prepare bug * enable auto build if nccl was not built before * add customized topo and graph file support during test * enable ndv5 test scenarios * enable cuda graph for multiple nodes * initiate the multi-node test * enable ncv4 compatibility * fix multi-node test issue * enable multi-node for ndv4 test scenario * fix ib bandwidth test case issue * fix the fence, proxy, and sync flag setting related issues * unified the topo file name for different skus * add vmss creation script * fix test case issue for multi-node scenario * change the algo for multi-node scenario * change the maxbyte to a smaller value like 65565 for the multi-node scenario * remove test-related asset --------- Signed-off-by: Felix Abecassis <[email protected]> Signed-off-by: Rashika Kheria <[email protected]> Signed-off-by: Jonas Zhou <[email protected]> Co-authored-by: Boris Fomitchev <[email protected]> Co-authored-by: Boris Fomitchev <[email protected]> Co-authored-by: Pau Farré <[email protected]> Co-authored-by: Adam Paszke <[email protected]> Co-authored-by: Sylvain Jeaugey <[email protected]> Co-authored-by: Nathan Luehr <[email protected]> Co-authored-by: jiakai <[email protected]> Co-authored-by: Cliff Woolley <[email protected]> Co-authored-by: Kyle Fernandes, ne Jacobs <[email protected]> Co-authored-by: Peter Jin <[email protected]> Co-authored-by: Chad Whipkey <[email protected]> Co-authored-by: Ilya Biryukov <[email protected]> Co-authored-by: sclarkson <[email protected]> Co-authored-by: Obihörnchen <[email protected]> Co-authored-by: David Addison <[email protected]> Co-authored-by: Ke Wen <[email protected]> Co-authored-by: Alex Sergeev <[email protected]> Co-authored-by: Christian Sigg <[email protected]> Co-authored-by: Christian Sigg <[email protected]> Co-authored-by: Rong Ou <[email protected]> Co-authored-by: Cao Zongyan <[email protected]> Co-authored-by: Gustavo Alvarez <[email protected]> Co-authored-by: jakirkham <[email protected]> Co-authored-by: Rajat Chopra <[email protected]> Co-authored-by: Felix Abecassis <[email protected]> Co-authored-by: Hirochika Asai <[email protected]> Co-authored-by: Luke Yeager <[email protected]> Co-authored-by: Rashika Kheria <[email protected]> Co-authored-by: aokomoriuta <[email protected]> Co-authored-by: Riatre Foo <[email protected]> Co-authored-by: Eric Badger <[email protected]> Co-authored-by: Jack Snyder <[email protected]> Co-authored-by: xietingwew <[email protected]> 
Co-authored-by: Jonas Zhou <[email protected]> Co-authored-by: John Bachan <[email protected]> Co-authored-by: Chris Jones <[email protected]> Co-authored-by: Ke Wen <[email protected]> Co-authored-by: Chang Lan <[email protected]> Co-authored-by: void-main <[email protected]> Co-authored-by: Felix Abecassis <[email protected]> Co-authored-by: Christopher Hesse <[email protected]> Co-authored-by: Ching-Hsiang Chu <[email protected]> Co-authored-by: Jane Xu <[email protected]> Co-authored-by: root <root@msccl-dev-vm001.x2jpmuhl2viupllgh1ckdc4gxd.jx.internal.cloudapp.net> Co-authored-by: root <root@msccl-vm-02.htvlvqjeb4pexlwzwcnaprilfa.xx.internal.cloudapp.net> Co-authored-by: root <root@CDM10PrdGPC0100000U.irhdb45foede3hu3f0yq1jgp5c.cdmx.internal.cloudapp.net> Co-authored-by: root <root@CDM10PrdGPC01000005.uksqnlgezptuti3j2a4ygiva3d.cdmx.internal.cloudapp.net> Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net> Co-authored-by: root <root@msccl-dev000002.o3im4y5givoubowjunffr20noh.jx.internal.cloudapp.net> Co-authored-by: root <root@liand-h100-validation-vmss000003.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net> Co-authored-by: root <root@Msccl-Dev-000000.zfjavgfi4r0uxbdrouatln1g4f.jx.internal.cloudapp.net> * Enable msccl capability (#4) * Added files via upload * Delete libnccl-dev_1.1.1+cuda75_amd64.deb * Delete libnccl1_1.1.1+cuda75_amd64.deb * Use arch=5.3 as well * Version with . 7.5 * fixed version format * Removing Tegra * Enable compilation with old g++ when the default g++ is not supported (+5.0) * Add --no-as-needed to make sure that cudart library gets liked * Fix MPI test usage Only display usage from rank 0 and exit instead of continuing (and seg fault). * Fix random deadlock during ncclCommInitRank. * Fix readme to reflect the new test paths * Moved no-as-needed flag to link rule. Avoids link errors for tests linked with nvcc. * Fixed install location, new .deb version * Fixed version in ChangeLog * Makefile improvements - Use standard CXX env var - Permit redefinition of more env - Separate lib from tests * Removing unneeded includes * Better name for GENCODE * Bump to 1.2.2 * Gencodes changed to NV recommended * Changed CURAND generator to work on a wider set of platforms. * Make NCCL collectives work on communicators with only one rank * Only call the CUDA runtime. That may fix #27. * Updating for .deb rebuild * Include link to blog post in README.md * Rework debian packaging * Fix make install to use BUILDDIR * Move deb to build directory * Packaging : Generate shlibs.local * Increased version to 1.2.3 * Add a debug level to NCCL and CUDA versions at init * Fix version number * Improved Deb generation * Fixed redundant contexts in multi-process apps Change-Id: If787014450fd281304f0c7baf01d25963e40905d * Remove unneeded deb build script * link library with -lrt; otherwise there is undefined reference to shm_open * pass devlist as const int* rather than int* in ncclCommInitAll * Updated LICENCE.txt * Update LICENSE.txt * Fix MPI test path * Add profiling API * Heavy code refactoring to remove a lot of code in collectives (~1000 lines). Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern. 
* Make tests check for deltas and report bandwidth * Add scan tests * Improved allreduce segmentation for small sizes * NVML (libwrap) : import the needed definitions * Fix primitives function prototype * Bump to 1.3.1 * Add Fortran bindings * Add Copyright header to Fortran bindings source files * Remove irrelevant output from ncclReduce Fortran tests * Add a static library target "staticlib" to the Makefile. Rename the static library "libnccl_static.a" to disambiguate from the dynamic libraries. * Replace min BW by average BW in tests * 1.3.2 release Broadcast tuning Better checking of inputs Copy/reduce code simplification * Adding missing file * Fix 1.3.2 compilation * Qualify nullptr_t with std::. * Fix crash in Reduce when non-root ranks have invalid recvbuff * Fix copy/paste typo in error message * Only enable peer access for ring neighbors. This enables support for systems with more than 9 GPUs attached to a single PCIe root complex. * Bumping version to 1.3.3 * Fix compilation error when compiling with 'clang -x cuda'. Functions vFetch and vStore are not found by ADL with clang, so they need to be declared before usage in ReduceCopy. * Added Pascal nvcc flags, bumped version * Add support for CUDA9 half semantics * Update README to link to NCCL2 * Update README to link to NCCL2 #2 * Update README to link to NCCL2 part 3 * Update README to link to NCCL2 * fix tests on maxwell * 2.3.5-5 Add support for inter-node communication using sockets and InfiniBand/RoCE. Improve latency. Add support for aggregation. Improve LL/regular tuning. Remove tests as those are now at github.com/nvidia/nccl-tests . * Fix nccl-tests all_reduce_perf path It's `all_reduce_perf` not `allreduce_perf` * 2.3.7-1 Improved LL tuning for multi-node jobs. Improved bootstrap for large job scaling. Fixed a hang during bootstrap due to socket reuse. Added operation name to the COLL INFO logging. * Add install target Fix issue #145 * Add instructions to install packaging toolchain Address #143 and #150 : debuild not installed. * Add official builds download link * Generate nccl.h in build instead of src Generating nccl.h in src makes source directories dirty after builds. * Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156) * Add support for external network. Dynamically load external network from libnccl-net.so. Add init function in networks. Move PCI scoring to net.cu, only ask transport to provide a path. Simplify CUDA PCI path detection. Add dummy external network * Make network isend/irecv non blocking * Improve net API description * Rework SYSCHECK macros to better handle retries. SYSCHECKVAL was not retrying when a retry was needed. Since not all calls are inside a loop, that means we could silently miss an EINTR/EAGAIN return code. Also rework the socket connection code and improve error reporting. * Rework shared memory code to use SYSCHECK macros. This is to handle EINTR/EGAIN properly (issue #137), and also make the code consistent with the rest. Unfortunately posix_fallocate and mmap do not follow the classic return code/errno pattern, so we need to write wrappers around those functions. * Fixed some compilation errors when TRACE=1 set * Improve INFO message when external network is not found. Fix #162 * Add NCCL_NET flag to many debug lines. * Fix GPU Direct RDMA detection. Whether the network supported GPU Direct RDMA or not was ignored, causing sockets to break when cards were local enough that NCCL tried to use it. 
* Remove error logging from a normal path When initNet fails, we should not print the backtrace as it is supposed to be normal operation (falling back to sockets) * Fix dummy plugin * Fix #163 : remove warnings * Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former. * Two temporary workarounds for cuda-clang issues. * Qualify nullptr_t with std:: * Replace CUDA_VERSION by CUDART_VERSION * Fix memory leak in bootstrapRoot() * 2.4.2-1 Add tree algorithms for allreduce to improve performance at scale. Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle network errors and be permit recover. Detect initial CPU affinity and no longer escape it. * Fix crash during shared memory creation (#185) The shared memory filename was only based on the destination. While this was OK for rings since only one rank would send data to a given rank, it would crash with trees because they communicate in both directions. Co-authored-by: Rong Ou <[email protected]> * Fix share memory collision in multi-communicator case. Current SHM object name would only use pidHash and ranks as identification, which would collide each other when program runs with multiple communicators. Here we added commId info into pidHash, it makes 'pidHash'es of different communicators keeping in same process will be distincted with each other. * NCCL 2.4.6-1 Added detection of IBM/Power NVLink bridge device. Add N…
1 parent c0a0acc commit cc07513

File tree

187 files changed: +45721 -458 lines changed


.gitignore

+4-398
Large diffs are not rendered by default.

LICENSE

+40-21
@@ -1,21 +1,40 @@
-MIT License
-
-Copyright (c) Microsoft Corporation.
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE
+
+Copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved.
+Modifications Copyright (c) Microsoft Corporation. Licensed under the MIT License.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions
+are met:
+ * Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+ * Neither the name of NVIDIA CORPORATION, Lawrence Berkeley National
+   Laboratory, the U.S. Department of Energy, nor the names of their
+   contributors may be used to endorse or promote products derived
+   from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+The U.S. Department of Energy funded the development of this software
+under subcontract 7078610 with Lawrence Berkeley National Laboratory.
+
+
+This code also includes files from the NVIDIA Tools Extension SDK project.
+
+See:
+
+https://github.com/NVIDIA/NVTX
+
+for more information and license details.

LICENSE.txt

+40
@@ -0,0 +1,40 @@

Copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved.
Modifications Copyright (c) Microsoft Corporation. Licensed under the MIT License.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
 * Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
 * Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.
 * Neither the name of NVIDIA CORPORATION, Lawrence Berkeley National
   Laboratory, the U.S. Department of Energy, nor the names of their
   contributors may be used to endorse or promote products derived
   from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The U.S. Department of Energy funded the development of this software
under subcontract 7078610 with Lawrence Berkeley National Laboratory.


This code also includes files from the NVIDIA Tools Extension SDK project.

See:

https://github.com/NVIDIA/NVTX

for more information and license details.

Makefile

+31
@@ -0,0 +1,31 @@
#
# Copyright (c) 2015-2019, NVIDIA CORPORATION. All rights reserved.
#
# See LICENSE.txt for license information
#
.PHONY : all clean

default : src.build
install : src.install
BUILDDIR ?= $(abspath ./build)
ABSBUILDDIR := $(abspath $(BUILDDIR))
TARGETS := src pkg
clean: ${TARGETS:%=%.clean}
test.build: src.build
LICENSE_FILES := LICENSE.txt
LICENSE_TARGETS := $(LICENSE_FILES:%=$(BUILDDIR)/%)
lic: $(LICENSE_TARGETS)

${BUILDDIR}/%.txt: %.txt
	@printf "Copying %-35s > %s\n" $< $@
	mkdir -p ${BUILDDIR}
	cp $< $@

src.%:
	${MAKE} -C src $* BUILDDIR=${ABSBUILDDIR}

pkg.%:
	${MAKE} -C pkg $* BUILDDIR=${ABSBUILDDIR}

pkg.debian.prep: lic
pkg.txz.prep: lic
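
As a quick illustration of how the pattern rules above are used (a minimal sketch; only the target names come from the Makefile shown here, and the custom `BUILDDIR` path is illustrative), `src.%` and `pkg.%` forward everything after the dot to the corresponding sub-make, with `BUILDDIR` resolved to an absolute path:

```sh
$ # "src.build" matches the src.% rule, so it effectively runs: make -C src build BUILDDIR=<absolute path>
$ make src.build BUILDDIR=/tmp/msccl-build
$ # "pkg.debian.build" matches pkg.% and is forwarded to the pkg/ sub-make the same way
$ make pkg.debian.build BUILDDIR=/tmp/msccl-build
```

The `lic` target is a prerequisite of `pkg.debian.prep` and `pkg.txz.prep`, so building those prep targets copies LICENSE.txt into `BUILDDIR` first.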

README.md

+78-21
@@ -1,33 +1,90 @@
-# Project
+# MSCCL-EXECUTOR-NCCL
 
-> This repo has been populated by an initial template to help get you started. Please
-> make sure to update the content to build a great experience for community-building.
+Microsoft Collective Communication Library Executor on NCCL (MSCCL-EXECUTOR-NCCL) is an inter-accelerator communication framework that is built on top of [NCCL](https://github.com/nvidia/nccl) and uses its building blocks to execute custom-written collective communication algorithms.
 
-As the maintainer of this project, please make a few updates:
+## Introduction
 
-- Improving this README.MD file to provide a great experience
-- Updating SUPPORT.MD with content about this project's support experience
-- Understanding the security reporting process in SECURITY.MD
-- Remove this section from the README
+MSCCL-EXECUTOR-NCCL is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, all-to-all, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. MSCCL-EXECUTOR-NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications. To achieve this, MSCCL has multiple capabilities:
 
-## Contributing
+- Programmability: Interconnections among accelerators have different latencies and bandwidths. Therefore, a generic collective communication algorithm does not necessarily perform well for all topologies and buffer sizes. MSCCL-EXECUTOR-NCCL allows a user to write a hyper-optimized collective communication algorithm for a given topology and buffer size. This is possible through two main components: the [MSCCL toolkit](https://github.com/microsoft/msccl-tools) and [MSCCL-EXECUTOR-NCCL](https://github.com/Azure/msccl-executor-nccl) (this repo). The MSCCL toolkit contains a high-level DSL (MSCCLang) and a compiler which generates an IR for the MSCCL runtime (this repo) to run on the backend. MSCCL will automatically fall back to NCCL's generic algorithm in case there is no custom algorithm. [Example](#Example) provides some instances of how the MSCCL toolkit works with the runtime. Please refer to the [MSCCL toolkit](https://github.com/microsoft/msccl-tools) for more information.
+- Profiling: MSCCL-EXECUTOR-NCCL has a profiling tool, [NPKit](https://github.com/microsoft/npkit), which provides a detailed timeline for each primitive send and receive operation to understand the bottlenecks in a given collective communication algorithm.
 
-This project welcomes contributions and suggestions. Most contributions require you to agree to a
-Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
-the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
+## Build
+
+To build the library:
+
+```sh
+$ git clone https://github.com/microsoft/msccl.git --recurse-submodules
+$ cd msccl/executor/msccl-executor-nccl
+$ make -j src.build
+```
+
+If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with:
+
+```sh
+$ make src.build CUDA_HOME=<path to cuda install>
+```
+
+MSCCL-EXECUTOR-NCCL will be compiled and installed in `build/` unless `BUILDDIR` is set.
+
+By default, MSCCL-EXECUTOR-NCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining `NVCC_GENCODE` (defined in `makefiles/common.mk`) to only include the architecture of the target platform:
+```sh
+$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"
+```
+
+## Install
+
+To install MSCCL-EXECUTOR-NCCL on the system, create a package then install it as root.
 
-When you submit a pull request, a CLA bot will automatically determine whether you need to provide
-a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
-provided by the bot. You will only need to do this once across all repos using our CLA.
+Debian/Ubuntu:
+```sh
+$ # Install tools to create debian packages
+$ sudo apt install build-essential devscripts debhelper fakeroot
+$ # Build MSCCL-EXECUTOR-NCCL deb package
+$ make pkg.debian.build
+$ ls build/pkg/deb/
+```
 
-This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
-For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
-contact [[email protected]](mailto:[email protected]) with any additional questions or comments.
+RedHat/CentOS:
+```sh
+$ # Install tools to create rpm packages
+$ sudo yum install rpm-build rpmdevtools
+$ # Build MSCCL-EXECUTOR-NCCL rpm package
+$ make pkg.redhat.build
+$ ls build/pkg/rpm/
+```
 
-## Trademarks
+OS-agnostic tarball:
+```sh
+$ make pkg.txz.build
+$ ls build/pkg/txz/
+```
 
-This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
-trademarks or logos is subject to and must follow
+## Tests
+
+Tests for MSCCL-EXECUTOR-NCCL are maintained separately at https://github.com/Azure/msccl-tests-nccl.
+
+```sh
+$ git clone https://github.com/Azure/msccl-tests-nccl.git
+$ cd msccl-tests-nccl
+$ make
+$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <ngpus>
+```
+
+For more information on NCCL usage, please refer to the [NCCL documentation](https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html).
+
+This project welcomes contributions and suggestions. Most contributions require you to agree to a
+Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
+the rights to use your contribution. For details, visit [CLA](https://cla.opensource.microsoft.com).
+
+This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
+trademarks or logos is subject to and must follow
 [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
 Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
 Any use of third-party trademarks or logos are subject to those third-party's policies.
+
+## Copyright
+
+All source code and accompanying documentation is copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
+
+All modifications are copyright (c) 2022-2023, Microsoft Corporation. All rights reserved.

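The Programmability bullet in the README above only sketches the workflow at a high level: the MSCCL toolkit compiles an MSCCLang program into an XML IR, and this executor loads it at run time, otherwise falling back to NCCL's generic algorithms. Below is a minimal end-to-end sketch of that flow; it assumes the example scripts shipped with msccl-tools and the `MSCCL_XML_FILES` / `NCCL_ALGO` environment variables used by the MSCCL runtime, and the script path, file names, and GPU counts are purely illustrative.

```sh
$ # Compile an example MSCCLang all-reduce program into the XML IR consumed by the executor
$ # (script path and arguments are illustrative; see the msccl-tools examples)
$ git clone https://github.com/microsoft/msccl-tools.git
$ pip install ./msccl-tools
$ python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > allreduce.xml
$ # Point the executor at the generated algorithm and run the perf test built in the Tests section;
$ # if no custom algorithm matches, MSCCL falls back to NCCL's generic RING/TREE algorithms
$ MSCCL_XML_FILES=allreduce.xml NCCL_ALGO=MSCCL,RING,TREE ./msccl-tests-nccl/build/all_reduce_perf -b 8 -e 256M -f 2 -g 8
```

Listing `MSCCL` alongside `RING,TREE` in `NCCL_ALGO` (assuming those variable semantics carry over from the MSCCL runtime) lets the custom algorithm be used when it applies while keeping NCCL's stock algorithms available as a fallback.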
SUPPORT.md

+9-18
@@ -1,25 +1,16 @@
-# TODO: The maintainer of this repo has not yet edited this file
-
-**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?
-
-- **No CSS support:** Fill out this template with information about how to file issues and get help.
-- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
-- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.
-
-*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*
-
 # Support
 
-## How to file issues and get help
+## How to file issues and get help
 
-This project uses GitHub Issues to track bugs and feature requests. Please search the existing
-issues before filing new issues to avoid duplicates. For new issues, file your bug or
-feature request as a new Issue.
+This project uses [GitHub Issues] to track bugs and feature requests. Please search the existing
+issues before filing new issues to avoid duplicates. For new issues, file your bug or
+feature request as a new issue.
 
-For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE
-FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
-CHANNEL. WHERE WILL YOU HELP PEOPLE?**.
+For help and questions about using this project, please create a new post in [GitHub Discussions].
 
-## Microsoft Support Policy
+## Microsoft Support Policy
 
 Support for this **PROJECT or PRODUCT** is limited to the resources listed above.
+
+[GitHub Issues]: https://github.com/Azure/msccl-executor-nccl/issues
+[GitHub Discussions]: https://github.com/Azure/msccl-executor-nccl/discussions

cgmanifest.json

+14
@@ -0,0 +1,14 @@
{
  "registrations": [
    {
      "component": {
        "type": "git",
        "git": {
          "repositoryUrl": "https://github.com/NVIDIA/nccl.git",
          "commitHash": "5d3ab08"
        }
      }
    }
  ]
}

0 commit comments