[Rem Allocator] Allocation failed #555

Open
xutianming opened this issue Aug 17, 2021 · 6 comments

xutianming commented Aug 17, 2021

[screenshot: NCCL "Rem Allocator" allocation failure log]

When training with A100 GPUs, I encountered an OOM error from the Rem Allocator. As far as I know, the remote allocator is only used for send/receive between GPUs that do not have a direct NVLink connection but can communicate through an intermediate GPU.

But in my case, the GPUs are fully connected through NVLink. Why did I still hit this error?
cc @sjeaugey

[screenshot: NCCL debug log]

@sjeaugey (Member)

Is this issue reproducible, or did it only happen once? It seems the remote allocation thread was sent a message that was not meant for it, hence the random errors: the socket closes too soon, and the first 64 bits sent -- the size -- are random data, hence beyond the GPU memory size.

Would you have more of the log, in particular what happens on the main threads after they mistakenly communicate with the remote memory allocation thread? (The main threads are the ones printing "NCCL INFO ... via P2P/IPC/read".)
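To make the failure mode concrete, here is a minimal C sketch (an illustration, not NCCL's actual code) of a size-prefixed allocation service like the one described above. A stray client that connects and sends unrelated bytes makes the 64-bit size header garbage, and a half-closed socket produces the "closed too soon" symptom; the names `allocRemote` and `handleRequest` are hypothetical.

```c
/* Hypothetical sketch of a remote-allocation service thread that trusts
 * the first 8 bytes on a new connection as an allocation size. */
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Stand-in for the device allocation the service would perform. */
static int allocRemote(uint64_t size) {
  const uint64_t gpuMemBytes = 40ULL << 30;  /* e.g. a 40 GB A100 */
  if (size > gpuMemBytes) {
    /* Random header bytes easily exceed device memory: the
     * "[Rem Allocator] Allocation failed" symptom. */
    fprintf(stderr, "Rem Allocator: allocation of %llu bytes failed\n",
            (unsigned long long)size);
    return -1;
  }
  /* cudaMalloc(&ptr, size) would go here. */
  return 0;
}

/* Per-connection handler: read the 64-bit size header, then allocate. */
int handleRequest(int fd) {
  uint64_t size;
  ssize_t n = recv(fd, &size, sizeof(size), MSG_WAITALL);
  if (n != (ssize_t)sizeof(size)) {
    /* Peer closed before sending a full header: the "Connection
     * closed by remote peer" symptom. */
    return -1;
  }
  /* Nothing here proves the sender was a NCCL main thread, so a stray
   * connection makes `size` random data. */
  return allocRemote(size);
}
```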

@xutianming (Author)

@sjeaugey The issue occurs rarely; I have only hit it once. The main threads continued working without printing any more logs. So the error can be safely ignored, right?

I am trying to reproduce it with NCCL_DEBUG_SUBSYS=INIT,ALLOC to get more logs.

@sjeaugey (Member) commented Aug 18, 2021

No need to add ALLOC to the DEBUG_SUBSYS list; that would probably make the log very large without making it more useful. Please keep NCCL_DEBUG_SUBSYS unset (the default).

I'd like to see what the main thread was doing when it connected to the remote memory allocation thread, and whether there was any error after that which would indicate where the problem lies. The only thing I can see in your screenshot is what happened just before.

@xutianming (Author)

[screenshot: full NCCL log]
These are all the NCCL logs I got, and the job continued without other errors. Since I got "Connection closed by remote peer" here, I guess there might be an unexpectedly failed socket/queue pair?

@sjeaugey (Member)

Okay, it is surprising that the job managed to continue without errors... Now, if that is not NCCL connecting to the remote allocator (e.g., some random other service connecting to the wrong port), the remote allocator will simply ignore the request, and indeed everything else will work just fine.

@chr1sj0nes (Contributor)

We have also seen spurious remote allocation requests. We're still trying to understand their source, but I've sent a PR (#599) that should provide some basic protection against them.
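The PR's diff is not quoted in this thread; one basic protection of this kind is to require a magic header before trusting the size field. A hypothetical sketch, continuing the earlier illustration (the constant and function names are illustrative, not taken from #599):

```c
/* Hypothetical hardening of handleRequest() from the sketch above:
 * reject any connection that does not present the expected magic value. */
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

#define REM_ALLOC_MAGIC 0x4e43434c4e434c43ULL  /* illustrative value */

int handleRequestChecked(int fd) {
  uint64_t hdr[2];  /* hdr[0] = magic, hdr[1] = requested size */
  ssize_t n = recv(fd, hdr, sizeof(hdr), MSG_WAITALL);
  if (n != (ssize_t)sizeof(hdr) || hdr[0] != REM_ALLOC_MAGIC) {
    fprintf(stderr, "Rem Allocator: ignoring spurious request\n");
    return -1;  /* close the socket; never attempt the allocation */
  }
  /* hdr[1] can now be treated as a size, though it is still worth
   * bounds-checking against available device memory. */
  return 0;
}
```

A magic check does not authenticate the peer, but it filters out accidental connections from unrelated services, which matches the failure mode discussed above.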

sjeaugey added a commit that referenced this issue Dec 2, 2022
Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.
spease-fb pushed a commit to spease-fb/nccl that referenced this issue Dec 21, 2022
Squashed commit of the following:

commit 28189e2
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 29 04:27:46 2022 -0800

    2.16.2-1

    Add support for CUDA 12.0, drop Kepler (sm_35).
    Support for H100 features.
    Make socket code more robust and protected. Solves NVIDIA#555.
    Improve performance on large CUDA graphs, reducing dependencies.
    Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
    Various fixes to ncclCommAbort.
    Make service thread polling resistant to EINTR.
    Compile with profiling API by default.
    Extend NVTX instrumentation with call arguments.

commit 614b49f
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 22 02:13:13 2022 -0800

    Fix google-fastsocket plugin build

commit 55b1d8a
Author: Sylvain Jeaugey <[email protected]>
Date:   Mon Nov 21 06:03:27 2022 -0800

    Add documentation for NCCL NET plugins

    Also repurpose dummy plugin as example, including headers and
    compat layers from v6 to v2.

commit 2f4cb87
Merge: d128d62 cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 01:15:22 2022 -0700

    Merge tag 'v2.15.5-1'

commit cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 00:55:55 2022 -0700

    2.15.5-1

    Fix crash with CollnetChain on some node topologies
    Fix hang when interleaving the capture of different graphs
    Fix hang during init in multi-threaded mode
    Fix potential data corruption with LL128 protocol on unaligned buffers.
    Fix CPU usage during preconnect
    Fixes double-free in the error path for ncclCommInitAll
    Workaround hang on H100 with Ring/LL128 on 2 GPUs.

commit d128d62
Merge: 2401f4a da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Fri Oct 7 11:00:26 2022 -0700

    Merge tag 'v2.15.1-1'

commit 2401f4a
Author: John Bachan <[email protected]>
Date:   Mon Oct 3 17:02:15 2022 -0700

    Fixes a double-free in the error path of ncclCommInitAll.

    Fixes NVIDIA#726

commit da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:31:13 2022 -0700

    2.15.1-1

    Add support for H100 (sm90).
    Make sure NCCL kernel honor user stream priorities.

commit 99c28f2
Merge: 78313a6 ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:24:41 2022 -0700

    Merge remote-tracking branch 'origin/master'

commit 78313a6
Author: Cliff Woolley <[email protected]>
Date:   Fri Aug 26 15:00:18 2022 -0700

    Use compatibility shim only with static cudart

    Closes issue 658

commit ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Sep 22 01:04:50 2022 -0700

    Fix potential deadlock during init in multi-thread mode.

    Make sure all calls to cudaMalloc (including devCommSetup) happen before
    the last bootstrapBarrier. That way, we avoid cudaMalloc calls being
    blocked by a NCCL kernel launched on another GPU by another thread that
    completed init faster.

    Resolve NVIDIA#623.

commit f89fd47
Author: Jane Xu <[email protected]>
Date:   Wed Sep 14 11:16:17 2022 -0400

    address review comments

commit 79fb032
Author: Jane Xu <[email protected]>
Date:   Tue Sep 13 16:05:21 2022 -0400

    Fix intermittent 11.6 builds: generate unique .cu file for each object file

commit c4e2aa6
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Aug 18 02:53:17 2022 -0700

    2.14.3-1

    Add support for improved fault tolerance: non-blocking mode, new
    init function with config, and ncclCommFinalize function.
    Reintroduce collnet+chain algorithm, alongside collnet+direct.
    Add LL protocol for intra-node P2P (on by default) and network
    communication (off by default).
    Use network instead of shared memory when performance is better.
    Fix: wait for CUDA graph destroy before destroying comm with linked
    graph resources.
    Remove aggressive polling during enqueue.
    Fix DMABUF fallback on MOFED 5.4 and earlier.

commit e1d9b27
Author: Ching-Hsiang Chu <[email protected]>
Date:   Wed Aug 3 20:47:40 2022 -0700

    fix NCCL_DEBUG_FILE

    Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (NVIDIA#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

    Differential Revision: D38415208

    fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
facebook-github-bot pushed a commit to facebookresearch/nccl that referenced this issue Dec 23, 2022
Summary: Squashed commit of the following (the same squashed commit log as above).

Pull Request resolved: #17

Test Plan:
These commands are to be applied to the whole stack rather than this specific commit:

```
buck2 run fbcode//mode/opt -c hpc_comms.use_nccl=exp fbcode//param_bench/train/comms/cpp/nccl-tests:nccl_tests_launcher -- --launcher mast --hw tc_any --nnode 2 --collective allreduce,alltoall,allgather,reducescatter --nccl-args "-b 4 -e 1G -f 2 -z 1" --dp ai_system_sw_hw_co-design_cws --entitlement codesign
```
https://www.internalfb.com/mast/job/torchx-nccl-test-allreduce-alltoall-allgather-reducescatter-3301bc

Reviewed By: agangidi53

Differential Revision: D42194038

Pulled By: spease-fb

fbshipit-source-id: e2cde6c44bcb8494c9dd02e926938f27bbc8f43b