[Rem Allocator] Allocation failed #555

Open
xutianming opened this issue Aug 17, 2021 · 6 comments

xutianming commented Aug 17, 2021

[screenshot: NCCL "Rem Allocator" allocation failure log]

When training with A100 GPUs, I encountered an OOM error from the Rem Allocator. As far as I know, the remote allocator is only used for send/receive between GPUs that do not have a direct NVLink connection but can communicate through an intermediate GPU.

But in my case, the GPUs are fully connected through NVLink. Why did I still hit this error?
cc @sjeaugey

[screenshot: NCCL debug log]

@sjeaugey (Member)

Is this issue reproducible, or did it only happen once? It seems the remote allocation thread was sent a message that was not meant for it, hence the random errors: the socket closes too soon, and the first 64 bits sent -- the size -- are random data, hence beyond the GPU memory size.

Would you have more of the log, in particular what happens on the main threads after they mistakenly communicate with the remote memory allocation thread? (The main threads are the ones printing "NCCL INFO ... via P2P/IPC/read".)
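To make the failure mode concrete, here is a minimal C sketch (an illustration, not NCCL's actual code) of a size-prefixed allocation service like the one described above. A stray client that connects and sends unrelated bytes makes the 64-bit size header garbage, and a half-closed socket produces the "closed too soon" symptom; the names `allocRemote` and `handleRequest` are hypothetical.

```c
/* Hypothetical sketch of a remote-allocation service thread that trusts
 * the first 8 bytes on a new connection as an allocation size. */
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Stand-in for the device allocation the service would perform. */
static int allocRemote(uint64_t size) {
  const uint64_t gpuMemBytes = 40ULL << 30;  /* e.g. a 40 GB A100 */
  if (size > gpuMemBytes) {
    /* Random header bytes easily exceed device memory: the
     * "[Rem Allocator] Allocation failed" symptom. */
    fprintf(stderr, "Rem Allocator: allocation of %llu bytes failed\n",
            (unsigned long long)size);
    return -1;
  }
  /* cudaMalloc(&ptr, size) would go here. */
  return 0;
}

/* Per-connection handler: read the 64-bit size header, then allocate. */
int handleRequest(int fd) {
  uint64_t size;
  ssize_t n = recv(fd, &size, sizeof(size), MSG_WAITALL);
  if (n != (ssize_t)sizeof(size)) {
    /* Peer closed before sending a full header: the "Connection
     * closed by remote peer" symptom. */
    return -1;
  }
  /* Nothing here proves the sender was a NCCL main thread, so a stray
   * connection makes `size` random data. */
  return allocRemote(size);
}
```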

@xutianming (Author)

@sjeaugey The issue occurs rarely; I have only hit it once. The main threads continued working without printing any more logs. So the error can be safely ignored, right?

I am trying to reproduce it with NCCL_DEBUG_SUBSYS=INIT,ALLOC to get more logs.

@sjeaugey (Member) commented Aug 18, 2021

No need to add ALLOC to the DEBUG_SUBSYS list; that would probably make the log very large without making it more useful. Please keep NCCL_DEBUG_SUBSYS unset (the default).

I'd like to see what the main thread was doing when it connected to the remote memory allocation thread, and whether there was any error after that which would indicate where the problem lies. The only thing I can see in your screenshot is what happened just before.

@xutianming (Author)

[screenshot: full NCCL log]
These are all the NCCL logs I got, and the job continued without other errors. Since I got "Connection closed by remote peer" here, I guess there might be an unexpectedly failed socket/queue pair?

@sjeaugey (Member)

Okay, it is surprising that the job managed to continue without errors... Now, if that is not NCCL connecting to the remote allocator (e.g., some random other service connecting to the wrong port), the remote allocator will simply ignore the request, and indeed everything else will work just fine.

@chr1sj0nes (Contributor)

We have also seen spurious remote allocation requests. We're still trying to understand their source, but I've sent a PR (#599) that should provide some basic protection against them.
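The PR's diff is not quoted in this thread; one basic protection of this kind is to require a magic header before trusting the size field. A hypothetical sketch, continuing the earlier illustration (the constant and function names are illustrative, not taken from #599):

```c
/* Hypothetical hardening of handleRequest() from the sketch above:
 * reject any connection that does not present the expected magic value. */
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

#define REM_ALLOC_MAGIC 0x4e43434c4e434c43ULL  /* illustrative value */

int handleRequestChecked(int fd) {
  uint64_t hdr[2];  /* hdr[0] = magic, hdr[1] = requested size */
  ssize_t n = recv(fd, hdr, sizeof(hdr), MSG_WAITALL);
  if (n != (ssize_t)sizeof(hdr) || hdr[0] != REM_ALLOC_MAGIC) {
    fprintf(stderr, "Rem Allocator: ignoring spurious request\n");
    return -1;  /* close the socket; never attempt the allocation */
  }
  /* hdr[1] can now be treated as a size, though it is still worth
   * bounds-checking against available device memory. */
  return 0;
}
```

A magic check does not authenticate the peer, but it filters out accidental connections from unrelated services, which matches the failure mode discussed above.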

sjeaugey added a commit that referenced this issue Dec 2, 2022
Add support for CUDA 12.0, drop Kepler (sm_35).
Support for H100 features.
Make socket code more robust and protected. Solves #555.
Improve performance on large CUDA graphs, reducing dependencies.
Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
Various fixes to ncclCommAbort.
Make service thread polling resistant to EINTR.
Compile with profiling API by default.
Extend NVTX instrumentation with call arguments.
spease-fb pushed a commit to spease-fb/nccl that referenced this issue Dec 21, 2022
Squashed commit of the following:

commit 28189e2
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 29 04:27:46 2022 -0800

    2.16.2-1

    Add support for CUDA 12.0, drop Kepler (sm_35).
    Support for H100 features.
    Make socket code more robust and protected. Solves NVIDIA#555.
    Improve performance on large CUDA graphs, reducing dependencies.
    Reduce inter-socket bandwidth on AMD CPUs to favor better paths.
    Various fixes to ncclCommAbort.
    Make service thread polling resistant to EINTR.
    Compile with profiling API by default.
    Extend NVTX instrumentation with call arguments.

commit 614b49f
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Nov 22 02:13:13 2022 -0800

    Fix google-fastsocket plugin build

commit 55b1d8a
Author: Sylvain Jeaugey <[email protected]>
Date:   Mon Nov 21 06:03:27 2022 -0800

    Add documentation for NCCL NET plugins

    Also repurpose dummy plugin as example, including headers and
    compat layers from v6 to v2.

commit 2f4cb87
Merge: d128d62 cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 01:15:22 2022 -0700

    Merge tag 'v2.15.5-1'

commit cb111f7
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Oct 25 00:55:55 2022 -0700

    2.15.5-1

    Fix crash with CollnetChain on some node topologies
    Fix hang when interleaving the capture of different graphs
    Fix hang during init in multi-threaded mode
    Fix potential data corruption with LL128 protocol on unaligned buffers.
    Fix CPU usage during preconnect
    Fixes double-free in the error path for ncclCommInitAll
    Workaround hang on H100 with Ring/LL128 on 2 GPUs.

commit d128d62
Merge: 2401f4a da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Fri Oct 7 11:00:26 2022 -0700

    Merge tag 'v2.15.1-1'

commit 2401f4a
Author: John Bachan <[email protected]>
Date:   Mon Oct 3 17:02:15 2022 -0700

    Fixes a double-free in the error path of ncclCommInitAll.

    Fixes NVIDIA#726

commit da8152e
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:31:13 2022 -0700

    2.15.1-1

    Add support for H100 (sm90).
    Make sure NCCL kernel honor user stream priorities.

commit 99c28f2
Merge: 78313a6 ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Tue Sep 27 02:24:41 2022 -0700

    Merge remote-tracking branch 'origin/master'

commit 78313a6
Author: Cliff Woolley <[email protected]>
Date:   Fri Aug 26 15:00:18 2022 -0700

    Use compatibility shim only with static cudart

    Closes issue 658

commit ecab28a
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Sep 22 01:04:50 2022 -0700

    Fix potential deadlock during init in multi-thread mode.

    Make sure all calls to cudaMalloc (including devCommSetup) happen before
    the last bootstrapBarrier. That way, we avoid cudaMalloc calls being
    blocked by a NCCL kernel launched on another GPU by another thread that
    completed init faster.

    Resolve NVIDIA#623.

commit f89fd47
Author: Jane Xu <[email protected]>
Date:   Wed Sep 14 11:16:17 2022 -0400

    address review comments

commit 79fb032
Author: Jane Xu <[email protected]>
Date:   Tue Sep 13 16:05:21 2022 -0400

    Fix intermittent 11.6 builds: generate unique .cu file for each object file

commit c4e2aa6
Author: Sylvain Jeaugey <[email protected]>
Date:   Thu Aug 18 02:53:17 2022 -0700

    2.14.3-1

    Add support for improved fault tolerance: non-blocking mode, new
    init function with config, and ncclCommFinalize function.
    Reintroduce collnet+chain algorithm, alongside collnet+direct.
    Add LL protocol for intra-node P2P (on by default) and network
    communication (off by default).
    Use network instead of shared memory when performance is better.
    Fix: wait for CUDA graph destroy before destroying comm with linked
    graph resources.
    Remove aggressive polling during enqueue.
    Fix DMABUF fallback on MOFED 5.4 and earlier.

commit e1d9b27
Author: Ching-Hsiang Chu <[email protected]>
Date:   Wed Aug 3 20:47:40 2022 -0700

    fix NCCL_DEBUG_FILE

    Summary: NCCL_DEBUG_FILE does not work properly since the recent v2.13.4 updates (NVIDIA#682) because it now sets `ncclDebugLevel` after parsing `NCCL_DEBUG_FILE`. This patch moves the parsing of `tempNcclDebugLevel` before the processing of `NCCL_DEBUG_FILE` to ensure `NCCL_DEBUG_FILE` is parsed only when `NCCL_DEBUG > NCCL_LOG_VERSION` (same as the previous behavior).

    Differential Revision: D38415208

    fbshipit-source-id: 5689bbb798e73efb9e8594557666987f07e89a30
facebook-github-bot pushed a commit to facebookresearch/nccl that referenced this issue Dec 23, 2022
Summary: Squashed commit of the following (the same squashed commit log as above).

Pull Request resolved: #17

Test Plan:
These commands are to be applied to the whole stack rather than this specific commit:

```
buck2 run fbcode//mode/opt -c hpc_comms.use_nccl=exp fbcode//param_bench/train/comms/cpp/nccl-tests:nccl_tests_launcher -- --launcher mast --hw tc_any --nnode 2 --collective allreduce,alltoall,allgather,reducescatter --nccl-args "-b 4 -e 1G -f 2 -z 1" --dp ai_system_sw_hw_co-design_cws --entitlement codesign
```
https://www.internalfb.com/mast/job/torchx-nccl-test-allreduce-alltoall-allgather-reducescatter-3301bc

Reviewed By: agangidi53

Differential Revision: D42194038

Pulled By: spease-fb

fbshipit-source-id: e2cde6c44bcb8494c9dd02e926938f27bbc8f43b