Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support building NCCL with RDMA-core [part 2 of 2]: NCCL_BUILD_RDMA_CORE #802

Closed
wants to merge 3 commits into from

Conversation

vtlam
Copy link

@vtlam vtlam commented Mar 14, 2023

No description provided.

@vtlam vtlam changed the title Support building NCCL with RDMA-core #2: NCCL_BUILD_RDMA_CORE Support building NCCL with RDMA-core [part 2 of 2]: NCCL_BUILD_RDMA_CORE Mar 14, 2023
@@ -21,11 +21,16 @@ ncclResult_t wrap_ibv_symbols(void) {
}

/* CHECK_NOT_NULL: helper macro to check for NULL symbol only with dynamic loading */
#ifdef NCCL_BUILD_RDMA_CORE
#define CHECK_NOT_NULL(container, internal_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we even need that. If the check is defined as a no-op, then it means we consider the function pointer cannot be NULL, and therefore the check will always be ok. Unless, that is to optimize performance?

Copy link
Author

@vtlam vtlam Mar 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that is for performance. Certain verb calls (e.g., ibv_post_send(), ibv_post_recv(), ibv_poll_cq()) can happen quite frequently. That said, I am ok with removing it if you have a strong opinion on pointer safety.

@vtlam vtlam force-pushed the rdma_core_build branch from 18fabf7 to 0386f73 Compare March 21, 2023 02:56
sjeaugey added a commit that referenced this pull request Apr 19, 2023
Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
@sjeaugey
Copy link
Member

sjeaugey commented Nov 9, 2023

This has been added as part of NCCL 2.19.3. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants