-
Notifications
You must be signed in to change notification settings - Fork 878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Print remote and local GID in error trace #779
base: master
Are you sure you want to change the base?
Conversation
Add local and remote GID info in error trace. Modified ncclIbSendComm and ncclIbReceiveComm to include remote and local GID information.
Thanks for the request. It makes sense. I'm not sure I'd have implemented it that way, but this would indeed help debugging. We're pretty busy at the moment so not sure we'll have time to work on that soon, but please ping us again if we don't make any progress. |
src/transport/net_ib.cc
Outdated
ncclSocketToString(&addr, line), wc->status, wc->opcode, wc->byte_len, wc->vendor_err); | ||
if (r->sendComm != NULL) { | ||
inet_ntop(AF_INET6, &(r->sendComm->localGid), localGid, sizeof(localGid)); | ||
inet_ntop(AF_INET6, &(r->sendComm->remoteGid), remoteGid, sizeof(remoteGid)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These inet_ntop() calls may fail by returning NULL (see https://man7.org/linux/man-pages/man3/inet_ntop.3.html). For completeness, I think we should check inet_ntop() failures, in which case we can either fall back to the old WARN message or just print "unknown" for whichever GID that failed?
Also I think you can refactor this error handle out to a helper method if you feel it is getting complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the logic to handle inet_ntop failures.
Add comment for remoteGid and localGid varialbe in recv/send comm to explain what they are and why we need them in the struct.
Since it is possible for inet_ntop to fail, add logic to handle the case where inet_ntop failed to parse GID as an INET6 address. Also combined some common logic to make the code a little cleaner.
Fix a bug that will cause segfault due to reading sendComm in the path where it is supposed to read readComm.
There are compilation warnings from your patch. I believe the WARN messages are not being passed the correct string parameters:
|
Previeously the WARN message was using the ibv_gid as string by mistake. Change it to use the correct string version of localGid and remoteGid.
Ah thanks for the heads up! Yes I am supposed to use the string version. Just changed it. |
Add support for IB SHARP to NVLS (NVLink SHARP algorithm). Add NVLS+Tree algorithm. Add support for memory management using cuMem* functions. Use all NICs for Send/Receive operations on systems with more than one NIC per GPU (#804). Add ncclCommSplit primitive, with resource sharing option in config. Fix alltoallv hang (#788) Increase number of channels on H100 when we're not limited by NVLink. Improve error reporting in case of IB failure, printing local and remote ID (#779). Add build option to allow compilation against RDMA includes instead of dynamically loading IB verbs symbols (#802). Fix context creation for progress thread (#803). NET/IB: add option to use multiple QPs in round-robin mode. Fix tree performance issue when NVB is disabled on HCM topologies.
The remote and local GID information would be very useful in finding the root cause of a network issue. It could help us quickly pinpoint the two faulty GIDs without inspecting all the possible paths. Therefore, we would like to include this information in the error trace to speed up error tracing.
Tested the performance with
all_reduce_perf -b 8 -e 256M -f 2 -g 2
before and after the change, there is no impact to the performance:Before:
After: