Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix nccl future execution #126

Merged
merged 1 commit into from
Mar 10, 2025
Merged

fix nccl future execution #126

merged 1 commit into from
Mar 10, 2025

Conversation

H-Huang
Copy link
Member

@H-Huang H-Huang commented Mar 10, 2025

Async NCCL collectives dont block the CPU which means their futures will also get executed. This causes an issue in manager.py when we modify the allreduce tensor in a callback (https://github.com/pytorch/torchft/blob/main/torchft/manager.py#L292) as the tensor may not have finished allreduce.

The fix is to add an event after the allreduce has been wait()ed and then wait on this event before setting the future.

Created a test to repro:

pytest torchft/manager_integ_test.py -vsk test_manager_allreduce

@H-Huang H-Huang marked this pull request as draft March 10, 2025 17:06
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 10, 2025
@H-Huang H-Huang requested a review from d4l3k March 10, 2025 18:07
@H-Huang H-Huang force-pushed the diloco branch 3 times, most recently from 7753bbb to b23a598 Compare March 10, 2025 20:53
@H-Huang H-Huang marked this pull request as ready for review March 10, 2025 21:06
@H-Huang H-Huang changed the title [WIP] fix nccl future execution fix nccl future execution Mar 10, 2025
@H-Huang H-Huang requested a review from fegin March 10, 2025 21:08
@H-Huang H-Huang force-pushed the diloco branch 2 times, most recently from 70d648a to fa61e1d Compare March 10, 2025 21:13
Copy link
Member

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for fixing this!

@H-Huang H-Huang merged commit 8f021e1 into pytorch:main Mar 10, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants