fix nccl future execution #126

H-Huang · 2025-03-10T17:06:48Z

Async NCCL collectives don't block the CPU, which means callbacks on their futures can run before the collective has actually completed. This causes an issue in manager.py when we modify the allreduce tensor in a callback (https://github.com/pytorch/torchft/blob/main/torchft/manager.py#L292), as the tensor may not have finished its allreduce.

The fix is to add an event after the allreduce has been wait()ed and then wait on this event before setting the future.
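
To make the ordering concrete, here is a minimal sketch of the race and the fix using only the Python standard library. It is an analogy, not the torchft code: `FakeAsyncAllreduce`, `allreduce_avg`, and the `peer` list are hypothetical names, a background thread stands in for the GPU, and a `threading.Event` stands in for the event recorded after `wait()`.

```python
import threading
import time
from concurrent.futures import Future


class FakeAsyncAllreduce:
    """Hypothetical stand-in for an async collective (not a torchft or NCCL API)."""

    def __init__(self, tensor, peer):
        # Plays the role of the event recorded after wait() in the fix.
        self.done = threading.Event()
        self._thread = threading.Thread(target=self._run, args=(tensor, peer))
        self._thread.start()

    def _run(self, tensor, peer):
        time.sleep(0.05)  # simulated device latency
        for i, value in enumerate(peer):
            tensor[i] += value  # in-place sum, as an allreduce would do
        self.done.set()

    def wait(self):
        # Mimics the behaviour described above: returns without blocking
        # the CPU until the reduction has finished.
        return True

    def join(self):
        self._thread.join()


def allreduce_avg(tensor, peer, wait_for_event):
    """Sum `peer` into `tensor` asynchronously, then average in a future callback."""
    work = FakeAsyncAllreduce(tensor, peer)
    work.wait()

    fut = Future()

    def normalize(f):
        # Callback that modifies the allreduce tensor in place.
        t = f.result()
        for i in range(len(t)):
            t[i] /= 2

    fut.add_done_callback(normalize)

    if wait_for_event:
        work.done.wait()  # the fix: block until the reduction has really finished
    fut.set_result(tensor)  # runs normalize() immediately on this thread

    work.join()
    return tensor


averaged = allreduce_avg([2.0, 4.0], [4.0, 8.0], wait_for_event=True)
```

With `wait_for_event=False`, the callback divides the tensor before the simulated reduction has landed, so the result is generally not the average. Waiting on the event before setting the future removes that ordering hazard.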

Created a test to repro:

pytest torchft/manager_integ_test.py -vsk test_manager_allreduce

d4l3k

LGTM thanks for fixing this!

H-Huang marked this pull request as draft March 10, 2025 17:06

facebook-github-bot added the CLA Signed label Mar 10, 2025

H-Huang requested a review from d4l3k March 10, 2025 18:07

H-Huang force-pushed the diloco branch 3 times, most recently from 7753bbb to b23a598 Compare March 10, 2025 20:53

H-Huang marked this pull request as ready for review March 10, 2025 21:06

H-Huang changed the title ~~[WIP] fix nccl future execution~~ fix nccl future execution Mar 10, 2025

H-Huang requested a review from fegin March 10, 2025 21:08

H-Huang force-pushed the diloco branch 2 times, most recently from 70d648a to fa61e1d Compare March 10, 2025 21:13

Fix nccl future execution

f0002a3

H-Huang force-pushed the diloco branch from fa61e1d to f0002a3 Compare March 10, 2025 21:16

d4l3k approved these changes Mar 10, 2025

H-Huang merged commit 8f021e1 into pytorch:main Mar 10, 2025
6 checks passed
