Check which NCCL version your PyTorch build uses. Currently only NCCL 2.25.1 supports CUDA 12.8, so it is the only version that supports B200. I was able to use the 25.01 PyTorch NGC container with torchtitan on B200, albeit with an older version of torchtitan.
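A quick way to check this from Python is sketched below. It reads the CUDA and NCCL versions that the installed PyTorch build reports and compares NCCL against 2.25.1. The helper names `nccl_meets_minimum` and `report` are made up for illustration, and treating releases newer than 2.25.1 as acceptable is an assumption on my part, not something confirmed in this thread.

```python
# Sketch: report the CUDA/NCCL versions bundled with the installed PyTorch.
# Helper names are hypothetical; the ">= 2.25.1" threshold is an assumption.

def nccl_meets_minimum(found, minimum=(2, 25, 1)):
    """Return True if the (major, minor, patch) tuple is at least `minimum`."""
    return tuple(found) >= tuple(minimum)


def report():
    """Return a one-line summary of the CUDA and NCCL versions PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    cuda = torch.version.cuda  # None on CPU-only builds
    try:
        nccl = torch.cuda.nccl.version()
    except Exception as exc:  # e.g. a build without NCCL support
        return f"CUDA {cuda}, NCCL unavailable ({type(exc).__name__})"

    if not isinstance(nccl, tuple):
        # Older PyTorch releases return a single integer; print it unparsed.
        return f"CUDA {cuda}, NCCL {nccl} (raw value, not compared)"

    status = "ok" if nccl_meets_minimum(nccl[:3]) else "older than 2.25.1"
    return f"CUDA {cuda}, NCCL {'.'.join(map(str, nccl))} ({status})"


if __name__ == "__main__":
    print(report())
```

If this reports an older NCCL, that would be consistent with multi-GPU runs failing while single-GPU runs (which need no collectives) still work.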
When I try to train the debug model in torchtitan on a single B200 node, training only works correctly when I limit it to a single GPU. Details: