Based on experience at scale, we need to add preflight NCCL and GEMM checks to our multi-node Slurm script.
The idea is that the NCCL check verifies the network, and the GEMM check verifies that there are no XID errors or other random hardware issues that would produce NaNs.
This is based on feedback from another person who manages a large cluster, as a way to avoid launching large-scale runs only to have them time out.
(If the job hangs at the start, the timeout reset we do at iter 1-2 doesn't help, because the run never gets there.)
I saw this a lot this weekend due to slow nodes, and it consumed a ton of time.
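
As a rough sketch of the fail-fast part (not the actual checks), something like the following could wrap each preflight step in a hard timeout so that a hang at startup is caught before the real run launches. The names `run_preflight` and `run_all` are hypothetical, and the NCCL and GEMM commands themselves are placeholders supplied by the caller; only the Python standard library is assumed.

```python
import subprocess
import sys


def run_preflight(name, cmd, timeout_s):
    """Run one preflight command; True only if it exits 0 within timeout_s seconds."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True, text=True)
    except subprocess.TimeoutExpired:
        print(f"[preflight] {name}: timed out after {timeout_s}s", file=sys.stderr)
        return False
    except OSError as exc:
        # For example, the command does not exist on this node.
        print(f"[preflight] {name}: could not start ({exc})", file=sys.stderr)
        return False
    if result.returncode != 0:
        print(f"[preflight] {name}: exit code {result.returncode}", file=sys.stderr)
        return False
    return True


def run_all(checks, timeout_s):
    """Run (name, cmd) checks in order; return the first failing name, or None."""
    for name, cmd in checks:
        if not run_preflight(name, cmd, timeout_s):
            return name
    return None
```

The Slurm script would call `run_all` with the NCCL check first and the GEMM check second, and exit non-zero if it returns a name, so the job fails within the preflight timeout instead of waiting on the training timeout.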