When finetuning on 4 or 8 GPUs (A100s), I get the error 'Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2161, OpType=ALLREDUCE, NumelIn=99876864, NumelOut=99876864, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.' while the checkpoint is being saved.
Finetuning on 1 A100 works fine.
For reference, the lines that I changed to run on 4 GPUs are:
accelerate_config_machine_single.yaml:
gpu_ids: "0, 1, 2, 3"
num_processes: 4
finetune_single_rank.sh:
export CUDA_VISIBLE_DEVICES=0,1,2,3

It seems there might be an issue with cluster communication. Could you try saving after each training step (for debugging, just run a single training step)? It looks like an intermittent problem.
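A common workaround for this exact error is to raise the process-group timeout above the 10-minute default visible in the message (Timeout(ms)=600000), so a long checkpoint save on one rank no longer trips the watchdog while the other ranks wait in the next all-reduce. A minimal sketch, assuming the training script builds an HF Accelerate `Accelerator` (which the changed accelerate_config_machine_single.yaml suggests):

```python
from datetime import timedelta

# The watchdog fired at the default collective timeout, Timeout(ms)=600000
# (10 minutes). Raising it gives the saving rank time to finish writing the
# checkpoint. Two hours is an illustrative value, not a recommendation.
checkpoint_timeout = timedelta(hours=2)
print(int(checkpoint_timeout.total_seconds() * 1000))  # timeout in ms, vs. the default 600000

# In the training script (assumption: it uses HF Accelerate), the timeout is
# passed to the process group when the Accelerator is created:
#
#   from accelerate import Accelerator, InitProcessGroupKwargs
#
#   accelerator = Accelerator(
#       kwargs_handlers=[InitProcessGroupKwargs(timeout=checkpoint_timeout)]
#   )
```

This only changes how long NCCL waits; if checkpoint saving genuinely hangs rather than being slow, the timeout will still fire, just later.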
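To narrow down which rank stalls during the save, NCCL's own debug logging can help. These are standard NCCL environment variables; they could be added to finetune_single_rank.sh alongside the CUDA_VISIBLE_DEVICES export:

```shell
# Standard NCCL environment variables: print per-collective debug logs so the
# stalled ALLREDUCE, and the rank that never joined it, show up in the output.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
```

The logs are verbose, so this is best enabled only for the debugging run.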