Crashing when saving checkpoint on multi-GPU finetuning #735

Open
emilychen555 opened this issue Mar 7, 2025 · 3 comments

emilychen555 commented Mar 7, 2025

When finetuning on 4 or 8 GPUs (A100s), the run crashes while the checkpoint is being saved, with the error: `Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2161, OpType=ALLREDUCE, NumelIn=99876864, NumelOut=99876864, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.`

Finetuning on 1 A100 is okay.

For reference, these are the lines I changed to run on 4 GPUs:

accelerate_config_machine_single.yaml:
  gpu_ids: "0, 1, 2, 3"
  num_processes: 4

finetune_single_rank.sh:
  export CUDA_VISIBLE_DEVICES=0,1,2,3
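
Since the error shows an ALLREDUCE hitting the default 600 s NCCL timeout while the checkpoint is being written, one workaround I may try is raising that timeout. Below is a minimal sketch, assuming the finetuning script constructs its own `Accelerator`; the two-hour value is only illustrative:

```python
# Sketch: raise the NCCL collective timeout so ranks that wait while rank 0
# writes a large checkpoint are not killed by the watchdog after 600 s.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Two hours is arbitrary; pick something comfortably above the observed save time.
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))

accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])
```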

zRzRzRzRzRzRzR self-assigned this Mar 7, 2025

zRzRzRzRzRzRzR (Member) commented Mar 7, 2025

It seems there might be an issue with cluster communication. Could you try saving after each training step (for debugging, just set the checkpoint interval to 1 step)? It feels like an intermittent problem.
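
For reference, here is a rough reproduction sketch of what I mean: a dummy model that runs a single optimizer step on each rank and then saves immediately with Accelerate's `save_state`. This is not the project's training script; the file name and output directory are placeholders.

```python
# Sketch: run one optimizer step on each rank, then immediately save a
# checkpoint with Accelerate, to reproduce the save-time hang quickly.
# Launch with:  accelerate launch --num_processes 4 save_debug.py
import torch
from accelerate import Accelerator


def main():
    accelerator = Accelerator()
    model = torch.nn.Linear(1024, 1024)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    model, optimizer = accelerator.prepare(model, optimizer)

    # One dummy step so optimizer state exists before saving.
    x = torch.randn(8, 1024, device=accelerator.device)
    loss = model(x).sum()
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

    accelerator.wait_for_everyone()             # sync ranks before saving
    accelerator.save_state("debug_checkpoint")  # placeholder output dir


if __name__ == "__main__":
    main()
```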

emilychen555 (Author) commented

I just tried -- it still crashes when saving the checkpoint.

zRzRzRzRzRzRzR (Member) commented

Could you share your CUDA environment? We don't seem to be able to reproduce this.
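
Something like the snippet below, or the output of `python -m torch.utils.collect_env`, would be enough:

```python
# Quick environment report for this issue: PyTorch, CUDA, cuDNN, NCCL, and GPUs.
import torch

print("torch  :", torch.__version__)
print("CUDA   :", torch.version.cuda)
print("cuDNN  :", torch.backends.cudnn.version())
print("NCCL   :", ".".join(map(str, torch.cuda.nccl.version())))
for i in range(torch.cuda.device_count()):
    print(f"GPU {i} :", torch.cuda.get_device_name(i))
```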
