Crashing when saving checkpoint on multi-GPU finetuning #735

Open
emilychen555 opened this issue Mar 7, 2025 · 3 comments

emilychen555 commented Mar 7, 2025

When finetuning on 4 or 8 GPUs (A100s), the run crashes while the checkpoint is being saved, with the error: `Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2161, OpType=ALLREDUCE, NumelIn=99876864, NumelOut=99876864, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.`

Finetuning on 1 A100 is okay.

For reference, these are the lines I changed to run on 4 GPUs:

accelerate_config_machine_single.yaml:
  gpu_ids: "0, 1, 2, 3"
  num_processes: 4

finetune_single_rank.sh:
  export CUDA_VISIBLE_DEVICES=0,1,2,3
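
Since the error shows an ALLREDUCE hitting the default 600 s NCCL timeout while the checkpoint is being written, one workaround I may try is raising that timeout. Below is a minimal sketch, assuming the finetuning script constructs its own `Accelerator`; the two-hour value is only illustrative:

```python
# Sketch: raise the NCCL collective timeout so ranks that wait while rank 0
# writes a large checkpoint are not killed by the watchdog after 600 s.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Two hours is arbitrary; pick something comfortably above the observed save time.
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))

accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])
```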

zRzRzRzRzRzRzR self-assigned this Mar 7, 2025

zRzRzRzRzRzRzR (Member) commented Mar 7, 2025

It seems there might be an issue with cluster communication. Could you try saving after each training step (for debugging, just set the checkpoint interval to 1 step)? It feels like an intermittent problem.
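
For reference, here is a rough reproduction sketch of what I mean: a dummy model that runs a single optimizer step on each rank and then saves immediately with Accelerate's `save_state`. This is not the project's training script; the file name and output directory are placeholders.

```python
# Sketch: run one optimizer step on each rank, then immediately save a
# checkpoint with Accelerate, to reproduce the save-time hang quickly.
# Launch with:  accelerate launch --num_processes 4 save_debug.py
import torch
from accelerate import Accelerator


def main():
    accelerator = Accelerator()
    model = torch.nn.Linear(1024, 1024)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    model, optimizer = accelerator.prepare(model, optimizer)

    # One dummy step so optimizer state exists before saving.
    x = torch.randn(8, 1024, device=accelerator.device)
    loss = model(x).sum()
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

    accelerator.wait_for_everyone()             # sync ranks before saving
    accelerator.save_state("debug_checkpoint")  # placeholder output dir


if __name__ == "__main__":
    main()
```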

emilychen555 (Author) commented

I just tried -- it still crashes when saving the checkpoint.

zRzRzRzRzRzRzR (Member) commented

Could you share your CUDA environment? We don't seem to be able to reproduce this.
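
Something like the snippet below, or the output of `python -m torch.utils.collect_env`, would be enough:

```python
# Quick environment report for this issue: PyTorch, CUDA, cuDNN, NCCL, and GPUs.
import torch

print("torch  :", torch.__version__)
print("CUDA   :", torch.version.cuda)
print("cuDNN  :", torch.backends.cudnn.version())
print("NCCL   :", ".".join(map(str, torch.cuda.nccl.version())))
for i in range(torch.cuda.device_count()):
    print(f"GPU {i} :", torch.cuda.get_device_name(i))
```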
