[Core][experimental] Failure detection doesn't happen properly when the worker fails with RuntimeError #42441

Closed
rkooo567 opened this issue Jan 17, 2024 · 2 comments
Labels
bug (Something that is supposed to be working; but isn't) · compiled-graphs · core (Issues that should be addressed in Ray Core) · core-worker · P1 (Issue that should be fixed within a few weeks) · size-small · stability

Comments

@rkooo567

What happened + What you expected to happen

In OSS vLLM, remove (comment out) the following lines:

        # for worker, (node_id, _) in zip(self.workers, worker_node_and_gpu_ids):
            # worker.set_cuda_visible_devices.remote(node_gpus[node_id])

The execute_method fails with

    Error Type: TASK_EXECUTION_EXCEPTION

    py", line 69, in init_model
        torch.cuda.set_device(self.device)
      File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/cuda/__init__.py", line 404, in set_device
        torch._C._cuda_setDevice(device)
    RuntimeError: CUDA error: invalid device ordinal
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

but the driver hangs.
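
For reference, the behavior I would expect is that the worker's RuntimeError reaches the driver instead of leaving it blocked. A minimal stdlib-only sketch of that expectation follows. It is an analogy, not Ray's API or vLLM's code: `init_model`, `run_on_worker`, and the `visible_devices` parameter are hypothetical stand-ins, and a thread pool plays the role of the worker.

```python
# Stdlib-only analogy (NOT Ray's API): a worker-side RuntimeError should
# surface on the driver instead of leaving it waiting forever.
from concurrent.futures import ThreadPoolExecutor


def init_model(device_index: int, visible_devices: int) -> str:
    """Hypothetical stand-in for the worker's init_model step."""
    if device_index >= visible_devices:
        # Same message as the RuntimeError in the traceback above.
        raise RuntimeError("CUDA error: invalid device ordinal")
    return f"initialized on device {device_index}"


def run_on_worker(device_index: int, visible_devices: int) -> str:
    """Driver side: submit the task and report the worker's outcome."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(init_model, device_index, visible_devices)
        try:
            # result() re-raises the exception raised inside the worker.
            return future.result()
        except RuntimeError as exc:
            return f"worker failed: {exc}"


if __name__ == "__main__":
    print(run_on_worker(0, 1))  # device 0 is visible, so this succeeds
    print(run_on_worker(1, 1))  # device 1 is not visible, so this fails
```

In this sketch the failure comes back to the caller as soon as the worker raises, which is what the driver here does not do.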

Versions / Dependencies

master

Reproduction script

vllm-project/vllm#2462

Then comment out the code shown above.

Issue Severity

None

@anyscalesam

@rkooo567 is this bug still valid?

@stephanie-wang changed the title from "[Core/Pathway] Failure detection doesn't happen properly when the worker fails with RuntimeError" to "[Core][experimental] Failure detection doesn't happen properly when the worker fails with RuntimeError" on Mar 11, 2024
@rkooo567

Verified: I cannot reproduce this.
