Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Using pytest --forked when CUDA is not generally fork-safe #3557

Closed
sangstar opened this issue Mar 21, 2024 · 5 comments · Fixed by #3631
Closed

[Bug]: Using pytest --forked when CUDA is not generally fork-safe #3557

sangstar opened this issue Mar 21, 2024 · 5 comments · Fixed by #3631
Labels
bug Something isn't working

Comments

@sangstar
Copy link
Contributor

sangstar commented Mar 21, 2024

Your current environment

The environment vLLM uses to CI test a PR.

🐛 Describe the bug

Multiple tests for my pull request are failing due to pytest --forked trying to fork new child processes during testing, trying to re-initialize the CUDA context. I'm not sure how other PRs are passing these tests, but I don't know why pytest --forked is explicitly being used when CUDA isn't fork-safe.

This is fine as long as conftest.py doesn't ever initialize CUDA before forking, but this may be difficult to maintain and may be a cause of some sneaky bugs, like from #3487

@sangstar sangstar added the bug Something isn't working label Mar 21, 2024
@sangstar sangstar changed the title [Bug]: Using pytest --forked for tests that use CUDA [Bug]: Using pytest --forked for tests re-initializing CUDA contexts Mar 21, 2024
@sangstar sangstar changed the title [Bug]: Using pytest --forked for tests re-initializing CUDA contexts [Bug]: Using pytest --forked when CUDA is not generally fork-safe Mar 21, 2024
@rkooo567
Copy link
Collaborator

Yeah +1 fixing this issue. @sangstar are you familiar with how --forked works? When does it actually fork a new process for running a test?

@sangstar
Copy link
Contributor Author

sangstar commented Mar 22, 2024

pytest-forked, which allows --forked to be invoked as a CLI arg for pytest, runs each test as a child process of its parent. All child processes forked from it will inherit its memory space, including its CUDA context if it created it before forking. If the subprocess then tries to invoke CUDA, it may still try to create a new CUDA context (mostly due to CUDA contexts being weird about being copied over, as they aren't meant to be), which can lead to inconsistencies or undefined behavior, which is why torch will throw a Runtime Error. CUDA contexts are never supposed to be copied over from one process to another.

For a forked process to use CUDA, its parent process must not have invoked CUDA before, which can be a bit tricky to manage, as this can occur during module imports and no test setups in conftest.py may use it.

@rkooo567
Copy link
Collaborator

Hmm so I guess there are some ideas

  1. we don't use fork at all (in this case, we should ensure cuda related contexts are cleaned up between all tests)
  2. we use fork for all tests (but conftest can still be used) -> not working
  3. we use spawn instead of fork (seems difficult because it is not supported by pytest) -> not working

So basically 1 is the only viable solution? I also wonder how other ML projects handle these issues. Do you happen to know?

@sangstar
Copy link
Contributor Author

sangstar commented Mar 26, 2024

Cleanup fixtures would be a good option to explore. Forking avoids needing cleanup fixtures but it will cause cryptic errors any time a CUDA context is initiated before forking and invoking a test that uses CUDA. Without forking and with a cleanup fixture, each unit test can initiate a CUDA context and then clean it up before the next test.

@rkooo567
Copy link
Collaborator

https://github.com/vllm-project/vllm/pull/3631/files#diff-2fe466060a88bb6a57175df8ca7175849db82a2cf2ba082295d481ab57e58868

It is in progress @sangstar and the result seems pretty promising (all model tests pass except 5 of them)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants