[SYCL][CUDA] Fix context setup for device infos #8124
Conversation
Thank you for explaining the cause!
Thanks! Works on multi-gpu node
Force-pushed from 56f718f to 442783a
Extend the `ScopedContext` to work with just a device; in that case it will simply use the primary context. This is helpful for entry points that only have a `pi_device` and no `pi_context` but that still need some CUDA calls requiring an active context, such as the device info queries. This addresses a bug where querying the amount of free memory before creating any queues or contexts would simply crash. This was partially solved in a previous PR; however, that PR released the primary context while leaving it active on the current thread, so getting the device info twice in a row would crash again since it would reuse the active but released primary context.
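A minimal sketch of that idea in terms of the plain CUDA driver API (the class shape and member names are illustrative assumptions, not the actual plugin implementation):

```cpp
// Sketch only: shows how a device-only ScopedContext can retain the CUDA
// primary context, make it current for the duration of the call, then restore
// the previous context and release the primary context again.
#include <cuda.h>

class ScopedContext {
  CUdevice device_;
  CUcontext previous_ = nullptr;
  CUcontext primary_ = nullptr;

public:
  // Device-only variant used by entry points that have no pi_context.
  explicit ScopedContext(CUdevice device) : device_(device) {
    cuCtxGetCurrent(&previous_);
    // Retain the device's primary context and make it current.
    cuDevicePrimaryCtxRetain(&primary_, device_);
    if (previous_ != primary_)
      cuCtxSetCurrent(primary_);
  }

  ~ScopedContext() {
    // Restore whatever context was current before, so a released primary
    // context is never left active on this thread.
    if (previous_ != primary_)
      cuCtxSetCurrent(previous_);
    cuDevicePrimaryCtxRelease(device_);
  }
};
```

Restoring the previously current context in the destructor is what avoids the second crash described above: the thread is never left pointing at a primary context whose reference was already released.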
Force-pushed from 442783a to e7e0165
@intel/llvm-reviewers-cuda does anyone know what is going on with the CUDA job on this PR? It keeps failing, but in the logs it looks like it doesn't run at all; there's nothing past the checkout step.
I saw the same happen in another PR recently, so it may just be the CI being temperamental. Did your recent rebase use the current intel/llvm sycl branch?
Yeah, I rebased it yesterday and again this morning, still no luck.
https://github.com/intel/llvm/actions/runs/4053625730
I don't see any problems with this job in other pre-commit runs launched within the last 14 hours.
Force-pushed from 3864e04 to 1744d31
After testing it, you are right: it seems this patch is causing issues I had missed, which end up making the job time out. I will look into it further, thanks!
Force-pushed from 061e2fe to 1744d31
@npmiller I was wondering if this test would be helpful to include in your PR as well. I have had it in one of my branches for other functionality.
Yes! We definitely need more testing for this, thank you! So I actually know what the issue is: in this patch I originally moved the […] I've updated this to only use the […], which seems to work but is clearly not ideal, so I'm currently evaluating a re-work of the way we handle contexts in the plugin; it's overly complicated, likely inefficient, and hacky in some parts.
This patch moves the CUDA context from the PI context to the PI device, and switches to always using the primary context.

CUDA contexts are different from SYCL contexts in that they're tied to a single device, and they are required to be active on a thread for most calls to the CUDA driver API.

As shown in #8124 and #7526, the current mapping of CUDA context to PI context causes issues for device-based entry points that still need to call the CUDA APIs. We have workarounds to solve that, but they're a bit hacky, inefficient, and have a lot of edge case issues.

The peer to peer interface proposal in #6104 is also device based, but enabling peer to peer for CUDA is done on the CUDA contexts, so the current mapping would make it difficult to implement.

So this patch solves most of these issues by decoupling the CUDA context from the SYCL context and simply managing the CUDA contexts in the devices; it also changes the CUDA context management to always use the primary context.

This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device-based entry points
* Likely more efficient in the general case, with fewer opportunities to accidentally cause costly CUDA context switches
* Easier and likely more efficient interactions with CUDA runtime applications
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context

It does have a few drawbacks compared to the previous approach:

* Drops support for `make_context` interop, as there is no sensible "native handle" to pass in (`get_native` is still supported fine)
* No opportunity for users to separate their work into different CUDA contexts. It's unclear if there's any actual use case for this; it seems very uncommon in CUDA codebases to have multiple CUDA contexts for a single CUDA device in the same process

So overall I believe this should be a net benefit in general, and we could revisit if we run into an edge case that would need more fine-grained CUDA context management.
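To illustrate the direction (not the actual patch), a PI device that owns its primary context for its whole lifetime might look roughly like the sketch below; the `_pi_device` shape and member names are illustrative assumptions:

```cpp
#include <cuda.h>

// Illustrative sketch: the PI device retains the CUDA primary context once at
// creation and keeps it until destruction, so any device-based entry point has
// a context it can make current without needing a pi_context.
struct _pi_device {
  CUdevice cuDevice_;
  CUcontext cuContext_ = nullptr; // the device's primary context

  explicit _pi_device(CUdevice dev) : cuDevice_(dev) {
    // Retaining the primary context is the interop-friendly usage recommended
    // by Nvidia for libraries that may coexist with CUDA runtime code.
    cuDevicePrimaryCtxRetain(&cuContext_, cuDevice_);
  }

  ~_pi_device() {
    // Drop our reference; the driver deactivates the primary context once all
    // references have been released.
    cuDevicePrimaryCtxRelease(cuDevice_);
  }

  CUcontext getNativeContext() const { return cuContext_; }
};
```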
Extend the `ScopedContext` to work with just a device; in that case it will simply use the primary context.

This is helpful for entry points that only have a `pi_device` and no `pi_context` but that still need some CUDA calls that require an active context, such as for the device infos.

This addresses a bug where getting the amount of free memory before creating any queues or context would simply crash.

This was partially solved in a previous PR (#7906); however, the previous PR was releasing the primary context but leaving it active on the current thread, so getting the device info twice in a row would end up crashing again since it would just use the active but released primary context.
This should address: #8117
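For reference, a minimal reproducer along the lines of #8117 could look like the sketch below; it assumes the intel/llvm `sycl_ext_intel_device_info` free-memory query, and queries twice to exercise the released-but-still-active primary context path described above:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // No queue or SYCL context is created before the query, so the plugin only
  // has a pi_device when it needs to call into the CUDA driver API.
  sycl::device dev{sycl::gpu_selector_v};

  for (int i = 0; i < 2; ++i) {
    if (dev.has(sycl::aspect::ext_intel_free_memory)) {
      auto freeMem =
          dev.get_info<sycl::ext::intel::info::device::free_memory>();
      std::cout << "Free device memory: " << freeMem << " bytes\n";
    }
  }
  return 0;
}
```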