Torchserve 0.8.1: ONNX GPU models not working #2425
The logic for whether to use the ONNX CUDA execution provider is controlled here: https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py#L80-L84. Our tests were passing on both CPU and GPU, so I don't suspect a bug in how this is set.

Bug 1: What's confusing me about your setup is that you're using both the ONNX GPU runtime and a CPU build of torch? It seems to me like what's happening in your handler is

Bug 2: Are you using the same dependencies for onnx and onnxruntime when measuring the extra memory overhead? torch 2.0 in general has far more dependencies, but the overhead you're seeing is significant.

I suspect you should be able to reproduce your errors without torchserve in the loop, which will make debugging this a bit easier. Let me know if this all makes sense.
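The provider selection referenced above amounts to choosing an execution-provider list depending on whether the model is mapped to a GPU and whether the CUDA provider is actually available. This is a minimal sketch of that idea, not the actual TorchServe code; the function name and signature are illustrative:

```python
def select_providers(map_location: str, available: list) -> list:
    """Pick ONNX Runtime execution providers.

    Prefer CUDA when the model is mapped to a GPU and the CUDA provider
    is available in this onnxruntime build; otherwise fall back to CPU.
    """
    if map_location == "cuda" and "CUDAExecutionProvider" in available:
        # Keep the CPU provider in the list as a runtime fallback.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

# On a GPU box with onnxruntime-gpu correctly installed:
print(select_providers("cuda", ["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

If `onnxruntime` (CPU-only) is installed instead of `onnxruntime-gpu`, `available` never contains `CUDAExecutionProvider`, so the selection silently degrades to CPU.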
Hello @msaroufim, following your comments I simplified the requirements and the Dockerfile to make sure there's nothing wrong with my setup.

Bug 1: I don't understand why I should set the

Summary: With the updated setup I still have the error (Failed to create CUDAExecutionProvider). My test setup is available here: https://github.com/dt-subaandh-krishnakumar/pytorch_issue. I attached the logs, Dockerfiles, and GPU info. This issue occurs with all ONNX models (this is the one used for testing: https://huggingface.co/docs/transformers/serialization).

Possible solution: During my test I observed that the following libraries are missing in the torchserve 0.8.1 docker image.

Bug 2: I believe that if Bug 1 is fixed this won't be a problem, as I won't need to install torch 2.0, which has a lot of other dependencies. Please let me know if you need any more information.
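One way to check for missing shared libraries like the ones mentioned above, without TorchServe in the loop, is to scan the `ldconfig -p` output for the CUDA libraries that `onnxruntime-gpu` loads at runtime. This is a hedged sketch; the exact set of required library names varies by onnxruntime version:

```python
import subprocess

# CUDA libraries onnxruntime-gpu typically loads at runtime;
# adjust this list for your onnxruntime version.
REQUIRED_LIBS = ["libcublas.so", "libcudnn.so", "libcufft.so", "libcurand.so"]

def missing_cuda_libs(ldconfig_output: str, required=REQUIRED_LIBS) -> list:
    """Return the required libraries that do not appear in `ldconfig -p` output."""
    return [lib for lib in required if lib not in ldconfig_output]

if __name__ == "__main__":
    try:
        out = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
        print("missing:", missing_cuda_libs(out))
    except FileNotFoundError:
        print("ldconfig not available on this system")
```

An empty `missing` list does not guarantee the versions match what onnxruntime expects, but a non-empty one explains a `Failed to create CUDAExecutionProvider` warning immediately.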
This is interesting, tagging @agunapal since a similar issue came up with DeepSpeed; I was not aware ONNX depends on all of this. In that case, you can check whether your issue goes away if you build a new docker image with the NVIDIA runtime, like so: https://github.com/pytorch/serve/blob/master/docker/Dockerfile#L6C3-L6C143
There were some errors with https://github.com/pytorch/serve/blob/master/docker/Dockerfile#L6C3-L6C143. In my case, this worked, but I have a new issue: the metrics API returns nothing. Any idea why?
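To see whether the metrics endpoint is genuinely empty or just returning comment lines, you can fetch it and parse the Prometheus-style text. A sketch assuming TorchServe's default metrics port 8082; the parser only handles simple `name value` lines:

```python
import urllib.request

def parse_metrics(text: str) -> dict:
    """Parse Prometheus-format metrics text into {name: value}.

    Skips blank lines and # comments (HELP/TYPE); assumes each sample
    line ends in a single numeric value separated by a space.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

if __name__ == "__main__":
    try:
        body = urllib.request.urlopen("http://localhost:8082/metrics", timeout=5).read().decode()
        print(parse_metrics(body) or "metrics API returned nothing")
    except OSError as exc:
        print("could not reach metrics endpoint:", exc)
```

An empty dict from a 200 response points at the metrics configuration inside TorchServe, while a connection error points at the port/driver setup instead.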
Seems like an NVIDIA driver issue; see NVIDIA/k8s-device-plugin#331 for an example. Try updating this line https://github.com/pytorch/serve/blob/master/docker/build_image.sh#L46 to
I updated
Thanks @dt-subaandh-krishnakumar. I believe that sounds like a separate issue, tagging @namannandan who owns this. It might make sense to open a separate issue for this, though, so we don't lose it.
Fixed in this PR: #2435
I faced this issue; you need to install CUDA 11.8 and the corresponding torch version built with CUDA 11.8.
🐛 Describe the bug
I recently updated the torchserve version from 0.7.1-gpu to 0.8.1-gpu.
Current setup
I built `torchserve:0.7.1-gpu` from source into a docker image with `torch2.0+cpu`. The ONNX GPU models were running, using ~8.5GB memory and ~4GB GPU memory (CUDA 11.7).
Bug
With `torchserve 0.8.1`, the `torch 2.0+cpu` setup no longer worked and failed with the error below. Installing `torch2.0` with GPU dependencies instead avoids the error, but doing so increased memory (~13GB) and GPU (~6GB) consumption. The models were not updated. I built torchserve 0.8.1 with `./build_image.sh -py 3.8 -cv cu117 -g -t torchserve_py38`.
Error logs
[W:onnxruntime:Default, onnxruntime_pybind_state.cc:578 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
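The warning above can usually be reproduced without TorchServe by asking ONNX Runtime directly which providers it reports. A minimal check, guarded so it also runs where onnxruntime is not installed:

```python
def cuda_provider_available(providers) -> bool:
    """True if the CUDA execution provider appears in the reported list."""
    return "CUDAExecutionProvider" in providers

try:
    import onnxruntime as ort

    available = ort.get_available_providers()
    print("available providers:", available)
    print("CUDA provider listed:", cuda_provider_available(available))
    # Note: the provider can be listed but still fail at session creation
    # (as in the warning above) if the CUDA/cuDNN shared libraries are
    # missing; creating an InferenceSession with your model is the
    # definitive test.
except ImportError:
    print("onnxruntime not installed")
```

If `CUDAExecutionProvider` is absent from the list, the CPU-only `onnxruntime` package is installed instead of `onnxruntime-gpu`; if it is listed but session creation still warns, the CUDA libraries in the image are the problem.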
Installation instructions
Yes, I ran `./build_image.sh -py 3.8 -cv cu117 -g -t torchserve_py38`
Model Packaging
I converted PyTorch models to ONNX and served them using a custom handler.
config.properties
No response
Versions
Working
Bug
Repro instructions
Issue 1
1. Use `python3.8`, `onnxruntime-gpu==1.13.1`, `torchserve0.7.1-gpu` and `torch2.0.0+cpu`. (Take note of GPU and memory consumption.)
2. Build with `./build_image.sh -py 3.8 -cv cu117 -g -t torchserve_py38` and run the same model with `onnxruntime-gpu==1.13.1` and `torch2.0.0+cpu`.
Issue 2
If `torch2.0` is installed instead of `torch 2.0+cpu`, the memory and GPU consumption will increase.
Possible Solution
No response
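The memory increase described in Issue 2 is consistent with the CUDA-enabled torch 2.0 wheel pulling in its own `nvidia-*` pip packages (cuBLAS, cuDNN, and friends), which load alongside whatever the base image already ships. A sketch that flags those packages in a `pip freeze` listing; the example package names illustrate typical dependencies of the torch 2.0 CUDA wheel:

```python
def cuda_runtime_packages(freeze_lines) -> list:
    """Return pip packages that ship CUDA runtime libraries
    (the nvidia-* wheels that torch's CUDA builds depend on)."""
    return sorted(
        line.split("==")[0]
        for line in freeze_lines
        if line.startswith("nvidia-")
    )

freeze = [
    "numpy==1.24.1",
    "nvidia-cublas-cu11==11.10.3.66",
    "nvidia-cudnn-cu11==8.5.0.96",
    "torch==2.0.0",
]
print(cuda_runtime_packages(freeze))
```

Comparing this list between the `torch2.0+cpu` image and the `torch2.0` image shows concretely which extra CUDA runtimes account for the footprint difference.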