Torchserve 0.8.1: ONNX GPU models not working #2425
The logic for whether to use the ONNX CUDA execution provider is controlled here: https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py#L80-L84. Our tests were passing on both CPU and GPU, so I don't suspect a bug in how this is set.

Bug 1: What's confusing me about your setup is that you're using both the ONNX GPU runtime and a CPU build of torch? It seems to me like what's happening in your handler is

Bug 2: Are you using the same dependencies for onnx and onnxruntime when measuring the extra memory overhead? torch 2.0 in general has far more dependencies, but the overhead you're seeing is significant.

I suspect you should be able to reproduce your errors without torchserve in the loop, which will make debugging this a bit easier. Let me know if this all makes sense.
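The provider selection referenced above amounts to choosing an execution-provider list depending on whether the model is mapped to a GPU and whether the CUDA provider is actually available. This is a minimal sketch of that idea, not the actual TorchServe code; the function name and signature are illustrative:

```python
def select_providers(map_location: str, available: list) -> list:
    """Pick ONNX Runtime execution providers.

    Prefer CUDA when the model is mapped to a GPU and the CUDA provider
    is available in this onnxruntime build; otherwise fall back to CPU.
    """
    if map_location == "cuda" and "CUDAExecutionProvider" in available:
        # Keep the CPU provider in the list as a runtime fallback.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

# On a GPU box with onnxruntime-gpu correctly installed:
print(select_providers("cuda", ["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

If `onnxruntime` (CPU-only) is installed instead of `onnxruntime-gpu`, `available` never contains `CUDAExecutionProvider`, so the selection silently degrades to CPU.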
Hello @msaroufim, following your comments I simplified the requirements and the Dockerfile to make sure there's nothing wrong with my setup.

Bug 1: I don't understand why I should set the

Summary: With the updated setup I still have the error (Failed to create CUDAExecutionProvider). My test setup is available here: https://github.com/dt-subaandh-krishnakumar/pytorch_issue. I attached the logs, Dockerfiles, and GPU info. This issue occurs with all ONNX models (this is the one used for testing: https://huggingface.co/docs/transformers/serialization).

Possible solution: During my test I observed that the following libraries are missing in the torchserve 0.8.1 docker image.

Bug 2: I believe that if Bug 1 is fixed this won't be a problem, as I won't need to install torch 2.0, which has a lot of other dependencies. Please let me know if you need any more information.
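One way to check for missing shared libraries like the ones mentioned above, without TorchServe in the loop, is to scan the `ldconfig -p` output for the CUDA libraries that `onnxruntime-gpu` loads at runtime. This is a hedged sketch; the exact set of required library names varies by onnxruntime version:

```python
import subprocess

# CUDA libraries onnxruntime-gpu typically loads at runtime;
# adjust this list for your onnxruntime version.
REQUIRED_LIBS = ["libcublas.so", "libcudnn.so", "libcufft.so", "libcurand.so"]

def missing_cuda_libs(ldconfig_output: str, required=REQUIRED_LIBS) -> list:
    """Return the required libraries that do not appear in `ldconfig -p` output."""
    return [lib for lib in required if lib not in ldconfig_output]

if __name__ == "__main__":
    try:
        out = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
        print("missing:", missing_cuda_libs(out))
    except FileNotFoundError:
        print("ldconfig not available on this system")
```

An empty `missing` list does not guarantee the versions match what onnxruntime expects, but a non-empty one explains a `Failed to create CUDAExecutionProvider` warning immediately.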
This is interesting, tagging @agunapal since a similar issue came up with DeepSpeed; I was not aware ONNX depends on all of this. In that case, you can check whether your issue goes away if you build a new docker image with the NVIDIA runtime, like so: https://github.com/pytorch/serve/blob/master/docker/Dockerfile#L6C3-L6C143
There were some errors with https://github.com/pytorch/serve/blob/master/docker/Dockerfile#L6C3-L6C143. In my case, this worked, but I have a new issue: the metrics API returns nothing. Any idea why?
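To see whether the metrics endpoint is genuinely empty or just returning comment lines, you can fetch it and parse the Prometheus-style text. A sketch assuming TorchServe's default metrics port 8082; the parser only handles simple `name value` lines:

```python
import urllib.request

def parse_metrics(text: str) -> dict:
    """Parse Prometheus-format metrics text into {name: value}.

    Skips blank lines and # comments (HELP/TYPE); assumes each sample
    line ends in a single numeric value separated by a space.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

if __name__ == "__main__":
    try:
        body = urllib.request.urlopen("http://localhost:8082/metrics", timeout=5).read().decode()
        print(parse_metrics(body) or "metrics API returned nothing")
    except OSError as exc:
        print("could not reach metrics endpoint:", exc)
```

An empty dict from a 200 response points at the metrics configuration inside TorchServe, while a connection error points at the port/driver setup instead.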
Seems like an NVIDIA driver issue; see NVIDIA/k8s-device-plugin#331 for an example. Try updating this line https://github.com/pytorch/serve/blob/master/docker/build_image.sh#L46 to
I updated
Thanks @dt-subaandh-krishnakumar. I believe that sounds like a separate issue, tagging @namannandan who owns this. It might make sense to open a separate issue for this, though, so we don't lose it.
Fixed in this PR: #2435
I faced this issue; you need to install CUDA 11.8 and the corresponding torch version built with CUDA 11.8.
🐛 Describe the bug
I recently updated the torchserve version from 0.7.1-gpu to 0.8.1-gpu.
Current setup
I built `torchserve:0.7.1-gpu` from source into a docker image with `torch2.0+cpu`. The ONNX GPU models were running, using ~8.5GB memory and ~4GB GPU memory (CUDA 11.7).
Bug
With `torchserve 0.8.1`, the `torch 2.0+cpu` setup no longer worked and failed with the error below. Installing `torch2.0` with GPU dependencies instead avoids the error, but doing so increased memory (~13GB) and GPU (~6GB) consumption. The models were not updated. I built torchserve 0.8.1 with `./build_image.sh -py 3.8 -cv cu117 -g -t torchserve_py38`.
Error logs
[W:onnxruntime:Default, onnxruntime_pybind_state.cc:578 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
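The warning above can usually be reproduced without TorchServe by asking ONNX Runtime directly which providers it reports. A minimal check, guarded so it also runs where onnxruntime is not installed:

```python
def cuda_provider_available(providers) -> bool:
    """True if the CUDA execution provider appears in the reported list."""
    return "CUDAExecutionProvider" in providers

try:
    import onnxruntime as ort

    available = ort.get_available_providers()
    print("available providers:", available)
    print("CUDA provider listed:", cuda_provider_available(available))
    # Note: the provider can be listed but still fail at session creation
    # (as in the warning above) if the CUDA/cuDNN shared libraries are
    # missing; creating an InferenceSession with your model is the
    # definitive test.
except ImportError:
    print("onnxruntime not installed")
```

If `CUDAExecutionProvider` is absent from the list, the CPU-only `onnxruntime` package is installed instead of `onnxruntime-gpu`; if it is listed but session creation still warns, the CUDA libraries in the image are the problem.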
Installation instructions
Yes, I ran `./build_image.sh -py 3.8 -cv cu117 -g -t torchserve_py38`
Model Packaging
I converted PyTorch models to ONNX and served them using a custom handler.
config.properties
No response
Versions
Working
Bug
Repro instructions
Issue 1
1. Use `python3.8`, `onnxruntime-gpu==1.13.1`, `torchserve0.7.1-gpu` and `torch2.0.0+cpu`. (Take note of GPU and memory consumption.)
2. Build with `./build_image.sh -py 3.8 -cv cu117 -g -t torchserve_py38` and run the same model with `onnxruntime-gpu==1.13.1` and `torch2.0.0+cpu`.
Issue 2
If `torch2.0` is installed instead of `torch 2.0+cpu`, the memory and GPU consumption will increase.
Possible Solution
No response
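The memory increase described in Issue 2 is consistent with the CUDA-enabled torch 2.0 wheel pulling in its own `nvidia-*` pip packages (cuBLAS, cuDNN, and friends), which load alongside whatever the base image already ships. A sketch that flags those packages in a `pip freeze` listing; the example package names illustrate typical dependencies of the torch 2.0 CUDA wheel:

```python
def cuda_runtime_packages(freeze_lines) -> list:
    """Return pip packages that ship CUDA runtime libraries
    (the nvidia-* wheels that torch's CUDA builds depend on)."""
    return sorted(
        line.split("==")[0]
        for line in freeze_lines
        if line.startswith("nvidia-")
    )

freeze = [
    "numpy==1.24.1",
    "nvidia-cublas-cu11==11.10.3.66",
    "nvidia-cudnn-cu11==8.5.0.96",
    "torch==2.0.0",
]
print(cuda_runtime_packages(freeze))
```

Comparing this list between the `torch2.0+cpu` image and the `torch2.0` image shows concretely which extra CUDA runtimes account for the footprint difference.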