
vllm development does not work for tensor-parallel > 1 #2619

Closed
lroberts7 opened this issue Jan 26, 2024 · 8 comments · Fixed by #2636


lroberts7 commented Jan 26, 2024

I have a local dev build at this commit:

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ git log -n 1
commit 5265631d15d59735152c8b72b38d960110987f10 (HEAD -> main, origin/main, origin/HEAD)
Author: Vladimir <[email protected]>
Date:   Fri Jan 26 08:48:17 2024 +0100

    use a correct device when creating OptionalCUDAGuard (#2583)

and I have some local code that is a thin wrapper around the LLM class.
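For context, the wrapper is roughly shaped like the minimal sketch below (illustrative only, not the actual app; the endpoint path and model path are taken from the logs further down, everything else is assumed):

```python
# quantized_flask_app.py -- minimal illustrative sketch of the wrapper, not the real app
from flask import Flask, jsonify, request
from vllm import LLM, SamplingParams

app = Flask(__name__)

# The engine is built once at startup; tensor_parallel_size is where the failure shows up.
llm = LLM(
    model="/home/lroberts/NexusRaven-13B-AWQ/",
    quantization="awq",
    dtype="float16",
    tensor_parallel_size=2,  # works with 1, fails with 2
)

@app.route("/sequence-generation/chat/json", methods=["POST"])
def chat():
    body = request.get_json()
    prompt = "\n".join(m["content"] for m in body["messages"])
    params = SamplingParams(temperature=body.get("temperature", 0.0),
                            max_tokens=body.get("max_tokens", 500))
    outputs = llm.generate([prompt], params)
    return jsonify({
        "choices": [{
            "index": 0,
            "finish_reason": "stop",
            "message": {"role": "assistant", "content": outputs[0].outputs[0].text},
        }]
    })
```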

If I run this with tensor-parallel == 2, I get the following:

roberts@GPU77B9:~/llm_quantization$ FLASK_APP=quantized_flask_app.py FLASK_ENV=debug python3.10 -m flask run 
 * Serving Flask app 'quantized_flask_app.py' (lazy loading)
 * Environment: debug
 * Debug mode: off
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
16384
INFO 2024-01-26 22:03:13,343 abc_etal.py:195 unknown_model_name:unknown_model_version
                             Hello! logging initialized, starting up... 
INFO 2024-01-26 22:03:13,343 abc_etal.py:196 unknown_model_name:unknown_model_version
                             Git commit of model: unknown_git_commit 
INFO 2024-01-26 22:03:13,343 abc_etal.py:197 unknown_model_name:unknown_model_version
                             Git commit of cuda torch base: unknown_git_commit 
INFO 2024-01-26 22:03:14,921 abc_etal.py:200 unknown_model_name:unknown_model_version
                             Compute device available: cuda 
WARNING 01-26 22:03:16 config.py:506] Casting torch.bfloat16 to torch.float16.
WARNING 01-26 22:03:16 config.py:176] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-26 22:03:18,650 ERROR services.py:1329 -- Failed to start the dashboard , return code 1
2024-01-26 22:03:18,650 ERROR services.py:1354 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2024-01-26 22:03:18,651 ERROR services.py:1398 -- 
The last 20 lines of /tmp/ray/session_2024-01-26_22-03-16_731996_3725694/logs/dashboard.log (it contains the error message from the dashboard): 
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 16, in <module>
    from ray.job_submission import JobStatus, JobSubmissionClient
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/job_submission/__init__.py", line 2, in <module>
    from ray.dashboard.modules.job.pydantic_models import DriverInfo, JobDetails, JobType
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/pydantic_models.py", line 4, in <module>
    from ray._private.pydantic_compat import BaseModel, Field, PYDANTIC_INSTALLED
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 100, in <module>
    monkeypatch_pydantic_2_for_cloudpickle()
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 58, in monkeypatch_pydantic_2_for_cloudpickle
    pydantic._internal._model_construction.SchemaSerializer = (
AttributeError: module 'pydantic._internal' has no attribute '_model_construction'
2024-01-26 22:03:18,879 INFO worker.py:1673 -- Started a local Ray instance.
[2024-01-26 22:03:19,820 E 3725694 3725694] core_worker.cc:205: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

However, tensor-parallel == 1 works fine:

lroberts@GPU77B9:~/llm_quantization$ FLASK_APP=quantized_flask_app.py FLASK_ENV=debug python3.10 -m flask run 
 * Serving Flask app 'quantized_flask_app.py' (lazy loading)
 * Environment: debug
 * Debug mode: off
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
16384
INFO 2024-01-26 22:04:03,519 abc_etal.py:195 unknown_model_name:unknown_model_version
                             Hello! logging initialized, starting up... 
INFO 2024-01-26 22:04:03,519 abc_etal.py:196 unknown_model_name:unknown_model_version
                             Git commit of model: unknown_git_commit 
INFO 2024-01-26 22:04:03,519 abc_etal.py:197 unknown_model_name:unknown_model_version
                             Git commit of cuda torch base: unknown_git_commit 
INFO 2024-01-26 22:04:05,098 abc_etal.py:200 unknown_model_name:unknown_model_version
                             Compute device available: cuda 
WARNING 01-26 22:04:06 config.py:506] Casting torch.bfloat16 to torch.float16.
WARNING 01-26 22:04:06 config.py:176] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 01-26 22:04:06 llm_engine.py:72] Initializing an LLM engine with config: model='/home/lroberts/NexusRaven-13B-AWQ/', tokenizer='/home/lroberts/NexusRaven-13B-AWQ/presaved_tokenizer', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-26 22:04:23 llm_engine.py:316] # GPU blocks: 4145, # CPU blocks: 327
INFO 01-26 22:04:27 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-26 22:04:27 model_runner.py:629] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-26 22:04:33 model_runner.py:689] Graph capturing finished in 6 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 2024-01-26 22:04:33,205 abc_etal.py:231 unknown_model_name:unknown_model_version
                             Startup completed! 
INFO 2024-01-26 22:04:33,207 _internal.py:224 unknown_model_name:unknown_model_version
                             WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:5000 
INFO 2024-01-26 22:04:33,207 _internal.py:224 unknown_model_name:unknown_model_version
                             Press CTRL+C to quit 
[OpenAIMessage(role='system', content='You are a helpful assistant.'), OpenAIMessage(role='user', content='Tell me a few reasons why someone might consider higher education. Do not repeat yourself. Response:  ')]
16384
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.11s/it]
INFO 2024-01-26 22:05:05,684 _internal.py:224 unknown_model_name:unknown_model_version
                             127.0.0.1 - - [26/Jan/2024 22:05:05] "POST /sequence-generation/chat/json HTTP/1.1" 200 - 
The message is a simple curl request that looks like this:

```bash
curl -v --trace-time -X POST -H "Content-Type: application/json" --data '{"max_tokens": 500, "messages": [{"content": "You are a helpful assistant.","role": "system"}, {"content": "Tell me a few reasons why someone might consider higher education. Do not repeat yourself. Response:  ","role": "user"}], "model": "gpt-3.5-turbo", "temperature": 0}' http://localhost:5000/sequence-generation/chat/json
```

with the response:

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"  There are many reasons why someone might consider higher education. Here are a few:\n\n1. To gain knowledge and skills: Higher education provides students with the opportunity to learn new knowledge and skills that can be applied in their future careers.\n2. To prepare for a career: Many people choose to pursue higher education because it is a way to prepare for a specific career. For example, a student may choose to study business because they want to work in the field.\n3. To gain a competitive edge: Higher education can provide students with a competitive edge in the job market. Many employers require a degree from a reputable institution, and having one can make a candidate more attractive to potential employers.\n4. To develop critical thinking and problem-solving skills: Higher education provides students with the opportunity to develop their critical thinking and problem-solving skills.\n5. To gain a sense of community: Higher education provides students with the opportunity to connect with other students and faculty members, which can help to create a sense of community.\n6. To gain a sense of purpose: Higher education can provide students with a sense of purpose and direction in life.\n7. To gain a sense of accomplishment: Higher education can provide students with a sense of accomplishment and pride in their achievements.\n8. To gain a sense of personal growth: Higher education can provide students with the opportunity to grow and develop as individuals.\n9. To gain a sense of independence: Higher education can provide students with the opportunity to become independent and self-sufficient.\n10. To gain a sense of fulfillment: Higher education can provide students with a sense of fulfillment and satisfaction in their lives.\n\nOverall, higher education can provide students with a wide range of benefits, including the opportunity to gain knowledge and skills, prepare for a career, gain a competitive edge, develop critical thinking and problem-solving skills, gain a sense of community, gain a sense of purpose, gain a sense of accomplishment, gain a sense of personal growth, gain a sense of independence, and gain a sense of fulfillment.","role":"assistant"}}],"created":1706306706,"id":"llama-2-7b-chat-hf","object":"chat.completion","usage":{"completion_tokens":457,"prompt_tokens":49,"total_tokens":506}}

The error in the Ray dashboard logs points to a pydantic serialization incompatibility:

2024-01-26 21:35:42,363 INFO utils.py:112 -- Get all modules by type: DashboardHeadModule
2024-01-26 21:35:42,407 INFO utils.py:123 -- Module ray.dashboard.modules.actor.actor_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,429 INFO utils.py:123 -- Module ray.dashboard.modules.event.event_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'grpc'
2024-01-26 21:35:42,430 INFO utils.py:123 -- Module ray.dashboard.modules.event.event_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,431 INFO utils.py:123 -- Module ray.dashboard.modules.healthz.healthz_agent cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,431 INFO utils.py:123 -- Module ray.dashboard.modules.healthz.healthz_head cannot be loaded because we cannot import all dependencies. Install this module using `pip install 'ray[default]'` for the full dashboard functionality. Error: No module named 'opencensus'
2024-01-26 21:35:42,450 ERROR dashboard.py:259 -- The dashboard on node GPU77B9 failed with the following error:
Traceback (most recent call last):
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/dashboard.py", line 248, in <module>
    loop.run_until_complete(dashboard.run())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/dashboard.py", line 75, in run
    await self.dashboard_head.run()
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/head.py", line 325, in run
    modules = self._load_modules(self._modules_to_load)
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/head.py", line 219, in _load_modules
    head_cls_list = dashboard_utils.get_all_modules(DashboardHeadModule)
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/utils.py", line 121, in get_all_modules
    importlib.import_module(name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 16, in <module>
    from ray.job_submission import JobStatus, JobSubmissionClient
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/job_submission/__init__.py", line 2, in <module>
    from ray.dashboard.modules.job.pydantic_models import DriverInfo, JobDetails, JobType
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/dashboard/modules/job/pydantic_models.py", line 4, in <module>
    from ray._private.pydantic_compat import BaseModel, Field, PYDANTIC_INSTALLED
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 100, in <module>
    monkeypatch_pydantic_2_for_cloudpickle()
  File "/home/lroberts/.local/lib/python3.10/site-packages/ray/_private/pydantic_compat.py", line 58, in monkeypatch_pydantic_2_for_cloudpickle
    pydantic._internal._model_construction.SchemaSerializer = (
AttributeError: module 'pydantic._internal' has no attribute '_model_construction'

Relevant details about my environment:

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import pydantic; print(pydantic.__version__)"
2.5.3
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import ray; print(ray.__version__)"
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
2.8.0
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import torch; print(torch.__version__)"
2.1.2+cu121

It seems there is a known fix or workaround here -> ray-project/ray#41913 (comment),

but it seems that pydantic version 2 is required for the OpenAI server:

pydantic >= 2.0 # Required for OpenAI server.

Is there a suggested workaround, or should I manually downgrade pydantic to a version lower than 2.0.0?

lroberts7 (Author) commented:

NVIDIA device info:

lroberts@GPU77B9:~/llm_quantization$ nvidia-smi 
Fri Jan 26 22:08:29 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           On | 00000000:07:00.0 Off |                    0 |
| N/A   34C    P0               74W / 400W|  62027MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB           On | 00000000:0A:00.0 Off |                    0 |
| N/A   31C    P0               65W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB           On | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0               64W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB           On | 00000000:4D:00.0 Off |                    0 |
| N/A   35C    P0               68W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB           On | 00000000:87:00.0 Off |                    0 |
| N/A   36C    P0               68W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB           On | 00000000:8D:00.0 Off |                    0 |
| N/A   33C    P0               69W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB           On | 00000000:C7:00.0 Off |                    0 |
| N/A   32C    P0               66W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB           On | 00000000:CA:00.0 Off |                    0 |
| N/A   35C    P0               67W / 400W|      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3727132      C   python3.10                                62014MiB |
+---------------------------------------------------------------------------------------+


yippp commented Jan 28, 2024

I also hit `module 'pydantic._internal' has no attribute '_model_construction'` on the newest dev version.

simon-mo (Collaborator) commented Jan 28, 2024

Looks like this is fixed in Ray 2.9 ray-project/ray#41913 (comment). Try upgrading Ray? We will make sure to lower bound the Ray version as well.
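If it helps, here is a quick way to confirm whether you are on the affected combination (a minimal sketch; it assumes the `packaging` package is installed, and the 2.9 threshold comes from the linked Ray issue):

```python
# Sketch: flag the ray/pydantic combination that breaks the dashboard import.
from importlib.metadata import version
from packaging.version import Version

ray_v = Version(version("ray"))
pydantic_v = Version(version("pydantic"))
print(f"ray={ray_v} pydantic={pydantic_v}")
if pydantic_v >= Version("2") and ray_v < Version("2.9"):
    print("Known-bad combination: upgrade ray to >= 2.9 or downgrade pydantic below 2.0.")
```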


yippp commented Jan 28, 2024

> Looks like this is fixed in Ray 2.9 ray-project/ray#41913 (comment). Try upgrading Ray? We will make sure to lower bound the Ray version as well.

I have tried ray==2.9.1 with the newest dev code:

vllm.entrypoints.openai.api_server --model ./Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host 0.0.0.0 --port 8081 --tensor-parallel-size 2

but I hit another error:

Failed: Cuda error /home/ysq/vllm/csrc/custom_all_reduce.cuh:417 'resource already mapped' Segmentation fault (core dumped)

simon-mo (Collaborator) commented Jan 28, 2024

That seems to be a different issue, please open another ticket and I can try reproducing it.

Update: I did try to reproduce it. With the latest main branch:

INFO 01-28 23:03:50 llm_engine.py:317] # GPU blocks: 15130, # CPU blocks: 4096
INFO 01-28 23:03:52 model_runner.py:626] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-28 23:03:52 model_runner.py:630] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=5246) INFO 01-28 23:03:52 model_runner.py:626] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=5246) INFO 01-28 23:03:52 model_runner.py:630] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-28 23:03:58 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
INFO 01-28 23:03:58 model_runner.py:691] Graph capturing finished in 6 secs.
(RayWorkerVllm pid=5246) INFO 01-28 23:03:58 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
(RayWorkerVllm pid=5246) INFO 01-28 23:03:58 model_runner.py:691] Graph capturing finished in 6 secs.
INFO 01-28 23:03:58 serving_chat.py:260] Using default chat template:
INFO 01-28 23:03:58 serving_chat.py:260] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
INFO:     Started server process [3303]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
INFO 01-28 23:04:08 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:18 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:28 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:38 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:48 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:04:58 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:08 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:18 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:28 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:38 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:48 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:05:58 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:06:08 llm_engine.py:872] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-28 23:06:10 async_llm_engine.py:431] Received request cmpl-2b94e87fa6e5414b9b2369ec6f77e666: prompt: '<s>[INST] Say this is a test! [/INST]', prefix_pos: None,sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32753, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: [1, 1, 733, 16289, 28793, 15753, 456, 349, 264, 1369, 28808, 733, 28748, 16289, 28793], lora_request: None.
INFO 01-28 23:06:11 async_llm_engine.py:110] Finished request cmpl-2b94e87fa6e5414b9b2369ec6f77e666.
INFO:     127.0.0.1:57316 - "POST /v1/chat/completions HTTP/1.1" 200 OK

lroberts7 (Author) commented:

> Looks like this is fixed in Ray 2.9 ray-project/ray#41913 (comment). Try upgrading Ray? We will make sure to lower bound the Ray version as well.
>
> I have tried ray==2.9.1 with the newest dev code:
>
> vllm.entrypoints.openai.api_server --model ./Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host 0.0.0.0 --port 8081 --tensor-parallel-size 2
>
> but I hit another error:
>
> Failed: Cuda error /home/ysq/vllm/csrc/custom_all_reduce.cuh:417 'resource already mapped' Segmentation fault (core dumped)

As a workaround, I think you could pass the flag `--disable-custom-all-reduce` so those custom all-reduce kernels are not used.
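For anyone constructing the engine through the `LLM` class instead of the CLI, the same option should be accepted as an engine argument; a minimal sketch (assumed equivalent of the CLI flag, not verified on the failing setup):

```python
# Sketch: disable the custom all-reduce kernels when building the engine directly.
# This is assumed to be the programmatic equivalent of --disable-custom-all-reduce.
from vllm import LLM

llm = LLM(
    model="./Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",
    dtype="auto",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,  # fall back to the default NCCL all-reduce path
)
```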

simon-mo (Collaborator) commented:

I believe #2642 might fix "resource already mapped"; please try with the latest main. Sorry about the back and forth.

lroberts7 (Author) commented:

@simon-mo This works for me:
python -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host 0.0.0.0 --port 8081 --tensor-parallel-size 2

Some stdout:

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host 0.0.0.0 --port 8081 --tensor-parallel-size 2 
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
INFO 01-29 22:22:09 api_server.py:209] args: Namespace(host='0.0.0.0', port=8081, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='TheBloke/Mistral-7B-Instruct-v0.2-AWQ', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 01-29 22:22:09 config.py:177] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-29 22:22:12,853	INFO worker.py:1724 -- Started a local Ray instance.
INFO 01-29 22:22:14 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Mistral-7B-Instruct-v0.2-AWQ', tokenizer='TheBloke/Mistral-7B-Instruct-v0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, seed=0)
(raylet) /usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
(raylet)   warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
INFO 01-29 22:22:22 weight_utils.py:164] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:22 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 01-29 22:22:26 llm_engine.py:322] # GPU blocks: 66274, # CPU blocks: 4096
INFO 01-29 22:22:29 model_runner.py:632] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-29 22:22:29 model_runner.py:636] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:29 model_runner.py:632] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:29 model_runner.py:636] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-29 22:22:33 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
INFO 01-29 22:22:33 model_runner.py:698] Graph capturing finished in 4 secs.
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:33 custom_all_reduce.py:195] Registering 2275 cuda graph addresses
(RayWorkerVllm pid=3935419) INFO 01-29 22:22:33 model_runner.py:698] Graph capturing finished in 4 secs.
INFO 01-29 22:22:34 serving_chat.py:260] Using default chat template:
INFO 01-29 22:22:34 serving_chat.py:260] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
INFO:     Started server process [3928411]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
INFO 01-29 22:22:44 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-29 22:22:54 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-29 22:23:04 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-29 22:23:14 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%

It also works for tensor-parallel 8 (all the GPUs on this machine).

Specs on the A100 machine I'm using:

 NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |

Python env:

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python -c "import vllm, ray, torch, pydantic; print(vllm.__version__); print(ray.__version__); print(torch.__version__); print(pydantic.__version__)"
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
0.2.7
2.9.1
2.1.2+cu121
2.6.0
lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ git log -n 2
commit ea8489fce266d69f2fbe314c1385956b1a342e12 (HEAD -> main, origin/main, origin/HEAD)
Author: Rasmus Larsen <[email protected]>
Date:   Mon Jan 29 19:52:31 2024 +0100

    ROCm: Allow setting compilation target (#2581)

commit 1b20639a43e811f4469e3cfa543cf280d0d76265
Author: Hanzhi Zhou <[email protected]>
Date:   Tue Jan 30 02:46:29 2024 +0800

    No repeated IPC open (#2642)

I don't think the inplace error -> #2620 is resolved though. I still see that one.
