multi-LoRA as extra models in OpenAI server #2775
Conversation
…in openai server entrypoint
was planning to add a test in …
@jvmncs, thanks for the PR. I think you can change the base model to another one that works. Any suggestions? If not, we can also create a new test file just to test LoRA support.
this one looks good: https://huggingface.co/typeof/zephyr-7b-beta-lora
```python
)
async def test_single_completion(server, client: openai.AsyncOpenAI,
                                 model_name: str):
    completion = await client.completions.create(model=model_name,
```
for some reason this test was failing for all cases when I switched to MODEL_NAME="mistralai/Mistral-7B-v0.1". the model was consistently emitting "1999" no matter the prompt/temperature I tried. not really sure why that's the case, but reverting to the zephyr model fixed it
latest commit should be ready for a review, assuming CI passes as it did on my machine
This looks pretty good to me, could we add an example of starting an API server with loras?
sure @Yard1, what kind of example were you thinking? other than the command snippet in my original comment, I'm not sure what that would look like
Hmm, I guess we don't really have an example where we start the server - in that case just extending the docs (https://github.com/vllm-project/vllm/blob/main/docs/source/models/lora.rst) with a snippet of how to run the OpenAI server with lora should be enough!
Co-authored-by: Chunan Zeng <[email protected]>
…lm-project#2854) Co-authored-by: Roy <[email protected]>
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
Co-authored-by: Roy <[email protected]>
@Yard1 should be good to go
Thanks, looks good!
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):

```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```

the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case.

lora config values take the same values they do in `EngineArgs`.

no work has been done here to scope client permissions to specific models.
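A quick way to confirm the three entries, as a minimal sketch assuming the server above is reachable at http://localhost:8000 (the host/port and api_key value are illustrative assumptions, not part of the PR) and the `openai` Python client is installed:

```python
from openai import OpenAI

# Point the client at the locally running vLLM server (URL is an assumption).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Expected to print three ids: the base model plus sql-lora and sql-lora2.
for model in client.models.list().data:
    print(model.id)
```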
@Yard1 can we hot-swap LoRAs, i.e. change the LoRA adapters attached to the base model on the fly, without restarting the base model?
@Wizmak9 you need to start vLLM using the OpenAI-compatible server specifically, following the guidelines provided in the documentation.
The command above loads two LoRA modules as an example. Now, when making a request, you can specify which LoRA you wish to use, and in subsequent requests you can select different LoRA adapters on the fly, without needing to restart the model. Python code to simulate such a request is sketched below.
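A minimal sketch, assuming the server was started with the `--lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH` command shown earlier and is listening on http://localhost:8000 (host/port, api_key, prompts, and max_tokens are illustrative assumptions):

```python
from openai import OpenAI

# OpenAI-compatible client pointed at the local vLLM server (URL is an assumption).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# First request targets one LoRA adapter by passing its name as the model.
completion = client.completions.create(
    model="sql-lora",
    prompt="Write a SQL query that counts all rows in the users table.",
    max_tokens=64,
)
print(completion.choices[0].text)

# A later request can pick a different adapter (or the base model)
# on the fly, with no server restart in between.
completion = client.completions.create(
    model="sql-lora2",
    prompt="Write a SQL query that lists customer names alphabetically.",
    max_tokens=64,
)
print(completion.choices[0].text)
```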
closes #2600