multi-LoRA as extra models in OpenAI server #2775
Conversation
…in openai server entrypoint
was planning to add a test in …
@jvmncs, thanks for the PR. I think you can change the base model to another one that works. Any suggestions? If not, we can also create a new test file just to test LoRA support.
this one looks good: https://huggingface.co/typeof/zephyr-7b-beta-lora
```python
)
async def test_single_completion(server, client: openai.AsyncOpenAI,
                                 model_name: str):
    completion = await client.completions.create(model=model_name,
```
for some reason this test was failing for all cases when I switched to MODEL_NAME="mistralai/Mistral-7B-v0.1". the model was consistently emitting "1999" no matter the prompt/temperature I tried. not really sure why that's the case, but reverting to the zephyr model fixed it
latest commit should be ready for a review, assuming CI passes as it did on my machine
This looks pretty good to me, could we add an example of starting an API server with loras?
sure @Yard1, what kind of example were you thinking? other than the command snippet in my original comment, I'm not sure what that would look like
Hmm, I guess we don't really have an example where we start the server - in that case just extending the docs (https://github.com/vllm-project/vllm/blob/main/docs/source/models/lora.rst) with a snippet of how to run the OpenAI server with lora should be enough!
Co-authored-by: Chunan Zeng <[email protected]>
…lm-project#2854) Co-authored-by: Roy <[email protected]>
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
Co-authored-by: Roy <[email protected]>
@Yard1 should be good to go
Thanks, looks good!
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):

```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```

the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case.

lora config values take the same values they do in `EngineArgs`.

no work has been done here to scope client permissions to specific models.
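A quick way to confirm the three entries, as a minimal sketch assuming the server above is reachable at http://localhost:8000 (the host/port and api_key value are illustrative assumptions, not part of the PR) and the `openai` Python client is installed:

```python
from openai import OpenAI

# Point the client at the locally running vLLM server (URL is an assumption).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Expected to print three ids: the base model plus sql-lora and sql-lora2.
for model in client.models.list().data:
    print(model.id)
```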
@Yard1 can we hot-swap LoRAs, i.e. change the LoRA adapters attached to the base model on the fly, without restarting the base model?
@Wizmak9 you need to start vLLM using the OpenAI-compatible server specifically, following the guidelines provided in the documentation.
The command above loads two LoRA modules as an example. Now, when making a request, you can specify which LoRA you wish to use, and in subsequent requests you can select different LoRA adapters on the fly, without needing to restart the model. Python code to simulate such a request is sketched below.
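A minimal sketch, assuming the server was started with the `--lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH` command shown earlier and is listening on http://localhost:8000 (host/port, api_key, prompts, and max_tokens are illustrative assumptions):

```python
from openai import OpenAI

# OpenAI-compatible client pointed at the local vLLM server (URL is an assumption).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# First request targets one LoRA adapter by passing its name as the model.
completion = client.completions.create(
    model="sql-lora",
    prompt="Write a SQL query that counts all rows in the users table.",
    max_tokens=64,
)
print(completion.choices[0].text)

# A later request can pick a different adapter (or the base model)
# on the fly, with no server restart in between.
completion = client.completions.create(
    model="sql-lora2",
    prompt="Write a SQL query that lists customer names alphabetically.",
    max_tokens=64,
)
print(completion.choices[0].text)
```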
closes #2600