Complete qwen2_5_vl, and some fixes #1184
base: master
Conversation
Todo: set_use_matmul_via_f16(true) from "pipeline/inputs_processor" causes a significant loss of precision, which is hard to track down during later debugging. In any case, globally setting the matmul precision may not be an ideal solution. For now, the precision is changed back in mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs (see the sketch after this list).

- Qwen2_5vl feature is functional; start cleaning up the code
- Add examples for lower_level_qwen2_5vl
- Fix: for deterministic sampling, top k should be Some(1) rather than None
- Clean code
- Rebase
- Clean code
- Fix CUDA
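A minimal sketch of the "change the precision back" idea, assuming the get_use_matmul_via_f16 / set_use_matmul_via_f16 pair referenced later in this thread is exported from mistralrs_quant; the wrapper function itself is hypothetical:

use mistralrs_quant::{get_use_matmul_via_f16, set_use_matmul_via_f16};

// Hypothetical helper: enable the f16 matmul path only for the closure that
// needs it, then restore the previous global value so the precision change
// does not leak into later pipeline stages.
fn with_matmul_via_f16<T>(f: impl FnOnce() -> T) -> T {
    let previous = get_use_matmul_via_f16();
    set_use_matmul_via_f16(true);
    let out = f();
    set_use_matmul_via_f16(previous);
    out
}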
Code Metrics Report
===============================================================================
Language                 Files     Lines      Code  Comments    Blanks
===============================================================================
C Header                     2        34        29         0         5
Dockerfile                   1        41        22        10         9
JSON                        12       105       104         0         1
Makefile                     1         6         5         0         1
Python                      71      3026      2622        81       323
Shell                        1        58        22        18        18
Plain Text                   3      3723         0      2413      1310
TOML                        19       530       492         2        36
YAML                         2        21        19         2         0
-------------------------------------------------------------------------------
Jupyter Notebooks            4         0         0         0         0
|- Markdown                  2        77        32        31        14
|- Python                    2       205       178         1        26
(Total)                              282       210        32        40
-------------------------------------------------------------------------------
Markdown                    49      4041         0      3069       972
|- BASH                      6       103       100         0         3
|- JSON                      1        12        12         0         0
|- Python                    7       121       109         0        12
|- Rust                     16       549       464         0        85
|- TOML                      2        75        63         0        12
(Total)                             4901       748      3069      1084
-------------------------------------------------------------------------------
Rust                       327    107133     95921      2118      9094
|- Markdown                157      1803        25      1639       139
(Total)                           108936     95946      3757      9233
===============================================================================
Total                      492    118718     99236      7713     11769
===============================================================================
Would like this merged in. Facing Qwen2.5VL issues as well. @EricLBuehler btw, could we make some commonly used modules inside mistralrs_core public? That way, not everyone using mistralrs needs to clone the source and modify it, and it would be easier to maintain a separate repo hosting various customized models powered by mistralrs.
Hi @brrr - thanks for the great work! I haven't done any testing yet, but it looks good at an initial glance. Could you please fix the cargo fmt and cargo clippy issues?
As @MonolithFoundation mentioned, it may be helpful for porting new models to make some mistralrs core structures public. During the Qwen2.5-VL porting process, I made some of them pub(crate) and wrote a simple model loader and pipeline, but discarded them before the final push. That is because:
Hi @brrr!
I tested this out and it works well! I added a few (relatively minor) comments which should be addressed before merge; otherwise everything looks great.
let continuous_vid_pad = find_sequences(&ids, vid_pad[0]);
all_continuous_vid_pad.push(continuous_vid_pad);

seq.set_toks(ids);
If we are using set_toks, we need to free & reallocate the sequence for the pagedattn scheduler. Otherwise, problems like #1166 occur.
See:
mistral.rs/mistralrs-core/src/vision_models/qwen2vl/inputs_processor.rs
Lines 398 to 402 in b8237b2

if let Some(ref mut metadata) = paged_attn_metadata {
    // Free and then reallocate as appropriate
    metadata.block_engine.free_sequence(*seq.id());
    metadata.block_engine.allocate(*seq);
}
(We should probably also split this into one that does the free & reallocate automatically, and one like this marked unsafe.)
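For illustration, a rough sketch of what the safe half of that split could look like, mirroring the snippet above; the PagedAttentionMeta type name and signature are assumed here, and the raw token setter itself would be the variant marked unsafe:

// Hypothetical safe wrapper: swap the sequence's tokens and immediately free
// & reallocate its blocks in the paged-attention scheduler, so callers cannot
// forget the second step.
fn set_toks_and_reallocate(
    seq: &mut Sequence,
    toks: Vec<u32>,
    paged_attn_metadata: &mut Option<PagedAttentionMeta<'_>>,
) {
    seq.set_toks(toks);
    if let Some(ref mut metadata) = paged_attn_metadata {
        // Free and then reallocate as appropriate for the new token count.
        metadata.block_engine.free_sequence(*seq.id());
        metadata.block_engine.allocate(seq);
    }
}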
Done.
let num_windows_h = (llm_grid_h + pad_h) / vit_merger_window_size;
let num_windows_w = (llm_grid_w + pad_w) / vit_merger_window_size;
let index_padded = {
    let h = Tensor::full(-100i32, (t, pad_h, llm_grid_w), &Device::Cpu)?;
Let's avoid using "magic values" - can you make this a const so we can identify it by name?
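For illustration, the change could look roughly like this (the constant name here is made up, not necessarily what the PR uses):

// Named sentinel for padded positions in the window index, instead of the
// bare -100 literal (constant name is illustrative).
const WINDOW_INDEX_PADDING: i32 = -100;

let h = Tensor::full(WINDOW_INDEX_PADDING, (t, pad_h, llm_grid_w), &Device::Cpu)?;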
Done.
mistralrs/Cargo.toml
Outdated
@@ -35,3 +35,8 @@ flash-attn = ["cuda", "mistralrs-core/flash-attn"]
accelerate = ["mistralrs-core/accelerate"]
mkl = ["mistralrs-core/mkl"]
nccl = ["mistralrs-core/nccl"]

[[example]]
I see that without this, the lower-level examples are not directly runnable. For consistency, could you please either add corresponding ones for all of the other lower-level examples, or remove this one?
Removed
mistralrs/examples/qwen2_5vl/main.rs
Outdated
use anyhow::Result;
use clap::Parser;
use mistralrs::{
    IsqType, TextMessageRole, TextMessages, VisionLoaderType, VisionMessages, VisionModelBuilder,
Can you please make it so we always use IsqType::Q8_0 (for these models that is probably reasonable), and remove the usage of TextMessages?
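For reference, a hedged sketch of roughly what that shape might look like; the builder and message methods are recalled from other mistral.rs vision examples, and the VisionLoaderType variant, model ID, image path, and response handling are assumptions, so the real example may differ:

use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, VisionLoaderType, VisionMessages, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Always quantize with Q8_0, as requested, instead of taking the ISQ type
    // from the CLI. The loader-type variant name is an assumption.
    let model = VisionModelBuilder::new("Qwen/Qwen2.5-VL-3B-Instruct", VisionLoaderType::Qwen2_5VL)
        .with_isq(IsqType::Q8_0)
        .build()
        .await?;

    // Load the image from a local file (path is a placeholder).
    let bytes = std::fs::read("image.jpg")?;
    let image = image::load_from_memory(&bytes)?;

    // VisionMessages instead of TextMessages, so the image travels with the prompt.
    let messages = VisionMessages::new().add_image_message(
        TextMessageRole::User,
        "Describe this image in detail.",
        image,
        &model,
    )?;

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}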
Done.
mistralrs/examples/qwen2_5vl/main.rs
Outdated
let image = image::load_from_memory(&bytes)?;

// force map all layers to gpu
let device_mapper = DeviceMapSetting::Map(DeviceMapMetadata::from_num_device_layers(vec![
By default, automatic device mapping will be used to optimally spread layers across multiple-GPUs/CPU, while preserving space for activations. Can you please remove this?
Done.
And one more thing: after I removed the manual device mapping, I wanted to test mixed CPU and GPU (Metal) inference. Without the Accelerate feature, matmul on the CPU uses f16, which may cause NaN errors. I intended to use clamp_for_f16 to prevent that from happening, but I don't think it's elegant. @EricLBuehler, could you please reconsider a comprehensive refactoring of this code? Especially once Apple releases a new chip that supports native fp8: if the raw precision is fp8, this will still be a problem.
mistral.rs/mistralrs-quant/src/lib.rs
Lines 191 to 199 in b8237b2
#[cfg(not(feature = "accelerate"))]
{
    if a.device().is_cpu() {
        let original_dtype = a.dtype();
        return a
            .to_dtype(DType::F16)?
            .matmul(&b.to_dtype(DType::F16)?)?
            .to_dtype(original_dtype);
    } else if !get_use_matmul_via_f16() {
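On the clamp_for_f16 idea, a minimal standalone sketch of the clamping approach using candle directly; the helper name and constant are illustrative, not the existing mistralrs-quant code:

use candle_core::{DType, Device, Result, Tensor};

// Largest finite f16 value; anything beyond it overflows to +/-inf after the
// downcast, and inf arithmetic in the matmul can then produce NaN.
const F16_MAX: f64 = 65504.0;

// Illustrative helper: clamp into the finite f16 range before casting down so
// the CPU f16 matmul cannot overflow, then cast the product back up.
fn matmul_via_f16_clamped(a: &Tensor, b: &Tensor) -> Result<Tensor> {
    let original_dtype = a.dtype();
    let a16 = a.clamp(-F16_MAX, F16_MAX)?.to_dtype(DType::F16)?;
    let b16 = b.clamp(-F16_MAX, F16_MAX)?.to_dtype(DType::F16)?;
    a16.matmul(&b16)?.to_dtype(original_dtype)
}

fn main() -> Result<()> {
    // Values near 1e6 would overflow a plain f16 cast; clamping keeps them finite.
    let a = Tensor::rand(0f32, 1e6, (4, 8), &Device::Cpu)?;
    let b = Tensor::rand(0f32, 1e6, (8, 2), &Device::Cpu)?;
    println!("{}", matmul_via_f16_clamped(&a, &b)?);
    Ok(())
}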
@EricLBuehler @brrr In PR #1185 the needed components are made public, and we have written a third-party model zoo, including Qwen2.5-VL, on top of it: https://github.com/lucasjinreal/Namors. It works well, and it's far easier to write customized models with this modification.
Complete: qwen2_5_vl model
Add: examples for qwen2_5_vl
Fix: for deterministic sampling, top k should be Some(1) rather than None
Fix: UnquantLinear forward error when the device is CUDA but batch_matmul is not used