
Complete qwen2_5_vl, and some fixes #1184

Open · wants to merge 9 commits into master

Conversation

@brrr commented Mar 10, 2025

Complete: qwen2_5_vl model
Add: examples for qwen2_5_vl
Fix: for deterministic sampling, top_k SHOULD be Some(1) rather than None (see the sketch below)
Fix: UnquantLinear forward error when the device is CUDA but batch_matmul is not used
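
A minimal, self-contained sketch (not the PR's actual sampler code) of why the top_k fix matters: with top_k = None the sampler still draws from the full distribution, whereas top_k = Some(1) collapses the candidate set to the argmax, which is what "deterministic" requires.

```rust
// Toy illustration only; mistralrs' real sampler is more involved.
fn pick_token(logits: &[f32], top_k: Option<usize>) -> usize {
    // Candidate token ids sorted by descending logit.
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    let kept = match top_k {
        Some(k) => &idx[..k.min(idx.len())], // Some(1) keeps only the argmax
        None => &idx[..],                    // None keeps every candidate
    };
    // A real sampler would draw randomly from `kept`; with a single candidate
    // the result is deterministic by construction.
    kept[0]
}
```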

EricLBuehler and others added 7 commits March 10, 2025 18:07
Todo: set_use_matmul_via_f16(true) from "pipeline/inputs_processor" causes a significant loss of precision.
It's hard to track this down during subsequent debugging.
Anyhow, globally setting the matmul precision MAY not be an ideal solution.
For now, change the precision back in mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs

Qwen2_5vl feature is functional, start to clean code

Add examples for lower_level_qwen2_5vl

Fix: for deterministic sampling, top k SHOULD be Some(1) rather than None

Clean code

Rebase

Clean code

Fix cuda

github-actions bot commented Mar 10, 2025

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           34           29            0            5
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Makefile                1            6            5            0            1
 Python                 71         3026         2622           81          323
 Shell                   1           58           22           18           18
 Plain Text              3         3723            0         2413         1310
 TOML                   19          530          492            2           36
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               49         4041            0         3069          972
 |- BASH                 6          103          100            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                16          549          464            0           85
 |- TOML                 2           75           63            0           12
 (Total)                           4901          748         3069         1084
-------------------------------------------------------------------------------
 Rust                  327       107133        95921         2118         9094
 |- Markdown           157         1803           25         1639          139
 (Total)                         108936        95946         3757         9233
===============================================================================
 Total                 492       118718        99236         7713        11769
===============================================================================
  

@MonolithFoundation

Would like this merged in. Facing Qwen2.5VL issues as well. @EricLBuehler

BTW, can we make some commonly used mods inside mistralrs_core public, so that not everyone using mistralrs needs to clone the source and modify it? That would make it easier to maintain a separate repo hosting various customized models powered by mistralrs.

@EricLBuehler (Owner) left a comment

Hi @brrr - thanks for the great work! I haven't done any testing yet, but it looks good at an initial glance. Could you please fix the cargo fmt and cargo clippy issues?

@brrr (Author) commented Mar 11, 2025

As @MonolithFoundation mentioned, making some mistralrs core structures public MAY be helpful for porting new models. During the qwen 2.5vl porting process, I made some of them pub(crate) and wrote a simple model loader and pipeline, but discarded them before the final push. That's because:

  1. Most of the time, downstream code SHOULD NOT use core inner structures directly, but rather lifetime hooks with some kind of context (a rough sketch of this idea follows below). Separate repos that use the inner structures directly MAY face lots of breaking changes whenever upstream changes, making them hard to maintain.
  2. Besides lifetime hooks and a context, we SHOULD also provide a scaffold, which MAY include tracing and performance metrics (e.g. OpenTelemetry) and evals to test each hook function's input/output (tensor shape, precision, even comparing against a persisted tensor such as an npy file exported from Transformers).
  3. "Conversion kits": we SHOULD provide OEM kits covering the prefix cache, KV cache, attention implementations, etc., but also a simple way for downstream users to swap parts out easily. For example, an agentic framework with a built-in LLM inference engine MAY want to use its own prefix cache and KV cache implementations.

I think this, plus "blazingly fast", will be the key factor in driving massive adoption of mistralrs.
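
To make point 1 concrete, here is a purely hypothetical sketch of what a hook-plus-context API could look like. None of these traits or types exist in mistralrs today; candle_core is assumed as the tensor crate.

```rust
use candle_core::{Result, Tensor};

/// Context handed to every hook, so downstream code never has to reach into
/// core internals directly.
pub struct HookContext {
    pub layer_idx: usize,
    pub seq_len: usize,
}

/// Hooks a separate model repo could implement instead of forking mistralrs-core.
pub trait PipelineHooks {
    /// Called after input processing, before the forward pass.
    fn on_inputs(&mut self, ctx: &HookContext, input_ids: &Tensor) -> Result<()> {
        let _ = (ctx, input_ids);
        Ok(())
    }

    /// Called after each transformer block, e.g. to log shapes/precision or to
    /// compare against a reference tensor exported from Transformers as .npy.
    fn on_block_output(&mut self, ctx: &HookContext, hidden: &Tensor) -> Result<()> {
        let _ = (ctx, hidden);
        Ok(())
    }
}
```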

@EricLBuehler (Owner) left a comment

Hi @brrr!

I tested this out and it works well! I added a few (relatively minor) comments which should be addressed before merge; otherwise everything looks great.

let continuous_vid_pad = find_sequences(&ids, vid_pad[0]);
all_continuous_vid_pad.push(continuous_vid_pad);

seq.set_toks(ids);

Owner

If we are using set_toks, we need to free & reallocate the sequence for the pagedattn scheduler. Otherwise, problems like #1166 occur.

See:

if let Some(ref mut metadata) = paged_attn_metadata {
    // Free and then reallocate as appropriate
    metadata.block_engine.free_sequence(*seq.id());
    metadata.block_engine.allocate(*seq);
}

(we should probably also split this into one that does the free & reallocate automatically, and one like this marked unsafe).
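
A rough, self-contained sketch of that split, using stub types in place of mistralrs' real Sequence / BlockEngine / PagedAttention metadata; the names and signatures are illustrative, not the actual API.

```rust
// Stub types standing in for the real scheduler structures.
struct BlockEngine;

impl BlockEngine {
    fn free_sequence(&mut self, _id: usize) {}
    fn allocate(&mut self, _seq: &Sequence) {}
}

struct PagedAttentionMeta {
    block_engine: BlockEngine,
}

struct Sequence {
    id: usize,
    toks: Vec<u32>,
}

impl Sequence {
    fn id(&self) -> &usize {
        &self.id
    }

    /// Safe variant: swap the tokens and keep the pagedattn block tables consistent.
    fn set_toks_and_reallocate(&mut self, toks: Vec<u32>, meta: Option<&mut PagedAttentionMeta>) {
        self.toks = toks;
        if let Some(meta) = meta {
            // Free and then reallocate as appropriate.
            meta.block_engine.free_sequence(*self.id());
            meta.block_engine.allocate(self);
        }
    }

    /// "Unsafe" variant: the caller is responsible for freeing/reallocating
    /// blocks itself, otherwise problems like #1166 can reappear.
    unsafe fn set_toks_unchecked(&mut self, toks: Vec<u32>) {
        self.toks = toks;
    }
}

fn main() {
    let mut seq = Sequence { id: 0, toks: vec![1, 2, 3] };
    let mut meta = PagedAttentionMeta { block_engine: BlockEngine };
    seq.set_toks_and_reallocate(vec![4, 5, 6], Some(&mut meta));
    assert_eq!(seq.toks, vec![4, 5, 6]);
}
```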

Author

Done.

let num_windows_h = (llm_grid_h + pad_h) / vit_merger_window_size;
let num_windows_w = (llm_grid_w + pad_w) / vit_merger_window_size;
let index_padded = {
    let h = Tensor::full(-100i32, (t, pad_h, llm_grid_w), &Device::Cpu)?;

Owner

Let's avoid using "magic values" - can you make this a const so we can identify it by name?
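
For example, something along these lines (the constant's name below is only a suggestion, not what the PR ultimately used):

```rust
/// Sentinel written into padded positions of the window index and filtered out
/// again after windowing.
const PADDING_SENTINEL: i32 = -100;

// ...and in the snippet above:
// let h = Tensor::full(PADDING_SENTINEL, (t, pad_h, llm_grid_w), &Device::Cpu)?;
```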

Author

Done.

@@ -35,3 +35,8 @@ flash-attn = ["cuda", "mistralrs-core/flash-attn"]
accelerate = ["mistralrs-core/accelerate"]
mkl = ["mistralrs-core/mkl"]
nccl = ["mistralrs-core/nccl"]

[[example]]

Owner

I see that without this, the lower-level examples are not directly runnable. For consistency, could you please add corresponding ones for all of the other lower-level examples, or remove this one.

Author

Removed

use anyhow::Result;
use clap::Parser;
use mistralrs::{
IsqType, TextMessageRole, TextMessages, VisionLoaderType, VisionMessages, VisionModelBuilder,

Owner

Can you please make it so we always use IsqType::Q8_0 (for these models that is probably reasonable), and remove the usage of TextMessages?
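
For reference, a minimal sketch of what the example could look like after this change; the builder calls follow the mistralrs VisionModelBuilder API of that era, but the model id and the exact Qwen2.5-VL loader-type variant name are assumptions.

```rust
use anyhow::Result;
use mistralrs::{IsqType, VisionLoaderType, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Placeholder model id; the loader-type variant name is assumed.
    let model = VisionModelBuilder::new(
        "Qwen/Qwen2.5-VL-3B-Instruct",
        VisionLoaderType::Qwen2_5VL,
    )
    .with_isq(IsqType::Q8_0) // always quantize with Q8_0, per this review comment
    .with_logging()
    .build()
    .await?;

    // Requests are then built with VisionMessages rather than TextMessages;
    // that part is omitted here.
    let _ = model;
    Ok(())
}
```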

Author

Done.

let image = image::load_from_memory(&bytes)?;

//force map all layers to gpu
let device_mapper = DeviceMapSetting::Map(DeviceMapMetadata::from_num_device_layers(vec![

Owner

By default, automatic device mapping will be used to optimally spread layers across multiple-GPUs/CPU, while preserving space for activations. Can you please remove this?

@brrr (Author) commented Mar 12, 2025

> By default, automatic device mapping will be used to optimally spread layers across multiple-GPUs/CPU, while preserving space for activations. Can you please remove this?

Done.

And one more thing: after I removed the manual device mapping, I wanted to test mixed CPU and GPU (Metal) inference. Without the Accelerate feature, matmul will use f16, which MAY cause NaN errors. I intended to use clamp_for_f16 to prevent that from happening, but I don't think it's elegant. @EricLBuehler, could you please consider a comprehensive refactoring of this code? In particular, when Apple releases a new chip that supports native fp8 and the raw precision is fp8, it will still be a problem.

#[cfg(not(feature = "accelerate"))]
{
    if a.device().is_cpu() {
        let original_dtype = a.dtype();
        return a
            .to_dtype(DType::F16)?
            .matmul(&b.to_dtype(DType::F16)?)?
            .to_dtype(original_dtype);
    } else if !get_use_matmul_via_f16() {
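
As a sketch of the clamp_for_f16 idea mentioned above (the helper here only mimics what such a function would do; it is not mistralrs' implementation), values can be clamped into f16's finite range before the dtype cast, so the cast itself cannot overflow to +/-inf and subsequently produce NaN:

```rust
use candle_core::{DType, Result, Tensor};

/// Hypothetical stand-in for `clamp_for_f16`: clamp into f16's finite range
/// (+/-65504) so the cast to F16 below cannot overflow.
fn clamp_for_f16_sketch(x: &Tensor) -> Result<Tensor> {
    const F16_MAX: f64 = 65504.0;
    x.clamp(-F16_MAX, F16_MAX)
}

/// The quoted CPU branch with the clamp applied before dropping to f16.
fn matmul_via_f16(a: &Tensor, b: &Tensor) -> Result<Tensor> {
    let original_dtype = a.dtype();
    let a = clamp_for_f16_sketch(a)?.to_dtype(DType::F16)?;
    let b = clamp_for_f16_sketch(b)?.to_dtype(DType::F16)?;
    a.matmul(&b)?.to_dtype(original_dtype)
}
```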

@MonolithFoundation

@EricLBuehler @brrr PR #1185 ports the needed components to public, and we have written a third-party model zoo on top of it, including qwen2.5 vl; it works well: https://github.com/lucasjinreal/Namors

It's far easier to write customized models with this modification.
