
Complete qwen2_5_vl, and some fixes #1184

Open · wants to merge 9 commits into master

Conversation

@brrr commented Mar 10, 2025

Complete: qwen2_5_vl model
Add: examples for qwen2_5_vl
Fix: for deterministic sampling, top_k SHOULD be Some(1) rather than None (see the sketch below)
Fix: UnquantLinear forward error when the device is CUDA but batch_matmul is not used
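
A minimal, self-contained sketch (not the PR's actual sampler code) of why the top_k fix matters: with top_k = None the sampler still draws from the full distribution, whereas top_k = Some(1) collapses the candidate set to the argmax, which is what "deterministic" requires.

```rust
// Toy illustration only; mistralrs' real sampler is more involved.
fn pick_token(logits: &[f32], top_k: Option<usize>) -> usize {
    // Candidate token ids sorted by descending logit.
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    let kept = match top_k {
        Some(k) => &idx[..k.min(idx.len())], // Some(1) keeps only the argmax
        None => &idx[..],                    // None keeps every candidate
    };
    // A real sampler would draw randomly from `kept`; with a single candidate
    // the result is deterministic by construction.
    kept[0]
}
```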

EricLBuehler and others added 7 commits March 10, 2025 18:07
Todo: set_use_matmul_via_f16(true) from "pipeline/inputs_processor" causes a significant loss of precision.
It's hard to track this down during subsequent debugging.
Anyhow, globally setting the matmul precision MAY not be an ideal solution.
For now, change the precision back in mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs

Qwen2_5vl feature is functional, start to clean code

Add examples for lower_level_qwen2_5vl

Fix: for deterministic sampling, top k SHOULD be Some(1) rather than None

Clean code

Rebase

Clean code

Fix cuda

github-actions bot commented Mar 10, 2025

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           34           29            0            5
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Makefile                1            6            5            0            1
 Python                 71         3026         2622           81          323
 Shell                   1           58           22           18           18
 Plain Text              3         3723            0         2413         1310
 TOML                   19          530          492            2           36
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               49         4041            0         3069          972
 |- BASH                 6          103          100            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                16          549          464            0           85
 |- TOML                 2           75           63            0           12
 (Total)                           4901          748         3069         1084
-------------------------------------------------------------------------------
 Rust                  327       107133        95921         2118         9094
 |- Markdown           157         1803           25         1639          139
 (Total)                         108936        95946         3757         9233
===============================================================================
 Total                 492       118718        99236         7713        11769
===============================================================================
  

@MonolithFoundation

Would like this merged in. Facing Qwen2.5VL issues as well. @EricLBuehler

BTW, can we make some commonly used mods inside mistralrs_core public, so that not everyone using mistralrs needs to clone the source and modify it? That would make it easier to maintain a separate repo hosting various customized models powered by mistralrs.

@EricLBuehler (Owner) left a comment

Hi @brrr - thanks for the great work! I haven't done any testing yet, but it looks good at an initial glance. Could you please fix the cargo fmt and cargo clippy issues?

@brrr (Author) commented Mar 11, 2025

As @MonolithFoundation mentioned, making some mistralrs core structures public MAY be helpful for porting new models. During the qwen 2.5vl porting process, I made some of them pub(crate) and wrote a simple model loader and pipeline, but discarded them before the final push. That's because:

  1. Most of the time, downstream code SHOULD NOT use core inner structures directly, but rather lifetime hooks with some kind of context (a rough sketch of this idea follows below). Separate repos that use the inner structures directly MAY face lots of breaking changes whenever upstream changes, making them hard to maintain.
  2. Besides lifetime hooks and a context, we SHOULD also provide a scaffold, which MAY include tracing and performance metrics (e.g. OpenTelemetry) and evals to test each hook function's input/output (tensor shape, precision, even comparing against a persisted tensor such as an npy file exported from Transformers).
  3. "Conversion kits": we SHOULD provide OEM kits covering the prefix cache, KV cache, attention implementations, etc., but also a simple way for downstream users to swap parts out easily. For example, an agentic framework with a built-in LLM inference engine MAY want to use its own prefix cache and KV cache implementations.

I think this, plus "blazingly fast", will be the key factor in driving massive adoption of mistralrs.
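
To make point 1 concrete, here is a purely hypothetical sketch of what a hook-plus-context API could look like. None of these traits or types exist in mistralrs today; candle_core is assumed as the tensor crate.

```rust
use candle_core::{Result, Tensor};

/// Context handed to every hook, so downstream code never has to reach into
/// core internals directly.
pub struct HookContext {
    pub layer_idx: usize,
    pub seq_len: usize,
}

/// Hooks a separate model repo could implement instead of forking mistralrs-core.
pub trait PipelineHooks {
    /// Called after input processing, before the forward pass.
    fn on_inputs(&mut self, ctx: &HookContext, input_ids: &Tensor) -> Result<()> {
        let _ = (ctx, input_ids);
        Ok(())
    }

    /// Called after each transformer block, e.g. to log shapes/precision or to
    /// compare against a reference tensor exported from Transformers as .npy.
    fn on_block_output(&mut self, ctx: &HookContext, hidden: &Tensor) -> Result<()> {
        let _ = (ctx, hidden);
        Ok(())
    }
}
```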

@EricLBuehler (Owner) left a comment

Hi @brrr!

I tested this out and it works well! I added a few (relatively minor) comments which should be addressed before merge; otherwise everything looks great.

let continuous_vid_pad = find_sequences(&ids, vid_pad[0]);
all_continuous_vid_pad.push(continuous_vid_pad);

seq.set_toks(ids);

Owner

If we are using set_toks, we need to free & reallocate the sequence for the pagedattn scheduler. Otherwise, problems like #1166 occur.

See:

if let Some(ref mut metadata) = paged_attn_metadata {
    // Free and then reallocate as appropriate
    metadata.block_engine.free_sequence(*seq.id());
    metadata.block_engine.allocate(*seq);
}

(we should probably also split this into one that does the free & reallocate automatically, and one like this marked unsafe).
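
A rough, self-contained sketch of that split, using stub types in place of mistralrs' real Sequence / BlockEngine / PagedAttention metadata; the names and signatures are illustrative, not the actual API.

```rust
// Stub types standing in for the real scheduler structures.
struct BlockEngine;

impl BlockEngine {
    fn free_sequence(&mut self, _id: usize) {}
    fn allocate(&mut self, _seq: &Sequence) {}
}

struct PagedAttentionMeta {
    block_engine: BlockEngine,
}

struct Sequence {
    id: usize,
    toks: Vec<u32>,
}

impl Sequence {
    fn id(&self) -> &usize {
        &self.id
    }

    /// Safe variant: swap the tokens and keep the pagedattn block tables consistent.
    fn set_toks_and_reallocate(&mut self, toks: Vec<u32>, meta: Option<&mut PagedAttentionMeta>) {
        self.toks = toks;
        if let Some(meta) = meta {
            // Free and then reallocate as appropriate.
            meta.block_engine.free_sequence(*self.id());
            meta.block_engine.allocate(self);
        }
    }

    /// "Unsafe" variant: the caller is responsible for freeing/reallocating
    /// blocks itself, otherwise problems like #1166 can reappear.
    unsafe fn set_toks_unchecked(&mut self, toks: Vec<u32>) {
        self.toks = toks;
    }
}

fn main() {
    let mut seq = Sequence { id: 0, toks: vec![1, 2, 3] };
    let mut meta = PagedAttentionMeta { block_engine: BlockEngine };
    seq.set_toks_and_reallocate(vec![4, 5, 6], Some(&mut meta));
    assert_eq!(seq.toks, vec![4, 5, 6]);
}
```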

Author

Done.

let num_windows_h = (llm_grid_h + pad_h) / vit_merger_window_size;
let num_windows_w = (llm_grid_w + pad_w) / vit_merger_window_size;
let index_padded = {
    let h = Tensor::full(-100i32, (t, pad_h, llm_grid_w), &Device::Cpu)?;

Owner

Let's avoid using "magic values" - can you make this a const so we can identify it by name?
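
For example, something along these lines (the constant's name below is only a suggestion, not what the PR ultimately used):

```rust
/// Sentinel written into padded positions of the window index and filtered out
/// again after windowing.
const PADDING_SENTINEL: i32 = -100;

// ...and in the snippet above:
// let h = Tensor::full(PADDING_SENTINEL, (t, pad_h, llm_grid_w), &Device::Cpu)?;
```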

Author

Done.

@@ -35,3 +35,8 @@ flash-attn = ["cuda", "mistralrs-core/flash-attn"]
accelerate = ["mistralrs-core/accelerate"]
mkl = ["mistralrs-core/mkl"]
nccl = ["mistralrs-core/nccl"]

[[example]]

Owner

I see that without this, the lower-level examples are not directly runnable. For consistency, could you please add corresponding ones for all of the other lower-level examples, or remove this one.

Author

Removed

use anyhow::Result;
use clap::Parser;
use mistralrs::{
IsqType, TextMessageRole, TextMessages, VisionLoaderType, VisionMessages, VisionModelBuilder,

Owner

Can you please make it so we always use IsqType::Q8_0 (for these models that is probably reasonable), and remove the usage of TextMessages?
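
For reference, a minimal sketch of what the example could look like after this change; the builder calls follow the mistralrs VisionModelBuilder API of that era, but the model id and the exact Qwen2.5-VL loader-type variant name are assumptions.

```rust
use anyhow::Result;
use mistralrs::{IsqType, VisionLoaderType, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Placeholder model id; the loader-type variant name is assumed.
    let model = VisionModelBuilder::new(
        "Qwen/Qwen2.5-VL-3B-Instruct",
        VisionLoaderType::Qwen2_5VL,
    )
    .with_isq(IsqType::Q8_0) // always quantize with Q8_0, per this review comment
    .with_logging()
    .build()
    .await?;

    // Requests are then built with VisionMessages rather than TextMessages;
    // that part is omitted here.
    let _ = model;
    Ok(())
}
```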

Author

Done.

let image = image::load_from_memory(&bytes)?;

//force map all layers to gpu
let device_mapper = DeviceMapSetting::Map(DeviceMapMetadata::from_num_device_layers(vec![

Owner

By default, automatic device mapping will be used to optimally spread layers across multiple-GPUs/CPU, while preserving space for activations. Can you please remove this?

@brrr (Author) commented Mar 12, 2025

> By default, automatic device mapping will be used to optimally spread layers across multiple-GPUs/CPU, while preserving space for activations. Can you please remove this?

Done.

And one more thing: after I removed the manual device mapping, I wanted to test mixed CPU and GPU (Metal) inference. Without the Accelerate feature, matmul will use f16, which MAY cause NaN errors. I intended to use clamp_for_f16 to prevent that from happening, but I don't think it's elegant. @EricLBuehler, could you please consider a comprehensive refactoring of this code? In particular, when Apple releases a new chip that supports native fp8 and the raw precision is fp8, it will still be a problem.

#[cfg(not(feature = "accelerate"))]
{
    if a.device().is_cpu() {
        let original_dtype = a.dtype();
        return a
            .to_dtype(DType::F16)?
            .matmul(&b.to_dtype(DType::F16)?)?
            .to_dtype(original_dtype);
    } else if !get_use_matmul_via_f16() {
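
As a sketch of the clamp_for_f16 idea mentioned above (the helper here only mimics what such a function would do; it is not mistralrs' implementation), values can be clamped into f16's finite range before the dtype cast, so the cast itself cannot overflow to +/-inf and subsequently produce NaN:

```rust
use candle_core::{DType, Result, Tensor};

/// Hypothetical stand-in for `clamp_for_f16`: clamp into f16's finite range
/// (+/-65504) so the cast to F16 below cannot overflow.
fn clamp_for_f16_sketch(x: &Tensor) -> Result<Tensor> {
    const F16_MAX: f64 = 65504.0;
    x.clamp(-F16_MAX, F16_MAX)
}

/// The quoted CPU branch with the clamp applied before dropping to f16.
fn matmul_via_f16(a: &Tensor, b: &Tensor) -> Result<Tensor> {
    let original_dtype = a.dtype();
    let a = clamp_for_f16_sketch(a)?.to_dtype(DType::F16)?;
    let b = clamp_for_f16_sketch(b)?.to_dtype(DType::F16)?;
    a.matmul(&b)?.to_dtype(original_dtype)
}
```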

@MonolithFoundation

@EricLBuehler @brrr PR #1185 ports the needed components to public, and we have written a third-party model zoo on top of it, including qwen2.5 vl; it works well: https://github.com/lucasjinreal/Namors

It's far easier to write customized models with this modification.
