llama : second attempt to refactor vision API #11292
base: master
Conversation
Hi @ggerganov @slaren, I would like to ask for an early review from you before proceeding further. What will be interesting to discuss here is the usage of the new API, as demoed in the newly added example.
I'm already able to make llava and mobilevlm work with it.
Things that are different from the initial discussion in #8010:
And things that are still messy and will need more work:
I would love to hear your opinions about this. Thank you!
if (ctx.ctx_ggml) {
    ggml_free(ctx.ctx_ggml);
}
ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead(),
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,
};
ctx.ctx_ggml = ggml_init(params);
ctx.output = ggml_dup_tensor(ctx.ctx_ggml, output_node);
ggml_backend_alloc_ctx_tensors_from_buft(ctx.ctx_ggml, ctx.model->buft);
ggml_backend_tensor_copy(output_node, ctx.output);
@slaren Not sure if there is a better way, but I'm using a hacky solution here.
Without a dedicated context (and ggml_backend_tensor_copy), the underlying buffer is reallocated before the next llama_decode, rendering the data unusable.
If the vision part uses the same scheduler as the llama_context, that's unavoidable. You could pre-allocate the tensor in a different buffer to avoid the copy, but that's an optimization that can be done later.
If we have a separate encoder context for the clip model, the decoder context could reference tensors from it directly. They would be interpreted as inputs for the decoder.
@ngxson I'll take a closer look at this today, and specifically at how this could work with a cross-attention model like Llama 3.2 Vision 👍 One thing that is related to this work is something we discussed about how these models should be provided. I initially thought that creating a single .gguf for Llama 3.2 which contained both the vision encoder and the language model would be the way to go, but as can be read in the linked discussion, having separate models is probably a better solution. It would be great to get some clarification regarding this and whether vision encoders should be separate .gguf models.
@slaren In my first proposal, I made ...
Btw, I have been repeatedly asked about ...
Adding some thoughts that I have so far.
Continuing along the idea of having separate models and contexts for the encoder and the decoder, I think with a proper llama_batch abstraction we can have the following API:
// vision
patches0 = llama_vision_tokenize(ctx_enc_v, img0);
patches1 = llama_vision_tokenize(ctx_enc_v, img1);
llama_batch_add_image(batch_enc_v, patches0);
llama_batch_add_image(batch_enc_v, patches1);
llama_encode(ctx_enc_v, batch_enc_v);
embd_enc_v = llama_get_embeddings(ctx_enc_v);
// audio
mel0 = llama_audio_tokenize(ctx_enc_a, audio0);
mel1 = llama_audio_tokenize(ctx_enc_a, audio1);
llama_batch_add_audio(batch_enc_a, mel0);
llama_batch_add_audio(batch_enc_a, mel1);
llama_encode(ctx_enc_a, batch_enc_a);
embd_enc_a = llama_get_embeddings(ctx_enc_a);
// text + vision + audio
tokens0 = llama_tokenize(ctx_dec, text0);
tokens1 = llama_tokenize(ctx_dec, text1);
llama_batch_add_text (batch_dec, tokens0);
llama_batch_add_embd_image(batch_dec, embd_enc_v);
llama_batch_add_embd_audio(batch_dec, embd_enc_a);
llama_batch_add_text (batch_dec, tokens1);
llama_decode(ctx_dec, batch_dec);
For cross-attention models such as Llama 3.2 Vision and Whisper, the decoding context ctx_dec could be initialized with a reference to the encoder context:
llama_context_params cparams_dec;
cparams_dec.ctx_cross[0] = ctx_enc_v;
cparams_dec.ctx_cross[1] = ctx_enc_a;
Edit: extended the example with audio input as well.
src/llama-vision.cpp
Outdated
static ggml_cgraph * clip_image_build_graph(clip_context & ctx, int batch_size, clip_image_size & image_size) {
    auto & model   = *ctx.model;
    auto & hparams = ctx.model->hparams;

    const int hidden_size   = hparams.hidden_size;
    const int n_head        = hparams.n_head;
    const int d_head        = hidden_size / n_head;
    const int patch_size    = hparams.patch_size;
    const float eps         = hparams.eps;
    const int num_patches   = ((image_size.width / patch_size) * (image_size.height / patch_size));
    const int num_positions = num_patches + (model.class_embedding ? 1 : 0);

    LLAMA_LOG_DEBUG("%s: num_patches = %d\n", __func__, num_patches);
The clip graph should be constructed like any other graph in src/llama.cpp, i.e. in llm_build_context.
I'm not sure how to do this right now, as I can't see how I can re-use the existing build_* functions to make the cgraph of vision models "blend in" with the rest of llm_build_context.
What I did so far is to make an equivalent called llama_vision_graph_builder. This is meant to be a temporary solution, to simplify the migration in the near future.
Could you please have a look at my llama_vision_graph_builder to see how it can be merged into llm_build_context? Thanks!
src/llama-vision.cpp
Outdated
    delete p;
}

int32_t llama_vision_encode(struct llama_context * ctx, llama_vision_patches * p) {
Don't think we need a separate function - we should be able to reuse llama_encode.
Hmm, I don't think we can do this right now, as it requires llama_batch to also accept image tokens. Do you think it's OK to keep llama_vision_encode(llama_img_tokens &) and refactor llama_batch later on?
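For reference, a rough sketch of the interim flow with the functions this PR currently exposes; llama_vision_get_output_tensor() is a hypothetical accessor name used only to mark where the encoder output would be retrieved:

#include "llama.h" // assuming the vision API from this PR is exposed here

static struct ggml_tensor * encode_image(struct llama_context * ctx, llama_vision_bitmap * bmp) {
    // "tokenize" the image: split it into patches/slices on the CPU
    llama_vision_patches * p = llama_vision_patches_init(ctx, bmp);

    // run the vision encoder graph on the patches
    if (llama_vision_encode(ctx, p) != 0) {
        return NULL;
    }

    // hypothetical accessor: fetch the encoder output embeddings, which would
    // then be placed into a llama_batch for llama_decode() once llama_batch
    // accepts image embeddings
    return llama_vision_get_output_tensor(ctx);
}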
src/llama-vision.cpp
Outdated
struct llama_vision_patches * llama_vision_patches_init(
        struct llama_context * ctx,
        llama_vision_bitmap * bmp) {
    clip_context & vctx = ctx->vctx;
    if (vctx.model->hparams.arch == VISION_ARCH_MINICPMV) {
        return new llama_vision_patches(clip_image_preprocess_minicpmv(vctx, *bmp));
    }
    return new llama_vision_patches(clip_image_preprocess(vctx, *bmp));
}
I agree that the analogy of "tokenization" in the context of vision models is the conversion of "images -> patches". So the patches could be considered as "image tokens" and it seems reasonable to have a separate function to create patches, since this would have to be performed on the CPU.
I am just wondering, is there any reason to expose the patches/slices to the user at all? Can the user do anything with the patches other than just immediately call llama_vision_encode and throw them away? If not, then maybe that could be hidden entirely from the user and llama_vision_encode could take directly an image.
Even though the user cannot explicitly operate with the patches, it seems to make sense to expose this in order to be able to multi-thread the pre-processing step.
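For example, a rough user-side sketch of parallel pre-processing with the functions from this PR; this is illustrative only and assumes llama_vision_patches_init() is safe to call concurrently on the same context, which the PR does not guarantee:

#include "llama.h" // assuming the vision API from this PR is exposed here
#include <thread>
#include <vector>

// pre-process several images in parallel before encoding them one by one
static std::vector<llama_vision_patches *> preprocess_images(
        struct llama_context * ctx,
        const std::vector<llama_vision_bitmap *> & imgs) {
    std::vector<llama_vision_patches *> patches(imgs.size(), nullptr);
    std::vector<std::thread> workers;
    for (size_t i = 0; i < imgs.size(); ++i) {
        workers.emplace_back([&, i]() {
            // CPU-only work: each thread writes to its own slot
            patches[i] = llama_vision_patches_init(ctx, imgs[i]);
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return patches;
}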
Note that we should also consider the case of Whisper in the context of this abstraction. The whisper model takes raw input audio in PCM format, which is first pre-processed into a mel spectrogram. This pre-processing step, similar to the image pre-processing for CLIP and the text tokenization in text models, is performed on the CPU and can be multi-threaded. Of course, any of the three types of pre-processings could be implemented on the GPU with enough effort, but the important aspect is that this pre-processing can be done in parallel for different inputs and once computed, can be reused with different contexts.
In all cases, the pre-processed input is passed to the transformer graph and the first step is always to convert this input into embeddings. For text, this conversion is trivial: ggml_get_rows(w, tokens). For Whisper, this process involves a couple of convolutions of the mel spectrogram. For CLIP, this appears to be again a convolution operator applied to the pre-processed input (the image patches) in order to obtain the initial embeddings.
All these conversions of the pre-processed input (tokens, mel, patches) into the initial embeddings should be implemented in a single place: build_inp_embd().
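To make the idea concrete, here is a minimal sketch (not code from this PR) of what such a unified dispatch could look like using existing ggml ops; the enum, parameter names and conv strides/paddings are invented for illustration, and biases, reshapes and normalization are omitted:

#include "ggml.h"

// illustrative only: a single entry point that converts the pre-processed
// input (text tokens, mel spectrogram, image patches) into initial embeddings
enum inp_embd_type { INP_TEXT, INP_MEL, INP_PATCHES };

static struct ggml_tensor * build_inp_embd(
        struct ggml_context * ctx,
        enum inp_embd_type    type,
        struct ggml_tensor  * inp,       // I32 tokens, F32 mel, or F32 image
        struct ggml_tensor  * tok_embd,  // token embedding matrix (text)
        struct ggml_tensor  * conv1_w,   // first conv weight (Whisper / CLIP)
        struct ggml_tensor  * conv2_w,   // second conv weight (Whisper)
        int                   patch_size) {
    switch (type) {
        case INP_TEXT:
            // trivial: one embedding row per token
            return ggml_get_rows(ctx, tok_embd, inp);
        case INP_MEL:
            // Whisper-style: two 1D convolutions + GELU over the mel spectrogram
            return ggml_gelu(ctx, ggml_conv_1d(ctx, conv2_w,
                   ggml_gelu(ctx, ggml_conv_1d(ctx, conv1_w, inp, 1, 1, 1)),
                   2, 1, 1));
        case INP_PATCHES:
            // CLIP-style: a single 2D convolution with stride == patch size
            return ggml_conv_2d(ctx, conv1_w, inp, patch_size, patch_size, 0, 0, 1, 1);
    }
    return NULL;
}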
I agree that the analogy of "tokenization" in the context of vision models is the conversion of "images -> patches". So the patches could be considered as "image tokens" and it seems reasonable to have a separate function to create patches
Makes sense then. I realized that I had always associated the notion of "token" with "text", but a quick Google search tells me that: "In LLMs, a token is a basic unit of input or output [...]"
In that sense, I would propose calling it llama_vision_img_tokens (though it can be a bit confusing, because the user may expect it to be a std::vector due to the plural "tokens"):
// Structure representing the basic input unit of the vision model
// This can be a processed image or slices of images under the hood
struct llama_vision_img_tokens;

// The user must reserve N tokens in the tokenized text prompt for each image
int32_t llama_vision_get_n_tokens(const llama_vision_img_tokens * img_tokens);
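For illustration, the user-side bookkeeping could look roughly like the sketch below; the helper and the placeholder token id are made up, not part of the proposal:

#include "llama.h" // assuming the proposed API lands here
#include <vector>

// reserve one prompt position per image "token"; tok_img_placeholder and this
// helper are illustrative only
static void add_image_placeholders(
        std::vector<llama_token>      & prompt,
        const llama_vision_img_tokens * img_tokens,
        llama_token                     tok_img_placeholder) {
    const int32_t n = llama_vision_get_n_tokens(img_tokens);
    // the decoder would later substitute these positions with the image embeddings
    prompt.insert(prompt.end(), (size_t) n, tok_img_placeholder);
}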
@ngxson Sorry about the delay. I've been able to "force" support for mllama using the latest vision API, that is, get an example working. I'm now going to iterate on this and try to figure out how cross-attention will work. Just wanted to let you know that some progress is being made. There is an issue I'm having with the vocab size which I'm not exactly sure how to handle. If anyone has some thoughts on this, please let me know.
@danbev No worries, I was busy with minicpm-v too. It's still not fully working (inference works, but the llava-uhd preprocessor is missing). Will have a look at your implementation of mllama very soon.
So, the minicpm-v template is more complicated because it contains both the image and all the slices. Here is what it looks like in ...
To get rid of this complication, my idea is to have the embeddings of these tokens (...). This will make this formatting transparent to the text tokenizer, but will require the embeddings of these tokens to be stored as one-hot vectors in the vision model (of course, we can use ...).
Ok so I managed to get minicpm-v kinda working out of the box with the API (no changes to user-space code are required). Upon giving it the Windows XP wallpaper Bliss, it says:
It currently operates with a resized version of the image (like llava), so the performance will be bad for bigger images (with more details). I'll get llava-uhd to work, which breaks the image into slices and thus allows the LLM to "see" the image at different zoom levels, preserving details.
@agNihit928 I think something got buggy when I rebased onto the latest master; you can maybe go back to c3a654c to see if it works.
Sure @ngxson |
This PR is only tested with SmolVLM 500M: https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct If you're using another model, I don't know whether it will work.
Btw, a small reminder so I don't forget:
Important
Please do NOT upload ggufs produced via this PR to the internet. People don't know how to use them and they will complain, which is very annoying!
@AIWintermuteAI
Ah, interesting! |
Absolutely, I'm not sharing anything, since I can't even get it to work yet xD
Try another image / format / resolution. I'd recommend you pinpoint the problem on your side first, to prevent spamming this thread with too much data. And again, nothing is guaranteed to work. This is a WIP. (I hid your comments because they take too much space and make it hard for me to follow.)
Sure, no worries! I'll use collapsible text next time I need to post large logs, thanks for the reminder. I'll try testing with some more images and I guess see what can be done about ...
Update:
Then everything works!
OK so Phi-4-multimodal-instruct is a bit more messy. Traditional vision models are simple: just 2 separate transformers, one for the vision encoder and one for the language decoder. However, on Phi-4, embedding data from the vision/audio encoder must also be processed by a dedicated LoRA adapter applied on top of the language decoder.
Very technical details
Normal vision models:
flowchart TD
image --> vision_transformer
vision_transformer[[vision_transformer]] --> embd_input
text_input --> embd_input
embd_input --> text_transformer[[text_transformer]]
text_transformer --> text_output
Phi-4 multimodal:
flowchart TD
image --> vision_transformer[[vision_transformer]]
vision_transformer --> embd_input
audio --> audio_transformer[[audio_transformer]]
audio_transformer --> embd_input
text_input --> embd_input
embd_input --> text_transformer
subgraph text_transformer
vision_LoRA[[vision_LoRA]]
audio_LoRA[[audio_LoRA]]
base_model[[base_model]]
end
text_transformer --> text_output
Diagram from the paper:
For now, I've been able to convert only the text/language part. Turns out, it's just a simple Phi-4-mini-instruct under the hood, so nothing interesting. This is also mentioned in the paper:
Update: the LoRA part is very complicated to implement right now, so it will be left for dedicated research / a PR in the future.
revert Phi-4-mm since we cannot support LoRA for now, too complicated
Hello, is the qwen2.5 vl conversion script from raw safetensors into GGUF supported now? Also curious what the standard way is to support a new model in convert_hf_to_gguf.py; it looks a little bit tricky since it needs to handle very specific tensor names for various model archs.
Fix #8010
Supersede #9687
Important
Please do NOT upload ggufs produced via this PR to the internet. People don't know how to use them and they will complain, which is very annoying!
Then,
Goals of this PR:
- llama_vision ...
- ... Processor class on HF library

Things that will be done in follow-up PRs: