
llama : second attempt to refactor vision API #11292

Draft · wants to merge 29 commits into base: master

Conversation


@ngxson ngxson commented Jan 18, 2025

Fix #8010

Supersede #9687

Important

Please do NOT upload gguf produced via this PR on the internet. People don't know how to use it and they will complain, very annoying!

Then,

cmake --build build -j --target llama-vision
./build/bin/llama-vision -m ../models/llava-1.5-7b-hf/model.gguf --image ../models/bliss.png

# The image showcases a lush green field with a hill in the background. In the foreground, there is a large,
# bright, and vibrant green field with a Microsoft Windows XP desktop screen, possibly representing a
# screensaver, superimposed onto the scene. The field is expansive and covers most of

Goals of this PR:

  • Have the first version of a public API for llama_vision
  • Support llava, mobilevlm, minicpm-v 2.6, SmolVLM
  • See how the API can be adapted for encoder-decoder models like llama 3.2 vision (so we can add it soon)
  • Add an API to format the chat, equivalent to the Processor class in the HF library
  • See how quantization affects performance

Things that will be done in follow-up PRs:

  • Models with encoder-decoder arch like llama 3.2 vision
  • GPU support
  • Better image processing function: faster resize function, maybe even abstract out the image transformations and optimize it (example: if we run resize twice, better to detect that and only run it once)
  • Further clean up the mess in convert-hf-to-gguf python script

@github-actions bot added the python (python script changes) and server labels on Jan 18, 2025

ngxson commented Jan 19, 2025

Hi @ggerganov @slaren , I would like to ask for an early review from you before proceeding further.

What will be interesting to discuss here is the usage of the new API, as demonstrated in the newly added llama-vision example. The idea is:

  • Call llama_vision_encode for each image (we don't support batching for now, to simplify the implementation)
  • Then get the output embedding ggml_tensor, add it to a llama_batch, and llama_decode it (see the sketch below)
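
A rough sketch of that flow, using the function names from this PR (the exact signatures are approximate, and load_image is only a placeholder for however the user obtains pixels):

// sketch only: signatures approximate, load_image() is a placeholder for user code
llama_vision_bitmap * bmp = load_image("bliss.png");

// preprocess the image on the CPU (resize / pad / slice)
llama_vision_patches * p = llama_vision_patches_init(ctx, bmp);

// encode one image at a time (no batching yet)
if (llama_vision_encode(ctx, p) != 0) { /* handle error */ }

// fetch the resulting embeddings and splice them into the decoder stream
struct ggml_tensor * embd  = llama_vision_get_output_tensor(ctx);
struct llama_batch   batch = llama_batch_get_one_from_tensor(embd, n_past, seq_id);
llama_decode(ctx, batch);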

I'm already able to get llava and mobilevlm working with llama-vision and convert_hf_to_gguf.py (for minicpm-v, I'm still struggling because the conversion is not straightforward)

Things that are different from the initial discussion in #8010 :

  • I added a helper function llama_batch_get_one_from_tensor for creating the batch from a tensor, with the appropriate n_past (for placing these tokens in the correct position in the chat template) and seq_id for future usage in the server.
  • llama_vision_patches actually contains slices of the image, not patches, as explained in llava-uhd. The patches are actually produced in clip_image_build_graph by doing a ggml_conv_2d. I think I'll need to rename it to llama_vision_slices, but I actually prefer a more appropriate name like llama_vision_preprocessed_img since we do more than just slicing (i.e. resize, padding, etc.) - feel free to suggest if you have any ideas.

And things that are still messy and will need more works:

  1. Naming: most functions are still prefixed with clip_ and I don't know whether I should prefix everything with llama_vision_clip_ or not. Please let me know your preference.
  2. Chat template support: we may need to introduce a new API that wraps llama_chat_apply_template, much like how transformers has a Processor class that wraps around the Tokenizer.
  3. I'm not sure how this API will be adapted for encoder-decoder archs like llama 3.2 vision. In theory, llama_vision_get_output_tensor should become a no-op, but judging from this implementation, it's still needed. @danbev do you have any ideas?

I would love to hear your opinions about this. Thank you!

Comment on lines +862 to +873
if (ctx.ctx_ggml) {
ggml_free(ctx.ctx_ggml);
}
ggml_init_params params = {
/*.mem_size =*/ ggml_tensor_overhead(),
/*.mem_buffer =*/ NULL,
/*.no_alloc =*/ true,
};
ctx.ctx_ggml = ggml_init(params);
ctx.output = ggml_dup_tensor(ctx.ctx_ggml, output_node);
ggml_backend_alloc_ctx_tensors_from_buft(ctx.ctx_ggml, ctx.model->buft);
ggml_backend_tensor_copy(output_node, ctx.output);
ngxson (Collaborator, Author):

@slaren Not sure if there is a better way, but I'm using a hacky solution here.

Without a dedicated context (and ggml_backend_tensor_copy), the underlying buffer is reallocated before the next llama_decode, rendering the data unusable.

Member:

If the vision part uses the same scheduler as the llama_context, that's unavoidable. You could pre-allocate the tensor in a different buffer to avoid the copy, but that's an optimization that can be done later.

Member:

If we have a separate encoder context for the clip model, the decoder context could reference tensors from it directly. They would be interpreted as inputs for the decoder.


slaren commented Jan 20, 2025

> llama_vision_patches actually contains slices of image, not patches, as explained in llava-uhd. The patches are actually produced in clip_image_build_graph by doing a ggml_conv_2d. I think I'll need to rename it to llama_vision_slices, but I actually prefer a more appropriate name like llama_vision_preprocessed_img since we do more than just slicing it (i.e. resize, padding, etc) - feel free to suggest if you have any ideas.

I am just wondering, is there any reason to expose the patches/slices to the user at all? Can the user do anything with the patches other than just immediately call llama_vision_encode and throw them away? If not, then maybe that could be hidden entirely from the user and llama_vision_encode could take directly an image.


danbev commented Jan 20, 2025

@ngxson I'll take a closer look at this today, specifically at how this could work with a cross-attention model like Llama 3.2 Vision 👍

One thing that is related to this work is something we discussed about how these models should be provided. I initially thought that creating a single .gguf for Llama 3.2 which contained both the vision encoder and the language model would be the way to go, but as can be read in the linked discussion, having separate models is probably a better solution. It would be great to get some clarification on this and on whether vision encoders should be separate .gguf models.
I'm looking at updating the conversion for Llama 3.2 and making changes to convert_hf_to_gguf.py to produce 2 models (vision encoder and language model) instead of one. I'd like to try this out with this latest vision API proposal, but I'd prefer to know what the model(s) should look like before proceeding, so as not to waste time.


ngxson commented Jan 20, 2025

@slaren In my first proposal, I made llama_vision_encode directly accept an image. But then I decided to split it into preprocess + encode because:

  • The most important reason is that the user will be able to retrieve the number of tokens that the image occupies (this can vary depending on image size, in the case of llava-uhd). This should be done before any decode/encode so that the user can leave the appropriate positions for the image after the tokenizing step. This is also similar to the Processor class in HF transformers, which returns a preprocessed image and the tokenized prompt with the correct number of "placeholder" tokens for the image embeddings.
  • The second reason is that by making this a dedicated function, it's easier to manage error codes. This is mostly because this function works at the pixel level, not the tensor level.
  • The third reason is that this preprocessing is thread-safe, so for example llama-server can do this step in the HTTP thread, much like how llama_tokenize is currently done in the HTTP thread. (A sketch of this call order follows below.)
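
As an illustration only (the n_tokens getter here is the one proposed later in this thread; its exact name and argument are not final):

// preprocess first - thread-safe, can run in e.g. the HTTP thread of llama-server
llama_vision_patches * p = llama_vision_patches_init(ctx, bmp);

// ask how many embedding "tokens" this image will occupy (varies with image size for llava-uhd)
int32_t n_img_tokens = llama_vision_get_n_tokens(p);  // name/argument not final

// tokenize the text prompt, reserving n_img_tokens placeholder positions for the image,
// then run llama_vision_encode + llama_decode on the main thread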


ngxson commented Jan 20, 2025

Btw, I have repeatedly mentioned Processor, so I think it's better to give an example of how it works: https://gist.github.com/ngxson/ca46c72f0cc7b441c30dd85c2a24ee62

@ggerganov ggerganov left a comment


Adding some thoughts that I have so far.

Continuing along the idea of having separate models and contexts for the encoder and the decoder, I think that with a proper llama_batch abstraction we can have the following API:

// vision
patches0 = llama_vision_tokenize(ctx_enc_v, img0);
patches1 = llama_vision_tokenize(ctx_enc_v, img1);

llama_batch_add_image(batch_enc_v, patches0);
llama_batch_add_image(batch_enc_v, patches1);

llama_encode(ctx_enc_v, batch_enc_v);

embd_enc_v = llama_get_embeddings(ctx_enc_v);

// audio
mel0 = llama_audio_tokenize(ctx_enc_a, audio0);
mel1 = llama_audio_tokenize(ctx_enc_a, audio1);

llama_batch_add_audio(batch_enc_a, mel0);
llama_batch_add_audio(batch_enc_a, mel1);

llama_encode(ctx_enc_a, batch_enc_a);

embd_enc_a = llama_get_embeddings(ctx_enc_a);

// text + vision + audio
tokens0 = llama_tokenize(ctx_dec, tokens0);
tokens1 = llama_tokenize(ctx_dec, tokens1);

llama_batch_add_text      (batch_dec, tokens0);
llama_batch_add_embd_image(batch_dec, embd_enc_v);
llama_batch_add_embd_audio(batch_dec, embd_enc_a);
llama_batch_add_text      (batch_dec, tokens1);

llama_decode(ctx_dec, batch_dec);

For cross-attention models such as Llama 3.2 Vision and Whisper, the decoding context ctx_dec could be initialized with a reference to the encoder context:

llama_context_params cparams_dec;
cparams_dec.ctx_cross[0] = ctx_enc_v;
cparams_dec.ctx_cross[1] = ctx_enc_a;

Edit: extended the example with audio input as well.

Comment on lines 558 to 570
static ggml_cgraph * clip_image_build_graph(clip_context & ctx, int batch_size, clip_image_size & image_size) {
auto & model = *ctx.model;
auto & hparams = ctx.model->hparams;

const int hidden_size = hparams.hidden_size;
const int n_head = hparams.n_head;
const int d_head = hidden_size / n_head;
const int patch_size = hparams.patch_size;
const float eps = hparams.eps;
const int num_patches = ((image_size.width / patch_size) * (image_size.height / patch_size));
const int num_positions = num_patches + (model.class_embedding ? 1 : 0);

LLAMA_LOG_DEBUG("%s: num_patches = %d\n", __func__, num_patches);
Member:

The clip graph should be constructed as any other graph in src/llama.cpp, llm_build_context.

ngxson (Collaborator, Author):

I'm not sure how to do this right now, as I can't see how I can re-use the existing build_* functions to make the cgraph of vision models "blend in" with the rest of llm_build_context.

But what I did so far is to make an equivalent called llama_vision_graph_builder. This is meant to be a temporary solution, to simplify the migration in the near future.

Could you please have a look at my llama_vision_graph_builder to see how it can be merged into llm_build_context? Thanks!

delete p;
}

int32_t llama_vision_encode(struct llama_context * ctx, llama_vision_patches * p) {
Member:

Don't think we need separate function - we should be able to reuse llama_encode.

ngxson commented Jan 21, 2025

Hmm I don't think we can do this right now, as it requires llama_batch to also accept image tokens.

Do you think it's ok to keep llama_vision_encode(llama_img_tokens &) and refactor llama_batch later on?

Comment on lines 894 to 902
struct llama_vision_patches * llama_vision_patches_init(
struct llama_context * ctx,
llama_vision_bitmap * bmp) {
clip_context & vctx = ctx->vctx;
if (vctx.model->hparams.arch == VISION_ARCH_MINICPMV) {
return new llama_vision_patches(clip_image_preprocess_minicpmv(vctx, *bmp));
}
return new llama_vision_patches(clip_image_preprocess(vctx, *bmp));
}
ggerganov commented Jan 20, 2025

I agree that the analogy of "tokenization" in the context of vision models is the conversion of "images -> patches". So the patches could be considered as "image tokens" and it seems reasonable to have a separate function to create patches, since this would have to be performed on the CPU.

> I am just wondering, is there any reason to expose the patches/slices to the user at all? Can the user do anything with the patches other than just immediately call llama_vision_encode and throw them away? If not, then maybe that could be hidden entirely from the user and llama_vision_encode could take directly an image.

Even though the user cannot explicitly operate with the patches, it seems to make sense to expose this in order to be able to multi-thread the pre-processing step.

Note that we should also consider the case of Whisper in the context of this abstraction. The whisper model takes raw input audio in PCM format, which is first pre-processed into a mel spectrogram. This pre-processing step, similar to the image pre-processing for CLIP and the text tokenization in text models, is performed on the CPU and can be multi-threaded. Of course, any of the three types of pre-processing could be implemented on the GPU with enough effort, but the important aspect is that this pre-processing can be done in parallel for different inputs and, once computed, can be reused with different contexts.

In all cases, the pre-processed input is passed to the transformer graph and the first step is always to convert this input into embeddings. For text, this conversion is trivial - ggml_get_rows(w, tokens). For Whisper, this process involves a couple of convolutions of the mel spectrogram:

https://github.com/ggerganov/whisper.cpp/blob/99b011a9f5e63f71201bfa583250506453a7b995/src/whisper.cpp#L1904-L1918

For CLIP, this appears to be again a convolution operator applied to the pre-processed input (the image patches) in order to obtain the initial embeddings:

https://github.com/ngxson/llama.cpp/blob/4a7ab89d7593ccb89f80e6e118875ee0b3ede3c7/src/llama-vision.cpp#L581-L616

All these conversions of the pre-processed input (tokens, mel, patches) into the initial embeddings should be implemented in a single place: build_inp_embd().
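
Schematically, something along these lines (pseudocode only, not the actual llm_build_context code; the LLM_INPUT_* enum and the build_conv_* helpers are made up for illustration):

// pseudocode: a single build_inp_embd() dispatching on the type of pre-processed input
ggml_tensor * build_inp_embd(llm_input_type type) {
    switch (type) {
        case LLM_INPUT_TOKENS:  // text: trivial embedding lookup
            return ggml_get_rows(ctx0, model.tok_embd, inp_tokens);
        case LLM_INPUT_MEL:     // whisper-style: convolutions over the mel spectrogram
            return build_conv_mel(inp_mel);
        case LLM_INPUT_PATCHES: // clip-style: convolution over the image patches (+ class/pos embeddings)
            return build_conv_patches(inp_patches);
    }
    return nullptr;
}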

ngxson commented Jan 20, 2025

> I agree that the analogy of "tokenization" in the context of vision models is the conversion of "images -> patches". So the patches could be considered as "image tokens" and it seems reasonable to have a separate function to create patches

Makes sense then. I realized that I had always associated the notion of "token" with "text", but a quick Google search tells me that: "In LLMs, a token is a basic unit of input or output [...]"

In that sense, I would propose calling it llama_vision_img_tokens (though it can be a bit confusing, because the user may expect it to be a std::vector due to the plural "tokens"):

// Structure represents the basic input unit of vision model
// This can be a processed image or slices of images under the hood
struct llama_vision_img_tokens;

// User must reserve N number of tokens in tokenized text prompt for each image
int32_t llama_vision_get_n_tokens(const llama_vision_img_tokens * img_tokens);


danbev commented Jan 22, 2025

@ngxson Sorry about the delay. I've been able to "force" support for mllama using the latest vision API, that is, get an example working. I'm now going to iterate on this and try to figure out how cross attention will work. Just wanted to let you know that some progress is being made.

There is an issue I'm having with the vocab size which I'm not exactly sure how to handle. If anyone has some thoughts around this please let me know.


ngxson commented Jan 22, 2025

@danbev No worries, I was busy with minicpm-v too. It's still not fully working (inference works, but the llava-uhd preprocessor is missing). I will have a look at your implementation of mllama very soon.


ngxson commented Jan 22, 2025

So, the minicpm-v template is more complicated because it contains both the image and all the slices. Here is what it looks like in minicpmv-cli:

<image> (if no slice, we only have one image) </image><slice><image> (first slice) </image><image> (second slice) </image> .... (n-th slice) </slice>

To get rid of this complication, my idea is to have the embeddings of these tokens (<image>, </image>, <slice> and </slice>) appended to the output tensor returned from llama_vision_encode.

This will make the formatting transparent to the text tokenizer, but it will require the embeddings of these tokens to be stored as one-hot vectors in the vision model (of course we can use ggml_get_rows to get them, but it will be quite messy). A rough sketch of the idea is below.
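
For illustration only (names like model.mm_tok_embd, id_img_open, id_img_close and slice_embd are hypothetical, and the concat dimension assumes the usual [n_embd, n_tokens] layout):

// sketch: wrap the slice embeddings with the marker-token embeddings inside the vision graph
// model.mm_tok_embd would hold the embeddings of <image>, </image>, <slice>, </slice>
ggml_tensor * img_open  = ggml_get_rows(ctx0, model.mm_tok_embd, id_img_open);   // [n_embd, 1]
ggml_tensor * img_close = ggml_get_rows(ctx0, model.mm_tok_embd, id_img_close);  // [n_embd, 1]

// concatenate along the token dimension (dim 1): <image> ... slice embeddings ... </image>
ggml_tensor * out = ggml_concat(ctx0, img_open, slice_embd, 1);
out               = ggml_concat(ctx0, out, img_close, 1);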


ngxson commented Jan 23, 2025

Ok so I managed to get minicpm-v kinda working out of the box with the API (no changes to user-space code are required).

Upon giving it win XP wallpaper bliss, it says: I see a serene landscape featuring a vast expanse of green grass under a clear blue sky

It currently operates on a resized version of the image (like llava), so the performance will be bad for bigger images (with more details). I'll get llava-uhd to work, which breaks the image into slices and thus allows the LLM to "see" the image at different zoom levels, preserving details.


ngxson commented Feb 15, 2025

@agNihit928 I think something got buggy when I rebased onto the latest master; you could go back to c3a654c to see if it works.

@agNihit928

Sure @ngxson
Will check it out
Thanks

(Two comments from @AIWintermuteAI were marked as outdated.)


ngxson commented Mar 1, 2025

This PR is only tested with SmolVLM 500M: https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct

If you're using another model, I don't know.


ngxson commented Mar 1, 2025

Btw, a small reminder so I don't forget:

Important

Please do NOT upload gguf produced via this PR on the internet. People don't know how to use it and they will complain, very annoying!

@agNihit928

@AIWintermuteAI
Based on my testing, I was able to generate the GGUF files for both the 256M and the 500M models (from the original Hugging Face repos) with the mentioned branch, i.e., c3a654c

@AIWintermuteAI

Ah, interesting! I was using https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct, which is "supposed to be" the same, but has a different (broken) config.

@AIWintermuteAI

Absolutely, I'm not sharing anything, since I can't even get it to work yet xD

(Two comments from @AIWintermuteAI were marked as resolved.)


ngxson commented Mar 1, 2025

Try another image / format / resolution. I'd recommend you pinpoint the problem on your side first, to avoid spamming this thread with too much data.

And again, nothing is guaranteed to work. This is a WIP.

(I hid your comments because they take too much space and make the thread hard for me to follow.)

@AIWintermuteAI

Sure, no worries! I'll use collapsible text next time I need to post large logs, thanks for the reminder.
It's a 300x241 pixel image I found when searching for the bliss wallpaper on Google. Perhaps you can share your testing sample here?
And again, no worries, I totally understand this is a WIP - my comments are just feedback for you and other people who might be testing this, not a nudge :)

I'll try testing with some more images, and I guess see what can be done about ValueError: Can not map tensor 'model.text_model.embed_tokens.weight' on the latest commit here. Looks like the .weight part is normally removed, but for some reason it is not.

@AIWintermuteAI

AIWintermuteAI commented Mar 1, 2025

Update:
I needed to include the instructions and special tokens in the prompt, e.g.

./build/bin/llama-vision --image bliss.png -m ../SmolVLM-500M-Instruct/SmolVLM-500M-Instruct-F16.gguf -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<img_placement>\nwhat do you see?<|im_end|>\n<|im_start|>assistant\n"

Then everything works!

eval text batch (14 tokens)
eval image batch (64 embeddings)
eval text batch (26 tokens)
prompt processed, 90 tokens
The sky is a brilliant blue, dotted with fluffy white clouds that look like cotton candy. The sun is shining, casting a warm glow across the landscape. To the left, there's a small hill, covered in green grass and dotted with wildflowers. The hill is dotted with trees, and the leaves are a rich, dark


ngxson commented Mar 1, 2025

OK, so Phi-4-multimodal-instruct is a bit more messy.

Traditional vision models are simple: just 2 separate transformers, one for the vision encoder and one for the language decoder. However, on Phi-4, the embedding data from the vision/audio encoder must also be processed using a dedicated LoRA adapter applied on top of the language decoder.

Very technical details

Normal vision models:

flowchart TD
  image --> vision_transformer
  vision_transformer[[vision_transformer]] --> embd_input
  text_input --> embd_input
  embd_input --> text_transformer[[text_transformer]]
  text_transformer --> text_output

Phi-4 multimodal:

flowchart TD
  image --> vision_transformer[[vision_transformer]]
  vision_transformer --> embd_input
  audio --> audio_transformer[[audio_transformer]]
  audio_transformer --> embd_input
  text_input --> embd_input
  embd_input --> text_transformer
  subgraph text_transformer
    vision_LoRA[[vision_LoRA]]
    audio_LoRA[[audio_LoRA]]
    base_model[[base_model]]
  end
  text_transformer --> text_output

Diagram from the paper: (image omitted)

For now, I've only been able to convert the text/language part. It turns out it's just a simple Phi-4-mini-instruct under the hood, so nothing interesting.

This is also mentioned in the paper: (screenshot omitted)

Will see if it's easy to re-implement that LoRA + projectors. Otherwise, we will need to delay Phi-4-multimodal for later.

Update: the LoRA part is very complicated to implement right now, so it will be left for dedicated research / a PR in the future.

revert Phi-4-mm since we cannot support LoRA for now, too complicated

@lucasjinreal

Hello, is the qwen2.5 vl conversion script from raw safetensors into GGUF supported now? Also, I'm curious what the standard way to support a new model in convert_hf_to_gguf.py is; it looks a little tricky, since it needs to handle very specific tensor names for each model arch.

Labels: examples, python (python script changes), server
Successfully merging this pull request may close these issues.

server: Bring back multimodal support
9 participants