[Single File] Add GGUF support #9964
Changes from all commits
New documentation page added by the PR (@@ -0,0 +1,70 @@):
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# GGUF
The GGUF file format is typically used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and supports a variety of block-wise quantization options. Diffusers supports loading checkpoints that were pre-quantized and saved in the GGUF format via `from_single_file` loading with Model classes. Loading GGUF checkpoints via Pipelines is currently not supported.

The following example will load the [FLUX.1 DEV](https://huggingface.co/black-forest-labs/FLUX.1-dev) transformer model using the GGUF Q2_K quantization variant.

Before starting, please install gguf in your environment:

```shell
pip install -U gguf
```

Since GGUF is a single-file format, use [`~FromSingleFileMixin.from_single_file`] to load the model and pass in the [`GGUFQuantizationConfig`].

When using GGUF checkpoints, the quantized weights remain in a low-memory `dtype` (typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`.

The functions used for dynamic dequantization are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the PyTorch ports of the original [`numpy`](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by [compilade](https://github.com/compilade).
```python
import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")
```
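`from_single_file` also accepts a local path to a GGUF file. A minimal variant of the example above, assuming you first download the same Q2_K checkpoint with `huggingface_hub`:

```python
import torch
from huggingface_hub import hf_hub_download

from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

# Download the GGUF checkpoint once, then load it from disk.
ckpt_path = hf_hub_download("city96/FLUX.1-dev-gguf", filename="flux1-dev-Q2_K.gguf")
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
```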
## Supported Quantization Types

- BF16
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q2_K
- Q3_K
- Q4_K
- Q5_K
- Q6_K
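If you are not sure which quantization type a GGUF checkpoint uses, you can inspect it with the `gguf` package before loading. A minimal sketch, assuming the Q2_K checkpoint from the example above has been downloaded locally as `flux1-dev-Q2_K.gguf`:

```python
from gguf import GGUFReader

reader = GGUFReader("flux1-dev-Q2_K.gguf")  # assumed local path

# Collect the quantization types used by the tensors in the file.
quant_types = {tensor.tensor_type.name for tensor in reader.tensors}
print(quant_types)  # e.g. {"Q2_K", "Q5_K", "F32"}
```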
The following changes update the single-file checkpoint detection utilities (`CHECKPOINT_KEY_NAMES` and `infer_diffusers_model_type`) so they can match multiple candidate keys:
```diff
@@ -81,8 +81,14 @@
     "open_clip_sd3": "text_encoders.clip_g.transformer.text_model.embeddings.position_embedding.weight",
     "stable_cascade_stage_b": "down_blocks.1.0.channelwise.0.weight",
     "stable_cascade_stage_c": "clip_txt_mapper.weight",
-    "sd3": "model.diffusion_model.joint_blocks.0.context_block.adaLN_modulation.1.bias",
-    "sd35_large": "model.diffusion_model.joint_blocks.37.x_block.mlp.fc1.weight",
+    "sd3": [
+        "joint_blocks.0.context_block.adaLN_modulation.1.bias",
+        "model.diffusion_model.joint_blocks.0.context_block.adaLN_modulation.1.bias",
+    ],
+    "sd35_large": [
+        "joint_blocks.37.x_block.mlp.fc1.weight",
+        "model.diffusion_model.joint_blocks.37.x_block.mlp.fc1.weight",
+    ],
     "animatediff": "down_blocks.0.motion_modules.0.temporal_transformer.transformer_blocks.0.attention_blocks.0.pos_encoder.pe",
     "animatediff_v2": "mid_block.motion_modules.0.temporal_transformer.norm.bias",
     "animatediff_sdxl_beta": "up_blocks.2.motion_modules.0.temporal_transformer.norm.weight",
```

**Comment** (on the `"sd3"` entry): Need to make this change because SD3/3.5 GGUF single-file checkpoints use different keys than the original model from SAI.

**Comment:** Anything special for Flux?
```diff
@@ -542,13 +548,20 @@ def infer_diffusers_model_type(checkpoint):
     ):
         model_type = "stable_cascade_stage_b"
 
-    elif CHECKPOINT_KEY_NAMES["sd3"] in checkpoint and checkpoint[CHECKPOINT_KEY_NAMES["sd3"]].shape[-1] == 9216:
-        if checkpoint["model.diffusion_model.pos_embed"].shape[1] == 36864:
+    elif any(key in checkpoint for key in CHECKPOINT_KEY_NAMES["sd3"]) and any(
+        checkpoint[key].shape[-1] == 9216 if key in checkpoint else False for key in CHECKPOINT_KEY_NAMES["sd3"]
+    ):
+        if "model.diffusion_model.pos_embed" in checkpoint:
+            key = "model.diffusion_model.pos_embed"
+        else:
+            key = "pos_embed"
+
+        if checkpoint[key].shape[1] == 36864:
             model_type = "sd3"
-        elif checkpoint["model.diffusion_model.pos_embed"].shape[1] == 147456:
+        elif checkpoint[key].shape[1] == 147456:
             model_type = "sd35_medium"
 
-    elif CHECKPOINT_KEY_NAMES["sd35_large"] in checkpoint:
+    elif any(key in checkpoint for key in CHECKPOINT_KEY_NAMES["sd35_large"]):
         model_type = "sd35_large"
 
     elif CHECKPOINT_KEY_NAMES["animatediff"] in checkpoint:
```
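As the review comment above notes, SD3/3.5 GGUF single-file checkpoints use different key names (without the `model.diffusion_model.` prefix), so the entries become lists of candidate keys and detection checks each one. A self-contained toy illustration of the pattern (the dicts below are stand-ins, not real state dicts):

```python
candidate_keys = [
    "joint_blocks.0.context_block.adaLN_modulation.1.bias",
    "model.diffusion_model.joint_blocks.0.context_block.adaLN_modulation.1.bias",
]

original_layout = {"model.diffusion_model.joint_blocks.0.context_block.adaLN_modulation.1.bias": None}
gguf_layout = {"joint_blocks.0.context_block.adaLN_modulation.1.bias": None}

for checkpoint in (original_layout, gguf_layout):
    # The detection logic accepts a checkpoint if any candidate key is present.
    print(any(key in checkpoint for key in candidate_keys))  # True for both layouts
```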
The following changes update the model loading utilities (`load_state_dict`, `load_model_dict_into_meta`) and add the new GGUF loading helpers:
```diff
@@ -17,6 +17,7 @@
 import importlib
 import inspect
 import os
+from array import array
 from collections import OrderedDict
 from pathlib import Path
 from typing import List, Optional, Union
```
```diff
@@ -26,13 +27,16 @@
 from huggingface_hub.utils import EntryNotFoundError
 
 from ..utils import (
+    GGUF_FILE_EXTENSION,
     SAFE_WEIGHTS_INDEX_NAME,
     SAFETENSORS_FILE_EXTENSION,
     WEIGHTS_INDEX_NAME,
     _add_variant,
     _get_model_file,
     deprecate,
     is_accelerate_available,
+    is_gguf_available,
+    is_torch_available,
     is_torch_version,
     logging,
 )
```
```diff
@@ -139,6 +143,8 @@ def load_state_dict(checkpoint_file: Union[str, os.PathLike], variant: Optional[
     file_extension = os.path.basename(checkpoint_file).split(".")[-1]
     if file_extension == SAFETENSORS_FILE_EXTENSION:
         return safetensors.torch.load_file(checkpoint_file, device="cpu")
+    elif file_extension == GGUF_FILE_EXTENSION:
+        return load_gguf_checkpoint(checkpoint_file)
     else:
         weights_only_kwarg = {"weights_only": True} if is_torch_version(">=", "1.13") else {}
         return torch.load(
```
```diff
@@ -211,13 +217,14 @@ def load_model_dict_into_meta(
             set_module_kwargs["dtype"] = dtype
 
         # bnb params are flattened.
+        # gguf quants have a different shape based on the type of quantization applied
         if empty_state_dict[param_name].shape != param.shape:
             if (
                 is_quantized
                 and hf_quantizer.pre_quantized
                 and hf_quantizer.check_if_quantized_param(model, param, param_name, state_dict, param_device=device)
             ):
-                hf_quantizer.check_quantized_param_shape(param_name, empty_state_dict[param_name].shape, param.shape)
+                hf_quantizer.check_quantized_param_shape(param_name, empty_state_dict[param_name], param)
             else:
                 model_name_or_path_str = f"{model_name_or_path} " if model_name_or_path is not None else ""
                 raise ValueError(
```
```diff
@@ -396,3 +403,78 @@ def _fetch_index_file_legacy(
         index_file = None
 
     return index_file
+
+
+def _gguf_parse_value(_value, data_type):
+    if not isinstance(data_type, list):
+        data_type = [data_type]
+    if len(data_type) == 1:
+        data_type = data_type[0]
+        array_data_type = None
+    else:
+        if data_type[0] != 9:
+            raise ValueError("Received multiple types, therefore expected the first type to indicate an array.")
+        data_type, array_data_type = data_type
+
+    if data_type in [0, 1, 2, 3, 4, 5, 10, 11]:
+        _value = int(_value[0])
+    elif data_type in [6, 12]:
+        _value = float(_value[0])
+    elif data_type in [7]:
+        _value = bool(_value[0])
+    elif data_type in [8]:
+        _value = array("B", list(_value)).tobytes().decode()
+    elif data_type in [9]:
+        _value = _gguf_parse_value(_value, array_data_type)
+    return _value
```
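The integer `data_type` codes in `_gguf_parse_value` correspond to the GGUF metadata value types. A quick way to see the mapping, using the `GGUFValueType` enum from the `gguf` package (which defines the same codes):

```python
from gguf import GGUFValueType

# Print the numeric code next to each GGUF metadata value type.
for value_type in GGUFValueType:
    print(value_type.value, value_type.name)
# 0 UINT8, 1 INT8, 2 UINT16, 3 INT16, 4 UINT32, 5 INT32, 6 FLOAT32,
# 7 BOOL, 8 STRING, 9 ARRAY, 10 UINT64, 11 INT64, 12 FLOAT64
```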
The same hunk also adds `load_gguf_checkpoint`:

```diff
+def load_gguf_checkpoint(gguf_checkpoint_path, return_tensors=False):
+    """
+    Load a GGUF file and return a dictionary of parsed parameters containing tensors, the parsed tokenizer and config
+    attributes.
+
+    Args:
+        gguf_checkpoint_path (`str`):
+            The path to the GGUF file to load.
+        return_tensors (`bool`, defaults to `False`):
+            Whether to read the tensors from the file and return them. Not doing so is faster and only loads the
+            metadata in memory.
+    """
+
+    if is_gguf_available() and is_torch_available():
+        import gguf
+        from gguf import GGUFReader
+
+        from ..quantizers.gguf.utils import SUPPORTED_GGUF_QUANT_TYPES, GGUFParameter
+    else:
+        logger.error(
+            "Loading a GGUF checkpoint in PyTorch requires both PyTorch and GGUF>=0.10.0 to be installed. Please see "
+            "https://pytorch.org/ and https://github.com/ggerganov/llama.cpp/tree/master/gguf-py for installation instructions."
+        )
+        raise ImportError("Please install torch and gguf>=0.10.0 to load a GGUF checkpoint in PyTorch.")
+
+    reader = GGUFReader(gguf_checkpoint_path)
+
+    parsed_parameters = {}
+    for tensor in reader.tensors:
+        name = tensor.name
+        quant_type = tensor.tensor_type
+
+        # if the tensor is a torch supported dtype do not use GGUFParameter
+        is_gguf_quant = quant_type not in [gguf.GGMLQuantizationType.F32, gguf.GGMLQuantizationType.F16]
+        if is_gguf_quant and quant_type not in SUPPORTED_GGUF_QUANT_TYPES:
+            _supported_quants_str = "\n".join([str(type) for type in SUPPORTED_GGUF_QUANT_TYPES])
+            raise ValueError(
+                (
+                    f"{name} has a quantization type: {str(quant_type)} which is unsupported."
+                    "\n\nCurrently the following quantization types are supported: \n\n"
+                    f"{_supported_quants_str}"
+                    "\n\nTo request support for this quantization type please open an issue here: https://github.com/huggingface/diffusers"
+                )
+            )
+
+        weights = torch.from_numpy(tensor.data.copy())
+        parsed_parameters[name] = GGUFParameter(weights, quant_type=quant_type) if is_gguf_quant else weights
+
+    return parsed_parameters
```

**Comment** (on the import check and its error message): do we need to check the gguf version as well? (in addition to `is_gguf_available`)

**Comment:** Agree. Let's always suggest installing the latest stable build of […]

**Comment** (on the `is_gguf_quant` check): We could create a […]
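For reference, a minimal sketch of calling the new helper directly; the import path is assumed from the diff and is internal API, so treat it as illustrative only:

```python
# Assumed internal import path based on the diff above; not public API.
from diffusers.models.model_loading_utils import load_gguf_checkpoint

state_dict = load_gguf_checkpoint("flux1-dev-Q2_K.gguf")  # assumed local file
name, param = next(iter(state_dict.items()))
# Quantized entries are GGUFParameter instances carrying their quant_type;
# F16/F32 tensors come back as plain torch tensors.
print(name, type(param), getattr(param, "quant_type", None))
```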
**Comment:** For GGUF files, I'm thinking it would be nice to allow the user to load the model without necessarily having to specify `quantization_config=GGUFQuantizationConfig(compute_dtype=xxx)`. If we detect that this is a GGUF file, we can set `quantization_config = GGUFQuantizationConfig(compute_dtype=torch.float32)` by default. I'm suggesting this because usually, when you pass a `quantization_config`, it means either that the model is not quantized (bnb) or that the model is quantized (there is a `quantization_config` in the config.json) but we want to change a few arguments. Also, what happens when the user passes a GGUF file without specifying the `quantization_config`?
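A rough sketch of the convenience being proposed here; the helper below is hypothetical and not part of the PR:

```python
import torch

from diffusers import GGUFQuantizationConfig


def _maybe_default_gguf_config(pretrained_model_link_or_path, quantization_config=None):
    # Hypothetical helper: if the checkpoint is a .gguf file and no config was passed,
    # fall back to a default GGUFQuantizationConfig instead of erroring out later.
    if quantization_config is None and str(pretrained_model_link_or_path).endswith(".gguf"):
        quantization_config = GGUFQuantizationConfig(compute_dtype=torch.float32)
    return quantization_config


print(_maybe_default_gguf_config("flux1-dev-Q2_K.gguf"))
```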
**Comment:** Yeah, this is a good point! I think for most users, the entrypoint for GGUF files is going to be through `from_single_file()`, and I agree with the logic you mentioned.
**Comment:** I agree that this is a nice convenience. GGUF does have all the information we need to auto-fetch the config (honestly, it's possible to skip the config altogether), but it would mean that loading semantics would be different for GGUF vs. other quant types such as BnB and TorchAO (assuming these can be supported). GGUF can also be used through `from_pretrained` (assuming quants of diffusers-format checkpoints show up at some point), and we would have to pass a quant config in that case. I understand it's not ideal, but I feel it's better to preserve consistency across the different quant loading methods. @SunMarc if the config isn't passed, you get shape mismatch errors when you hit `load_model_dict_into_meta`, since the quant shapes are different from the expected shapes.
**Comment:** Yeah, I thought about that too, but I think the API for `from_single_file` and `from_pretrained` might just have to be different. It is a bit confusing, but I'm not sure there is a way to make it consistent between `from_single_file` and `from_pretrained` if we also want the same API to be consistent across different quant types. GGUF is a special case here because it has a built-in config. Normally, a single file is just a checkpoint without a config, so you will always have to pass a config (at least I think so, is it? @DN6). So for loading a regular quantized model (e.g. BnB) we can load it with `from_pretrained` without passing a config, but for `from_single_file` we will have to manually pass a config. So I agree with @DN6 here: I think it's more important to make the same API (`from_pretrained` or `from_single_file`) consistent across different quant types, if we have to choose one. But if there is a way to make it consistent between `from_pretrained` and `from_single_file` and across all quant types, that would be great!

**Comment:** Also, I want to know this: do we plan to support quantizing a model in `from_single_file`? @DN6

**Comment:** Would it make sense to at least make the user aware when the passed config and the determined config mismatch, and whether that could lead to unintended consequences? Supporting quantization in the GGUF format (regardless of `from_pretrained()` or `from_single_file()`) would be really nice.

**Comment:** @yiyixuxu Yeah, we can definitely support quantizing a model via single file. For GGUF I can look into it in a follow-up, because we would have to port the quantize functions to torch (the gguf library uses numpy). We could use the gguf library internally to quantize, but it's quite slow since we would have to move tensors off the GPU, convert to numpy, and then quantize. I think with TorchAO I'm pretty sure it would work just out of the box. You would have to save it with `save_pretrained` though, since we don't support serializing single-file checkpoints.
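For context, a minimal sketch of the numpy-based quantization path in the `gguf` package referred to above (assumes `gguf>=0.10.0`; the weight tensor and quant type are arbitrary):

```python
import torch
from gguf import GGMLQuantizationType
from gguf.quants import dequantize, quantize

weight = torch.randn(64, 256, dtype=torch.float32)

# The gguf reference implementation works on CPU numpy arrays, hence the device round trip.
packed = quantize(weight.cpu().numpy(), GGMLQuantizationType.Q8_0)
restored = torch.from_numpy(dequantize(packed, GGMLQuantizationType.Q8_0))

print(packed.shape, packed.dtype)      # packed uint8 layout, smaller than the original
print(restored.shape, restored.dtype)  # back to (64, 256) float32
```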
**Comment:** So, what I am hearing is that saving a GGUF-quantized model would be added in a follow-up PR? That is also okay, but it could be quite an enabling factor for the community. I think the porting option is preferable. You mean serializing with `torchao` but with quantization configs similar to the ones provided in GGUF?