[Single File] Add GGUF support #9964

Merged
merged 51 commits into main from gguf-support
Dec 17, 2024
Changes from all commits

51 commits
b5eeaa4
update
DN6 Oct 21, 2024
71897b1
update
DN6 Oct 21, 2024
89ea1ee
update
DN6 Oct 24, 2024
f0bcd94
update
DN6 Oct 24, 2024
60d1385
update
DN6 Oct 29, 2024
22ed0b0
update
DN6 Oct 31, 2024
2e6d340
update
DN6 Nov 3, 2024
b5f927c
update
DN6 Nov 11, 2024
b9666c7
Merge branch 'main' into gguf-support
DN6 Nov 11, 2024
6dc5d22
update
DN6 Nov 13, 2024
428e44b
update
DN6 Nov 15, 2024
d7f09f2
update
DN6 Nov 19, 2024
1649936
update
DN6 Nov 19, 2024
28d3a64
update
DN6 Nov 19, 2024
c34a451
update
DN6 Nov 21, 2024
84493db
update
DN6 Nov 21, 2024
50bd784
update
DN6 Nov 21, 2024
8f604b3
Merge branch 'main' into gguf-support
DN6 Dec 3, 2024
afd5d7d
update
DN6 Dec 4, 2024
e1b964a
Merge branch 'main' into gguf-support
sayakpaul Dec 4, 2024
0ed31bc
update
DN6 Dec 4, 2024
af381ad
update
DN6 Dec 4, 2024
52a1bcb
update
DN6 Dec 4, 2024
66ae46e
Merge branch 'gguf-support' of https://github.com/huggingface/diffuse…
DN6 Dec 4, 2024
67f1700
update
DN6 Dec 4, 2024
8abfa55
update
DN6 Dec 5, 2024
d4b88d7
update
DN6 Dec 5, 2024
30f13ed
update
DN6 Dec 5, 2024
9310035
update
DN6 Dec 5, 2024
e9303a0
update
DN6 Dec 5, 2024
e56c266
update
DN6 Dec 5, 2024
1209c3a
Update src/diffusers/quantizers/gguf/utils.py
DN6 Dec 5, 2024
db9b6f3
update
DN6 Dec 5, 2024
4c0360a
Merge branch 'gguf-support' of https://github.com/huggingface/diffuse…
DN6 Dec 5, 2024
aa7659b
Merge branch 'main' into gguf-support
DN6 Dec 5, 2024
78c7861
update
DN6 Dec 5, 2024
33eb431
update
DN6 Dec 5, 2024
9651ddc
update
DN6 Dec 5, 2024
746fd2f
update
DN6 Dec 5, 2024
e027d46
update
DN6 Dec 5, 2024
9db2396
update
DN6 Dec 6, 2024
7ee89f4
update
DN6 Dec 6, 2024
edf3e54
update
DN6 Dec 6, 2024
d3eb54f
update
DN6 Dec 6, 2024
82606cb
Merge branch 'main' into gguf-support
sayakpaul Dec 9, 2024
4f34f14
Update docs/source/en/quantization/gguf.md
DN6 Dec 11, 2024
090efdb
update
DN6 Dec 11, 2024
391b5a9
Merge branch 'main' into gguf-support
DN6 Dec 17, 2024
e67c25a
update
DN6 Dec 17, 2024
e710bde
update
DN6 Dec 17, 2024
f59e07a
update
DN6 Dec 17, 2024
2 changes: 2 additions & 0 deletions .github/workflows/nightly_tests.yml
@@ -357,6 +357,8 @@ jobs:
config:
- backend: "bitsandbytes"
test_location: "bnb"
- backend: "gguf"
test_location: "gguf"
runs-on:
group: aws-g6e-xlarge-plus
container:
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -157,6 +157,8 @@
title: Getting Started
- local: quantization/bitsandbytes
title: bitsandbytes
- local: quantization/gguf
title: gguf
- local: quantization/torchao
title: torchao
title: Quantization Methods
3 changes: 3 additions & 0 deletions docs/source/en/api/quantization.md
@@ -28,6 +28,9 @@ Learn how to quantize models in the [Quantization](../quantization/overview) gui

[[autodoc]] BitsAndBytesConfig

## GGUFQuantizationConfig

[[autodoc]] GGUFQuantizationConfig
## TorchAoConfig

[[autodoc]] TorchAoConfig
70 changes: 70 additions & 0 deletions docs/source/en/quantization/gguf.md
@@ -0,0 +1,70 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

-->

# GGUF

The GGUF file format is typically used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and supports a variety of block-wise quantization options. Diffusers supports loading checkpoints prequantized and saved in the GGUF format via `from_single_file` loading with Model classes. Loading GGUF checkpoints via Pipelines is currently not supported.

The following example will load the [FLUX.1 DEV](https://huggingface.co/black-forest-labs/FLUX.1-dev) transformer model using the GGUF Q2_K quantization variant.

Before starting, please install gguf in your environment:

```shell
pip install -U gguf
```

Since GGUF is a single file format, use [`~FromSingleFileMixin.from_single_file`] to load the model and pass in the [`GGUFQuantizationConfig`].

When using GGUF checkpoints, the quantized weights remain in a low-memory `dtype` (typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`.

The functions used for dynamic dequantization are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the PyTorch ports of the original [`numpy`](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by [compilade](https://github.com/compilade).

```python
import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")
```
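
To make the dynamic dequantization described above a bit more concrete, the snippet below is a minimal, illustrative sketch of how a single Q8_0-quantized tensor could be dequantized, assuming the standard GGML Q8_0 block layout (each block stores a float16 scale followed by 32 int8 values). It is not the implementation used in `diffusers.quantizers.gguf.utils`, which covers many more quantization types.

```python
import torch


def dequantize_q8_0(blocks: torch.Tensor, compute_dtype=torch.bfloat16) -> torch.Tensor:
    # `blocks` is a uint8 tensor of shape (n_blocks, 34): bytes 0-1 hold a little-endian
    # float16 scale and bytes 2-33 hold 32 int8 quantized values.
    scales = blocks[:, :2].contiguous().view(torch.float16).to(compute_dtype)  # (n_blocks, 1)
    quants = blocks[:, 2:].contiguous().view(torch.int8).to(compute_dtype)     # (n_blocks, 32)
    # Each block is reconstructed as scale * quantized values.
    return (scales * quants).reshape(-1)
```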

## Supported Quantization Types

- BF16
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q2_K
- Q3_K
- Q4_K
- Q5_K
- Q6_K
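
If you are not sure which of these types a particular checkpoint uses, one quick way to check is to read the tensor metadata with the `gguf` library. This is a small sketch; the file name below is just an example.

```python
from gguf import GGUFReader

# List the quantization types used by the tensors in a GGUF checkpoint.
reader = GGUFReader("flux1-dev-Q2_K.gguf")  # example path; point this at your local file
print({tensor.tensor_type.name for tensor in reader.tensors})
```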

9 changes: 7 additions & 2 deletions docs/source/en/quantization/overview.md
@@ -17,7 +17,7 @@ Quantization techniques focus on representing data with less information while a

<Tip>

Interested in adding a new quantization method to Transformers? Refer to the [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) to learn more about adding a new quantization method.
Interested in adding a new quantization method to Diffusers? Refer to the [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) to learn more about adding a new quantization method.

</Tip>

@@ -32,4 +32,9 @@ If you are new to the quantization field, we recommend you to check out these be

## When to use what?

Diffusers supports [bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/index) and [torchao](https://github.com/pytorch/ao). Refer to this [table](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) to help you determine which quantization backend to use.
Diffusers currently supports the following quantization methods:
- [BitsandBytes](./bitsandbytes)
- [TorchAO](./torchao)
- [GGUF](./gguf)

[This resource](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) provides a good overview of the pros and cons of different quantization techniques.
4 changes: 2 additions & 2 deletions src/diffusers/__init__.py
@@ -31,7 +31,7 @@
"loaders": ["FromOriginalModelMixin"],
"models": [],
"pipelines": [],
"quantizers.quantization_config": ["BitsAndBytesConfig", "TorchAoConfig"],
"quantizers.quantization_config": ["BitsAndBytesConfig", "GGUFQuantizationConfig", "TorchAoConfig"],
"schedulers": [],
"utils": [
"OptionalDependencyNotAvailable",
@@ -569,7 +569,7 @@

if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from .configuration_utils import ConfigMixin
from .quantizers.quantization_config import BitsAndBytesConfig, TorchAoConfig
from .quantizers.quantization_config import BitsAndBytesConfig, GGUFQuantizationConfig, TorchAoConfig

try:
if not is_onnx_available():
46 changes: 44 additions & 2 deletions src/diffusers/loaders/single_file_model.py
@@ -17,8 +17,10 @@
from contextlib import nullcontext
from typing import Optional

import torch
from huggingface_hub.utils import validate_hf_hub_args

from ..quantizers import DiffusersAutoQuantizer
from ..utils import deprecate, is_accelerate_available, logging
from .single_file_utils import (
SingleFileComponentError,
@@ -214,6 +216,8 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
subfolder = kwargs.pop("subfolder", None)
revision = kwargs.pop("revision", None)
torch_dtype = kwargs.pop("torch_dtype", None)
quantization_config = kwargs.pop("quantization_config", None)
device = kwargs.pop("device", None)

if isinstance(pretrained_model_link_or_path_or_dict, dict):
checkpoint = pretrained_model_link_or_path_or_dict
@@ -227,6 +231,12 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
local_files_only=local_files_only,
revision=revision,
)
if quantization_config is not None:
hf_quantizer = DiffusersAutoQuantizer.from_config(quantization_config)
hf_quantizer.validate_environment()

else:
hf_quantizer = None
Comment on lines +234 to +239
Member

For GGUF files, I'm wondering if it would be nice to allow the user to load the model without necessarily having to specify quantization_config=GGUFQuantizationConfig(compute_dtype=xxx). If we detect that the file is a GGUF checkpoint, we could default to quantization_config = GGUFQuantizationConfig(compute_dtype=torch.float32).
I'm suggesting this because usually, when you pass a quantization_config, it means either that the model is not quantized (bnb) or that the model is quantized (there is a quantization_config in the config.json) but you want to change a few arguments.

Also, what happens when the user passes a GGUF file without specifying the quantization_config?
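
A minimal sketch of the fallback being proposed here (the helper name and placement are hypothetical, not code from this PR):

```python
import torch

from diffusers import GGUFQuantizationConfig


def _maybe_default_gguf_config(pretrained_model_link_or_path, quantization_config):
    # If the checkpoint is a GGUF file and no config was passed, fall back to a
    # float32 compute dtype so loading works without an explicit quantization_config.
    if quantization_config is None and str(pretrained_model_link_or_path).endswith(".gguf"):
        return GGUFQuantizationConfig(compute_dtype=torch.float32)
    return quantization_config
```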

Member

Yeah this is a good point! I think for most users, the entrypoint for GGUF files is going to be through from_single_file() and I agree with the logic you mentioned.

Collaborator Author
@DN6 DN6 Dec 3, 2024

I agree that this is a nice convenience. GGUF does have all the information we need to auto-fetch the config (honestly, it's possible to skip the config altogether), but it would mean that loading semantics would be different for GGUF vs. other quant types, e.g.:

GGUF:

model = FluxTransformer2DModel.from_single_file("<>.gguf")

BnB and TorchAO (assuming these can be supported):

model = FluxTransformer2DModel.from_single_file("<path>", quantization_config=BnBConfig)
model = FluxTransformer2DModel.from_single_file("<path>", quantization_config=TorchAOConfig)

GGUF can also be used through from_pretrained (assuming quants of diffusers-format checkpoints show up at some point), and we would have to pass a quant config in that case. I understand it's not ideal, but I feel it's better to preserve consistency across the different quant loading methods.

@SunMarc if the config isn't passed, you get shape mismatch errors when you hit load_model_dict_into_meta, since the quantized shapes are different from the expected shapes.

Collaborator

I'm suggesting this because usually, when you pass a quantization_config, it means either that the model is not quantized (bnb) or that the model is quantized (there is a quantization_config in the config.json) but we want to change a few arguments.

Yeah, I thought about that too, but I think the API for from_single_file and from_pretrained might just have to be different. It is a bit confusing, but I'm not sure there is a way to make it consistent between from_single_file and from_pretrained if we also want to make sure the same API is consistent across different quant types.

GGUF is a special case here because it has a built-in config. Normally, for single-file it is just a checkpoint without a config, so you will always have to pass a config (at least I think so — is that right, @DN6?). So for loading a regular quantized model (e.g. BnB) we can load it with from_pretrained without passing a config, but with from_single_file we will have to pass a config manually.

So I agree with @DN6 here: I think it is more important to make the same API (from_pretrained or from_single_file) consistent across different quant types, if we have to choose one.

But if there is a way to make it consistent between from_pretrained and from_single_file and across all quant types, that would be great!

Collaborator

Also, I want to know: do we plan to support quantizing a model in from_single_file? @DN6

Member

GGUF is a special case here because it has built-in config. Normally, for single-file it is just a checkpoint without config, so you will always have to pass a config (at least I think so, is it? @DN6 ). So for loading a regular quantized model (e.g. BNB) we can load it with from_pretrained without passing a config, but for from_single_file, we will have to manually pass a config

Would it make sense to at least make the user aware when the passed config and the detected config mismatch, and whether that could lead to unintended consequences?

Also, I want to know: do we plan to support quantizing a model in from_single_file? @DN6

Supporting quantizing in the GGUF format (regardless of from_pretrained() or from_single_file()) would be reallllly nice.

Collaborator Author
@DN6 DN6 Dec 4, 2024

@yiyixuxu Yeah, we can definitely support quantizing a model via single file. For GGUF, I can look into it in a follow-up because we would have to port the quantize functions to torch (the gguf library uses numpy). We could use the gguf library internally to quantize, but it's quite slow since we would have to move tensors off the GPU, convert to numpy, and then quantize.

With torchao, I'm pretty sure it would work out of the box.

You would have to save it with save_pretrained though, since we don't support serializing single-file checkpoints.
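
For reference, the numpy round trip described above would look roughly like this — a sketch assuming gguf's numpy-based `gguf.quants.quantize` helper, not code from this PR:

```python
import gguf
import torch
from gguf.quants import quantize


def quantize_tensor_with_gguf(tensor: torch.Tensor, qtype=gguf.GGMLQuantizationType.Q8_0):
    # Move the tensor off the GPU, convert it to numpy, then quantize with gguf.
    data = tensor.detach().to("cpu", torch.float32).numpy()
    return quantize(data, qtype)
```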

Member

So, what I am hearing is that saving a GGUF-quantized model would be added in a follow-up PR? That is also okay, but it could be quite an enabling factor for the community.

For GGUF I can look into in a follow up because we would have to port the quantize functions to torch (the gguf library uses numpy). We could use the gguf library interally to quantize but it's quite slow since we would have to move tensors off GPU, convert to numpy and then quantize.

I think the porting option is preferable.

I think with torch AO I'm pretty sure it would work just out of the box.

You mean serializing with torchao but with quantization configs similar to the ones provided in GGUF?


mapping_functions = SINGLE_FILE_LOADABLE_CLASSES[mapping_class_name]

@@ -309,8 +319,36 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
with ctx():
model = cls.from_config(diffusers_model_config)

# Check if `_keep_in_fp32_modules` is not None
use_keep_in_fp32_modules = (cls._keep_in_fp32_modules is not None) and (
(torch_dtype == torch.float16) or hasattr(hf_quantizer, "use_keep_in_fp32_modules")
)
if use_keep_in_fp32_modules:
keep_in_fp32_modules = cls._keep_in_fp32_modules
if not isinstance(keep_in_fp32_modules, list):
keep_in_fp32_modules = [keep_in_fp32_modules]

else:
keep_in_fp32_modules = []

if hf_quantizer is not None:
hf_quantizer.preprocess_model(
model=model,
device_map=None,
state_dict=diffusers_format_checkpoint,
keep_in_fp32_modules=keep_in_fp32_modules,
)

if is_accelerate_available():
unexpected_keys = load_model_dict_into_meta(model, diffusers_format_checkpoint, dtype=torch_dtype)
param_device = torch.device(device) if device else torch.device("cpu")
unexpected_keys = load_model_dict_into_meta(
model,
diffusers_format_checkpoint,
dtype=torch_dtype,
device=param_device,
hf_quantizer=hf_quantizer,
keep_in_fp32_modules=keep_in_fp32_modules,
)

else:
_, unexpected_keys = model.load_state_dict(diffusers_format_checkpoint, strict=False)
@@ -324,7 +362,11 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
f"Some weights of the model checkpoint were not used when initializing {cls.__name__}: \n {[', '.join(unexpected_keys)]}"
)

if torch_dtype is not None:
if hf_quantizer is not None:
hf_quantizer.postprocess_model(model)
model.hf_quantizer = hf_quantizer

if torch_dtype is not None and hf_quantizer is None:
model.to(torch_dtype)

model.eval()
25 changes: 19 additions & 6 deletions src/diffusers/loaders/single_file_utils.py
@@ -81,8 +81,14 @@
"open_clip_sd3": "text_encoders.clip_g.transformer.text_model.embeddings.position_embedding.weight",
"stable_cascade_stage_b": "down_blocks.1.0.channelwise.0.weight",
"stable_cascade_stage_c": "clip_txt_mapper.weight",
"sd3": "model.diffusion_model.joint_blocks.0.context_block.adaLN_modulation.1.bias",
"sd35_large": "model.diffusion_model.joint_blocks.37.x_block.mlp.fc1.weight",
"sd3": [
Collaborator Author

Need to make this change because SD3/3.5 GGUF single-file checkpoints use different keys than the original model from SAI.

Member

Anything special for Flux?

"joint_blocks.0.context_block.adaLN_modulation.1.bias",
"model.diffusion_model.joint_blocks.0.context_block.adaLN_modulation.1.bias",
],
"sd35_large": [
"joint_blocks.37.x_block.mlp.fc1.weight",
"model.diffusion_model.joint_blocks.37.x_block.mlp.fc1.weight",
],
"animatediff": "down_blocks.0.motion_modules.0.temporal_transformer.transformer_blocks.0.attention_blocks.0.pos_encoder.pe",
"animatediff_v2": "mid_block.motion_modules.0.temporal_transformer.norm.bias",
"animatediff_sdxl_beta": "up_blocks.2.motion_modules.0.temporal_transformer.norm.weight",
@@ -542,13 +548,20 @@ def infer_diffusers_model_type(checkpoint):
):
model_type = "stable_cascade_stage_b"

elif CHECKPOINT_KEY_NAMES["sd3"] in checkpoint and checkpoint[CHECKPOINT_KEY_NAMES["sd3"]].shape[-1] == 9216:
if checkpoint["model.diffusion_model.pos_embed"].shape[1] == 36864:
elif any(key in checkpoint for key in CHECKPOINT_KEY_NAMES["sd3"]) and any(
checkpoint[key].shape[-1] == 9216 if key in checkpoint else False for key in CHECKPOINT_KEY_NAMES["sd3"]
):
if "model.diffusion_model.pos_embed" in checkpoint:
key = "model.diffusion_model.pos_embed"
else:
key = "pos_embed"

if checkpoint[key].shape[1] == 36864:
model_type = "sd3"
elif checkpoint["model.diffusion_model.pos_embed"].shape[1] == 147456:
elif checkpoint[key].shape[1] == 147456:
model_type = "sd35_medium"

elif CHECKPOINT_KEY_NAMES["sd35_large"] in checkpoint:
elif any(key in checkpoint for key in CHECKPOINT_KEY_NAMES["sd35_large"]):
model_type = "sd35_large"

elif CHECKPOINT_KEY_NAMES["animatediff"] in checkpoint:
84 changes: 83 additions & 1 deletion src/diffusers/models/model_loading_utils.py
@@ -17,6 +17,7 @@
import importlib
import inspect
import os
from array import array
from collections import OrderedDict
from pathlib import Path
from typing import List, Optional, Union
@@ -26,13 +27,16 @@
from huggingface_hub.utils import EntryNotFoundError

from ..utils import (
GGUF_FILE_EXTENSION,
SAFE_WEIGHTS_INDEX_NAME,
SAFETENSORS_FILE_EXTENSION,
WEIGHTS_INDEX_NAME,
_add_variant,
_get_model_file,
deprecate,
is_accelerate_available,
is_gguf_available,
is_torch_available,
is_torch_version,
logging,
)
@@ -139,6 +143,8 @@ def load_state_dict(checkpoint_file: Union[str, os.PathLike], variant: Optional[
file_extension = os.path.basename(checkpoint_file).split(".")[-1]
if file_extension == SAFETENSORS_FILE_EXTENSION:
return safetensors.torch.load_file(checkpoint_file, device="cpu")
elif file_extension == GGUF_FILE_EXTENSION:
return load_gguf_checkpoint(checkpoint_file)
else:
weights_only_kwarg = {"weights_only": True} if is_torch_version(">=", "1.13") else {}
return torch.load(
@@ -211,13 +217,14 @@ def load_model_dict_into_meta(
set_module_kwargs["dtype"] = dtype

# bnb params are flattened.
# gguf quants have a different shape based on the type of quantization applied
if empty_state_dict[param_name].shape != param.shape:
if (
is_quantized
and hf_quantizer.pre_quantized
and hf_quantizer.check_if_quantized_param(model, param, param_name, state_dict, param_device=device)
):
hf_quantizer.check_quantized_param_shape(param_name, empty_state_dict[param_name].shape, param.shape)
hf_quantizer.check_quantized_param_shape(param_name, empty_state_dict[param_name], param)
else:
model_name_or_path_str = f"{model_name_or_path} " if model_name_or_path is not None else ""
raise ValueError(
@@ -396,3 +403,78 @@ def _fetch_index_file_legacy(
index_file = None

return index_file


def _gguf_parse_value(_value, data_type):
    if not isinstance(data_type, list):
        data_type = [data_type]
    if len(data_type) == 1:
        data_type = data_type[0]
        array_data_type = None
    else:
        if data_type[0] != 9:
            raise ValueError("Received multiple types, therefore expected the first type to indicate an array.")
        data_type, array_data_type = data_type

    if data_type in [0, 1, 2, 3, 4, 5, 10, 11]:
        _value = int(_value[0])
    elif data_type in [6, 12]:
        _value = float(_value[0])
    elif data_type in [7]:
        _value = bool(_value[0])
    elif data_type in [8]:
        _value = array("B", list(_value)).tobytes().decode()
    elif data_type in [9]:
        _value = _gguf_parse_value(_value, array_data_type)
    return _value


def load_gguf_checkpoint(gguf_checkpoint_path, return_tensors=False):
    """
    Load a GGUF file and return a dictionary of parsed parameters containing tensors, the parsed tokenizer and config
    attributes.

    Args:
        gguf_checkpoint_path (`str`):
            The path to the GGUF file to load.
        return_tensors (`bool`, defaults to `False`):
            Whether to read the tensors from the file and return them. Not doing so is faster and only loads the
            metadata in memory.
    """

    if is_gguf_available() and is_torch_available():
        import gguf
        from gguf import GGUFReader

        from ..quantizers.gguf.utils import SUPPORTED_GGUF_QUANT_TYPES, GGUFParameter
    else:
        logger.error(
            "Loading a GGUF checkpoint in PyTorch requires both PyTorch and GGUF>=0.10.0 to be installed. Please see "
Collaborator

Do we need to check the gguf version as well (in addition to is_gguf_available)?

Member

Agree. Let's always suggest installing the latest stable build of gguf, like we do for bitsandbytes:

if not is_bitsandbytes_available() or is_bitsandbytes_version("<", "0.43.3"):
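
A sketch of what that guard could look like, assuming an `is_gguf_version` helper is added alongside `is_gguf_available` (mirroring the bitsandbytes check quoted above):

```python
from diffusers.utils import is_gguf_available, is_gguf_version  # is_gguf_version is assumed here

if not is_gguf_available() or is_gguf_version("<", "0.10.0"):
    raise ImportError("Please install gguf>=0.10.0 to load a GGUF checkpoint in PyTorch (`pip install -U gguf`).")
```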

"https://pytorch.org/ and https://github.com/ggerganov/llama.cpp/tree/master/gguf-py for installation instructions."
)
raise ImportError("Please install torch and gguf>=0.10.0 to load a GGUF checkpoint in PyTorch.")

    reader = GGUFReader(gguf_checkpoint_path)

    parsed_parameters = {}
    for tensor in reader.tensors:
        name = tensor.name
        quant_type = tensor.tensor_type

        # if the tensor is a torch supported dtype do not use GGUFParameter
        is_gguf_quant = quant_type not in [gguf.GGMLQuantizationType.F32, gguf.GGMLQuantizationType.F16]
Member

We could create a NON_TORCH_GGUF_DTYPE enum or set with these two values (gguf.GGMLQuantizationType.F32, gguf.GGMLQuantizationType.F16) and use NON_TORCH_GGUF_DTYPE here instead.
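
A small sketch of that suggestion (the constant name is the one proposed above; this is illustrative, not code from this PR):

```python
import gguf

# The two GGUF dtypes that torch can represent natively, used to decide
# when a tensor does *not* need to be wrapped in GGUFParameter.
NON_TORCH_GGUF_DTYPE = {gguf.GGMLQuantizationType.F32, gguf.GGMLQuantizationType.F16}

# The inline check in load_gguf_checkpoint would then become:
# is_gguf_quant = quant_type not in NON_TORCH_GGUF_DTYPE
```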

        if is_gguf_quant and quant_type not in SUPPORTED_GGUF_QUANT_TYPES:
            _supported_quants_str = "\n".join([str(type) for type in SUPPORTED_GGUF_QUANT_TYPES])
            raise ValueError(
                (
                    f"{name} has a quantization type: {str(quant_type)} which is unsupported."
                    "\n\nCurrently the following quantization types are supported: \n\n"
                    f"{_supported_quants_str}"
                    "\n\nTo request support for this quantization type please open an issue here: https://github.com/huggingface/diffusers"
                )
            )

        weights = torch.from_numpy(tensor.data.copy())
        parsed_parameters[name] = GGUFParameter(weights, quant_type=quant_type) if is_gguf_quant else weights

    return parsed_parameters