
AWQ Triton kernels. Make autoawq-kernels optional. #608

Merged: 7 commits into main from triton_only_optional_kernels on Sep 12, 2024

Conversation

casper-hansen (Owner)

This means that the next version of AutoAWQ will no longer install the CUDA kernels automatically and will prefer Triton instead. This is to make distribution of AutoAWQ easier and to focus on quantization rather than inference speed (for fast inference, use vLLM).

  • Import AWQ Triton kernel from vLLM.
  • Prefer autoawq-kernels if installed.
  • Make autoawq-kernels optional and not installed by default (see the dispatch sketch below).
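
For illustration, here is a minimal sketch of the preference order described above. It assumes the CUDA kernels from autoawq-kernels are importable as `awq_ext` and that the Triton path only requires the `triton` package; the helper name is hypothetical and this is not AutoAWQ's actual dispatch code.

```python
import importlib.util


def select_awq_backend() -> str:
    """Pick an AWQ kernel backend: prefer autoawq-kernels, else Triton."""
    # Prefer the optional CUDA kernels when the user has installed them
    # (hypothetically via `pip install autoawq-kernels`).
    if importlib.util.find_spec("awq_ext") is not None:
        return "cuda (autoawq-kernels)"
    # Otherwise fall back to the Triton kernel adapted from vLLM.
    if importlib.util.find_spec("triton") is not None:
        return "triton"
    raise ImportError(
        "Neither autoawq-kernels nor triton is installed; install one of them."
    )


if __name__ == "__main__":
    print(f"AWQ backend: {select_awq_backend()}")
```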

@casper-hansen changed the title from "Triton only optional kernels" to "AWQ Triton kernels. Make autoawq-kernels optional." on Sep 10, 2024
@vince62s

Is the performance loss big?

@casper-hansen (Owner, Author)

The performance of the Triton kernel is about the same as the GEMM kernel in vLLM. I am doing more testing in AutoAWQ to make sure that translates to this repository. In the next version, fused modules will be disabled by default since my focus will mostly be on quantizing new models and testing them.

With the latest kernels, the Triton kernel was actually benchmarked to be 10% faster than the GEMM kernel:

[image: benchmark results]

@casper-hansen (Owner, Author)

casper-hansen commented Sep 12, 2024

It's definitely slower in AutoAWQ, but this is an acceptable loss in speed for the gain in compatibility. Users will have to install the kernels if they want the best speed with native transformers.
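
(For reference, restoring the GEMM numbers below should just be a matter of installing the optional kernels package alongside AutoAWQ, presumably `pip install autoawq-kernels`, assuming the package name is unchanged in the next release.)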

Triton

Device: cuda:0
GPU: NVIDIA GeForce RTX 4090
Model: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
Version: gemm

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---:|---:|---:|---:|---:|---|
| 1 | 32 | 32 | 91.49 | 63.96 | 5.47 GB (23.13%) |
| 1 | 64 | 64 | 2655.54 | 63.96 | 5.47 GB (23.13%) |
| 1 | 128 | 128 | 3279.5 | 63.58 | 5.47 GB (23.13%) |
| 1 | 256 | 256 | 3634.71 | 61.61 | 5.56 GB (23.52%) |
| 1 | 512 | 512 | 3616.49 | 61.76 | 5.78 GB (24.43%) |
| 1 | 1024 | 1024 | 3231.94 | 63.13 | 6.21 GB (26.27%) |
| 1 | 2048 | 2048 | 8995.4 | 63.09 | 7.08 GB (29.93%) |
| 1 | 4096 | 4096 | 8599.73 | 63.33 | 8.81 GB (37.26%) |

GEMM

Device: cuda:0
GPU: NVIDIA GeForce RTX 4090
Model: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
Version: gemm

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---:|---:|---:|---:|---:|---|
| 1 | 32 | 32 | 233.71 | 81.17 | 5.47 GB (23.13%) |
| 1 | 64 | 64 | 3096.18 | 80.75 | 5.47 GB (23.13%) |
| 1 | 128 | 128 | 4574.06 | 80.56 | 5.47 GB (23.13%) |
| 1 | 256 | 256 | 5208.52 | 80.16 | 5.56 GB (23.52%) |
| 1 | 512 | 512 | 5212.34 | 78.28 | 5.78 GB (24.43%) |
| 1 | 1024 | 1024 | 7057.34 | 79.24 | 6.21 GB (26.27%) |
| 1 | 2048 | 2048 | 8357.83 | 79.04 | 7.08 GB (29.93%) |
| 1 | 4096 | 4096 | 8313.17 | 79.22 | 8.81 GB (37.26%) |

@casper-hansen merged commit ae77736 into main on Sep 12, 2024
@casper-hansen deleted the triton_only_optional_kernels branch on December 30, 2024