
AWQ Triton kernels. Make autoawq-kernels optional. #608

Merged: 7 commits into main from triton_only_optional_kernels on Sep 12, 2024

Conversation

casper-hansen (Owner)

This means that the next version of AutoAWQ will no longer install the CUDA kernels automatically and will prefer Triton instead. This is to make distribution of AutoAWQ easier and to focus on quantization rather than inference speed (for fast inference, use vLLM).

  • Import AWQ Triton kernel from vLLM.
  • Prefer autoawq-kernels if installed.
  • Make autoawq-kernels optional and not installed by default (see the dispatch sketch below).
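
For illustration, here is a minimal sketch of the preference order described above. It assumes the CUDA kernels from autoawq-kernels are importable as `awq_ext` and that the Triton path only requires the `triton` package; the helper name is hypothetical and this is not AutoAWQ's actual dispatch code.

```python
import importlib.util


def select_awq_backend() -> str:
    """Pick an AWQ kernel backend: prefer autoawq-kernels, else Triton."""
    # Prefer the optional CUDA kernels when the user has installed them
    # (hypothetically via `pip install autoawq-kernels`).
    if importlib.util.find_spec("awq_ext") is not None:
        return "cuda (autoawq-kernels)"
    # Otherwise fall back to the Triton kernel adapted from vLLM.
    if importlib.util.find_spec("triton") is not None:
        return "triton"
    raise ImportError(
        "Neither autoawq-kernels nor triton is installed; install one of them."
    )


if __name__ == "__main__":
    print(f"AWQ backend: {select_awq_backend()}")
```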

@casper-hansen changed the title from "Triton only optional kernels" to "AWQ Triton kernels. Make autoawq-kernels optional." on Sep 10, 2024
@vince62s

Is the performance loss big?

@casper-hansen (Owner, Author)

The performance of the Triton kernel is about the same as the GEMM kernel in vLLM. I am doing more testing in AutoAWQ to make sure that translates to this repository. In the next version, fused modules will be disabled by default since my focus will mostly be on quantizing new models and testing them.

With the latest kernels, the Triton kernel was actually benchmarked to be 10% faster than the GEMM kernel:

[image: benchmark results]

@casper-hansen (Owner, Author)

casper-hansen commented Sep 12, 2024

It's definitely slower in AutoAWQ, but this is an acceptable loss in speed for the gain in compatibility. Users will have to install the kernels if they want the best speed with native transformers.
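
(For reference, restoring the GEMM numbers below should just be a matter of installing the optional kernels package alongside AutoAWQ, presumably `pip install autoawq-kernels`, assuming the package name is unchanged in the next release.)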

Triton

Device: cuda:0
GPU: NVIDIA GeForce RTX 4090
Model: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
Version: gemm

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---:|---:|---:|---:|---:|---|
| 1 | 32 | 32 | 91.49 | 63.96 | 5.47 GB (23.13%) |
| 1 | 64 | 64 | 2655.54 | 63.96 | 5.47 GB (23.13%) |
| 1 | 128 | 128 | 3279.5 | 63.58 | 5.47 GB (23.13%) |
| 1 | 256 | 256 | 3634.71 | 61.61 | 5.56 GB (23.52%) |
| 1 | 512 | 512 | 3616.49 | 61.76 | 5.78 GB (24.43%) |
| 1 | 1024 | 1024 | 3231.94 | 63.13 | 6.21 GB (26.27%) |
| 1 | 2048 | 2048 | 8995.4 | 63.09 | 7.08 GB (29.93%) |
| 1 | 4096 | 4096 | 8599.73 | 63.33 | 8.81 GB (37.26%) |

GEMM

Device: cuda:0
GPU: NVIDIA GeForce RTX 4090
Model: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
Version: gemm

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---:|---:|---:|---:|---:|---|
| 1 | 32 | 32 | 233.71 | 81.17 | 5.47 GB (23.13%) |
| 1 | 64 | 64 | 3096.18 | 80.75 | 5.47 GB (23.13%) |
| 1 | 128 | 128 | 4574.06 | 80.56 | 5.47 GB (23.13%) |
| 1 | 256 | 256 | 5208.52 | 80.16 | 5.56 GB (23.52%) |
| 1 | 512 | 512 | 5212.34 | 78.28 | 5.78 GB (24.43%) |
| 1 | 1024 | 1024 | 7057.34 | 79.24 | 6.21 GB (26.27%) |
| 1 | 2048 | 2048 | 8357.83 | 79.04 | 7.08 GB (29.93%) |
| 1 | 4096 | 4096 | 8313.17 | 79.22 | 8.81 GB (37.26%) |

@casper-hansen merged commit ae77736 into main on Sep 12, 2024
@casper-hansen deleted the triton_only_optional_kernels branch on December 30, 2024