Idea: Dequantize + Fused MoE #323
Based on the amazing work by @zwd003 and @pcmoritz, a potential strategy for a speedup in AutoAWQ could be to run dequantization first and then the fused MoE kernel. I have doubts about whether this will give a speedup during decoding, but it's worth a shot since the Triton kernel is already developed and would require minimal effort to integrate. A rough sketch of this two-step approach follows below.
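To make the two-step idea concrete, here is a minimal PyTorch sketch. It is illustrative only: it assumes a simplified sequential 4-bit packing (real AWQ packs eight 4-bit values per int32 in an interleaved order), assumes zero points are already unpacked per group, and the names `dequantize_awq` and `moe_forward` plus all tensor layouts are hypothetical, not AutoAWQ's actual API.

```python
import torch

def dequantize_awq(qweight, scales, izeros, group_size=128):
    # Hypothetical layouts (sequential nibble order assumed, unlike real AWQ):
    # qweight: (in_features, out_features // 8) int32, eight 4-bit values per int32
    # scales:  (in_features // group_size, out_features) fp16
    # izeros:  (in_features // group_size, out_features) int32, zeros pre-unpacked
    shifts = torch.arange(0, 32, 4, device=qweight.device)
    # unpack to (in_features, out_features) integers in [0, 15]
    iweight = ((qweight.unsqueeze(-1) >> shifts) & 0xF).view(qweight.shape[0], -1)
    # broadcast each group's scale / zero over the group_size input rows it covers
    scales = scales.repeat_interleave(group_size, dim=0)
    izeros = izeros.repeat_interleave(group_size, dim=0)
    return (iweight - izeros).half() * scales

def moe_forward(x, experts, topk_ids, topk_weights):
    # Reference MoE: dequantize each routed expert, then a plain fp16 matmul.
    # The fused MoE Triton kernel would replace this Python loop with one launch.
    out = torch.zeros(x.shape[0], experts[0][1].shape[1],
                      dtype=x.dtype, device=x.device)
    for e, (qw, scales, izeros) in enumerate(experts):
        rows, slots = (topk_ids == e).nonzero(as_tuple=True)
        if rows.numel() == 0:
            continue
        w = dequantize_awq(qw, scales, izeros)  # fp16 (in_features, out_features)
        gate = topk_weights[rows, slots].to(x.dtype).unsqueeze(-1)
        out.index_add_(0, rows, gate * (x[rows] @ w))
    return out
```

As I understand it, the fused kernel's advantage over this reference loop comes from grouping tokens by expert and running all expert matmuls in a single grouped GEMM launch, which is what vllm-project/vllm#2542 implements for fp16 weights.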
Ideally, we would dequantize inside the Triton kernel itself, removing the extra pass over the weights and making it as optimized as possible. However, that requires integrating the dequantization code into the fused MoE kernel; a hedged sketch of what that inner loop could look like follows the links below.
Fused MoE kernel: vllm-project/vllm#2542
AWQ Triton Dequantization: https://github.com/vllm-project/vllm/blob/qmm/vllm/model_executor/layers/quantization/ops/awq.py
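For the fully fused variant, dequantization would happen inside the kernel's K-loop so the int4 weights never materialize in fp16 global memory. Below is a hedged Triton sketch of a plain dequantizing GEMM (not the full MoE kernel): it keeps the simplified sequential nibble order and pre-unpacked zero points from the sketch above, and omits boundary masks by assuming M, N, and K are multiples of the block sizes. It is a sketch of the technique, not vLLM's or AutoAWQ's actual kernel.

```python
import triton
import triton.language as tl

@triton.jit
def dequant_gemm_kernel(
    x_ptr, qw_ptr, scales_ptr, zeros_ptr, out_ptr,
    M, N, K, group_size,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Launch grid (assumed): (M // BLOCK_M, N // BLOCK_N).
    # Each program computes one BLOCK_M x BLOCK_N tile of out = x @ dequant(qw).
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k0 in range(0, K, BLOCK_K):
        offs_k = k0 + tl.arange(0, BLOCK_K)
        # activations: row-major (M, K) fp16
        x = tl.load(x_ptr + offs_m[:, None] * K + offs_k[None, :])
        # packed weights: (K, N // 8) int32; output column n sits in packed
        # column n // 8 at nibble n % 8 (sequential pack order assumed)
        q = tl.load(qw_ptr + offs_k[:, None] * (N // 8) + offs_n[None, :] // 8)
        w_int = (q >> ((offs_n[None, :] % 8) * 4)) & 0xF
        # per-group scale and pre-unpacked zero point: (K // group_size, N)
        g = offs_k[:, None] // group_size
        s = tl.load(scales_ptr + g * N + offs_n[None, :])
        z = tl.load(zeros_ptr + g * N + offs_n[None, :])
        # dequantize in registers and feed the tile straight into the dot
        w = (w_int - z).to(tl.float16) * s
        acc += tl.dot(x, w)
    tl.store(out_ptr + offs_m[:, None] * N + offs_n[None, :], acc.to(tl.float16))
```

A fused MoE version would add expert routing on top of this, with each tile picking its expert's weight base pointer from a sorted token-to-expert mapping. Whether the extra shift/mask work in the inner loop beats the two-step version at decode batch sizes is exactly the open question in this issue.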
Comments

Have you gotten any results on performance? I tried integrating the fused MoE Triton kernel with the AutoGPTQ Triton kernel yesterday, but it turned out to be much slower than the old vLLM implementation: end-to-end latency was over 30% worse at every batch size I tested.

The AutoGPTQ kernel is already pretty slow as-is. I have benchmarked the performance against the transformers version, which shows a 3-5x speedup depending on problem size. I'm not sure I will have time to implement the kernel right now, but I'm very interested in including it in AutoAWQ in the future. Another candidate is looking into the CUTLASS kernels that Woosuk is working on.

@chu-tianxiang Closing this issue as I have now implemented fused MoE based on your code in vLLM. A new issue has popped up that we can discuss separately: #341