SpargeAttention: A training-free sparse attention that can accelerate any model inference.

thu-ml/SpargeAttn

SpargeAttn

This repository provides the official implementation of SpargeAttn.

SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Paper: https://arxiv.org/abs/2502.18137
Jintao Zhang, Chendong Xiang, Haofeng Huang, Haocheng Xi, Jia Wei, Jun Zhu, Jianfei Chen

Figure: speed comparison.

Figure: overview.
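The core idea is to skip, at inference time and without any retraining, whole blocks of the attention map that contribute negligibly to the output. The following is a hypothetical, pure-Python sketch of block-level pruning for a single query, meant only to illustrate the principle: the repository's actual implementation is a fused CUDA kernel on quantized GPU tensors, and the pruning criterion here (mean block score against a tuned threshold) is a simplified stand-in.

```python
# Hypothetical sketch of training-free block-sparse attention for one
# query vector. Not the repo's algorithm or kernel; illustration only.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def block_sparse_attention(q, k, v, block_size=2, threshold=0.1):
    """q: list[float]; k, v: list[list[float]] with len(v[i]) == len(q).

    Keys/values are processed in blocks; a block is skipped entirely when
    its mean attention score falls below `threshold` (a tunable sparsity
    hyper-parameter, analogous in spirit to the tuned values this repo
    stores in .pt files).
    """
    d = len(q)
    scale = 1.0 / math.sqrt(d)
    kept_scores, kept_values = [], []
    for start in range(0, len(k), block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, kv)) for kv in kb]
        # Block-level pruning: drop the whole block if its mean score is
        # below the threshold (training-free criterion).
        if sum(scores) / len(scores) < threshold:
            continue
        kept_scores.extend(scores)
        kept_values.extend(vb)
    if not kept_scores:
        # Fall back to dense attention if everything was pruned.
        kept_scores = [scale * sum(qi * ki for qi, ki in zip(q, kv)) for kv in k]
        kept_values = v
    probs = softmax(kept_scores)
    return [sum(p * vv[j] for p, vv in zip(probs, kept_values)) for j in range(d)]
```

Because entire blocks are skipped before the softmax, the pruned blocks cost neither score computation nor value accumulation, which is where the speedup comes from.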

Installation

Base environment

  • python>=3.9, torch>=2.3.0
  • CUDA:
    • >=12.8 for Blackwell
    • >=12.4 for fp8 support on Ada
    • >=12.3 for fp8 support on Hopper
    • >=12.0 for Ampere

Install Package

pip install -e .   # or: python setup.py install

Available API

Usage Examples

CogVideoX

Tuning:

python evaluate/cogvideo_example.py  --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune

Inference:

python evaluate/cogvideo_example.py  --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt

Note: We provide pre-tuned hyper-parameters, CogVideoX-2b_0.06_0.07.pt, so the inference script can be run directly. However, for better speed and quality we recommend re-tuning: the provided hyper-parameters were tuned with SpargeAttn based on SageAttention, whereas the default API is now based on SageAttention2.
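In other words, the --tune pass searches sparsity hyper-parameters per attention layer and serializes them, and later inference runs simply reload that file. Below is a minimal, hypothetical stdlib sketch of the save/reload pattern only (the real script saves torch .pt checkpoints, and the actual tuning search is far more involved):

```python
# Hypothetical sketch of the tune-once / reuse-later pattern behind
# --tune and --model_out_path. The real script stores tuned
# hyper-parameters in a torch .pt file; json is used here purely to
# keep the sketch dependency-free.
import json
import os
import tempfile

def tune_thresholds(layer_names, target=0.05):
    # Stand-in for the tuning pass: the real tuner searches per-layer
    # sparsity thresholds that keep attention error within a target.
    return {name: target for name in layer_names}

def run_pipeline(ckpt_path, layer_names, tune=False):
    if tune or not os.path.exists(ckpt_path):
        thresholds = tune_thresholds(layer_names)   # "--tune" pass
        with open(ckpt_path, "w") as f:
            json.dump(thresholds, f)
    with open(ckpt_path) as f:                      # inference pass
        return json.load(f)

ckpt = os.path.join(tempfile.mkdtemp(), "thresholds.json")
thresholds = run_pipeline(ckpt, ["attn.0", "attn.1"], tune=True)
```

This is why re-tuning matters when the underlying attention backend changes: thresholds searched against one backend (SageAttention) are not necessarily optimal for another (SageAttention2).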

Llama

Tuning and inference work the same way as for CogVideoX.

Supported models

Here’s a list of the model modifications we’ve implemented so far. Our approach is universal, and we warmly welcome contributions! Feel free to submit a pull request to support more models. 🚀

model name | example script               | tuned ckpt
CogVideoX  | evaluate/cogvideo_example.py | evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt
Flux       | evaluate/flux_example.py     | TBD

Performance

Table: performance results.

Note: All experiments in the Table above and in our paper used SpargeAttn based on SageAttention. An updated implementation based on SageAttention2 is now available and offers a further 30% speedup.


Figure: End-to-end video generation on Mochi.
Figure: The quality of video generation on Mochi.
Figure: End-to-end performance on NIAH.

Citation

If you use this code or find our work valuable, please cite:

@misc{zhang2025spargeattn,
      title={SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference}, 
      author={Jintao Zhang and Chendong Xiang and Haofeng Huang and Jia Wei and Haocheng Xi and Jun Zhu and Jianfei Chen},
      year={2025},
      eprint={2502.18137},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.18137}, 
}

@inproceedings{zhang2025sageattention,
      title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration}, 
      author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
      booktitle={International Conference on Learning Representations (ICLR)},
      year={2025}
}

@misc{zhang2024sageattention2,
      title={SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization}, 
      author={Jintao Zhang and Haofeng Huang and Pengle Zhang and Jia Wei and Jun Zhu and Jianfei Chen},
      year={2024},
      eprint={2411.10958},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2411.10958}, 
}