Fused attention: Switch to Flash Decoding #656

casper-hansen · 2024-11-26T16:36:31Z

Current implementation

Device: cuda:0
GPU: NVIDIA GeForce RTX 4090
Model: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
Version: gemm
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)    |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:-----------------|
|            1 |               32 |              32 |             213.47 |             97.07 | 5.48 GB (23.16%) |
|            1 |               64 |              64 |            3985.32 |             96.23 | 5.48 GB (23.20%) |
|            1 |              128 |             128 |            4977.39 |             94.95 | 5.50 GB (23.27%) |
|            1 |              256 |             256 |            5416.4  |             94.45 | 5.54 GB (23.42%) |
|            1 |              512 |             512 |            5403.73 |             93.73 | 5.64 GB (23.87%) |
|            1 |             1024 |            1024 |            7218.92 |             92.74 | 5.89 GB (24.90%) |
|            1 |             2048 |            2048 |            7684.16 |             83.76 | 6.43 GB (27.21%) |
|            1 |             4096 |            4096 |            7308.05 |             59.79 | 7.52 GB (31.82%) |

With `flash-attn`

Device: cuda:0
GPU: NVIDIA GeForce RTX 4090
Model: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
Version: gemm
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)    |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:-----------------|
|            1 |               32 |              32 |             232.19 |            107.17 | 5.48 GB (23.16%) |
|            1 |               64 |              64 |            4030.56 |            106.3  | 5.48 GB (23.20%) |
|            1 |              128 |             128 |            5044.31 |            104.98 | 5.50 GB (23.27%) |
|            1 |              256 |             256 |            5290.12 |            104.99 | 5.54 GB (23.42%) |
|            1 |              512 |             512 |            5457.14 |            104.55 | 5.64 GB (23.87%) |
|            1 |             1024 |            1024 |            7465.82 |            104.06 | 5.88 GB (24.85%) |
|            1 |             2048 |            2048 |            8284.3  |            104.03 | 6.41 GB (27.12%) |
|            1 |             4096 |            4096 |            8487.37 |            103.77 | 7.48 GB (31.63%) |
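
To make the comparison between the two tables easier to read, here is a minimal sketch that computes the decode speedup per context length. The numbers are copied from the "Decode tokens/s" columns above; the variable names are only illustrative.

```python
# Decode tokens/s keyed by prefill/decode length, copied from the tables above.
baseline = {32: 97.07, 64: 96.23, 128: 94.95, 256: 94.45,
            512: 93.73, 1024: 92.74, 2048: 83.76, 4096: 59.79}
flash = {32: 107.17, 64: 106.3, 128: 104.98, 256: 104.99,
         512: 104.55, 1024: 104.06, 2048: 104.03, 4096: 103.77}

# Ratio of flash-attn decode throughput to the current implementation.
speedup = {n: flash[n] / baseline[n] for n in baseline}

for n, s in speedup.items():
    print(f"{n:>5} tokens: {s:.2f}x")
```

The gain is roughly 1.10x at short contexts and grows to about 1.74x at 4096 tokens, because the current implementation slows down as the context grows while the `flash-attn` decode speed stays nearly flat.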

vince62s · 2024-12-19T09:57:00Z

@casper-hansen
I just ran the benchmark on my machine (latest kernels, latest AWQ, latest flash-attn 2.7.2.post1).

There is a big difference in tok/s, so my guess is that it is just the CPU (I have a Ryzen 59XX).
Also, my VRAM report matches your numbers, but taking the last line as an example (7.48 GB), nvidia-smi peaks at 9060 MB, so I am not sure what each of the two actually reports.
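
A possible explanation for the gap, sketched below under assumptions: if the benchmark figure is PyTorch's peak *allocated* memory (an assumption, the thread does not confirm it), it would exclude the CUDA context and any blocks the caching allocator has reserved but not handed out, both of which nvidia-smi counts. The helper names `mib_to_gib` and `torch_memory_gib` are hypothetical; `torch.cuda.max_memory_allocated` and `torch.cuda.max_memory_reserved` are the PyTorch calls used for the comparison.

```python
def mib_to_gib(mib):
    """Convert the MiB figure shown by nvidia-smi to GiB."""
    return mib / 1024


def torch_memory_gib(device=0):
    """Return PyTorch's peak allocated/reserved memory in GiB, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    gib = 1024 ** 3
    return {
        # Peak memory occupied by live tensors.
        "peak_allocated": torch.cuda.max_memory_allocated(device) / gib,
        # Peak memory held by the caching allocator, including free cached blocks.
        "peak_reserved": torch.cuda.max_memory_reserved(device) / gib,
    }


smi_gib = mib_to_gib(9060)   # nvidia-smi peak from the comment above
reported_gib = 7.48          # benchmark figure from the last table row
gap_gib = smi_gib - reported_gib

print(f"nvidia-smi: {smi_gib:.2f} GiB, benchmark: {reported_gib:.2f} GiB, gap: {gap_gib:.2f} GiB")
print(torch_memory_gib())
```

Under that assumption, about 1.37 GiB would be accounted for by memory that PyTorch does not report as allocated. Comparing `peak_reserved` against nvidia-smi on the same run would show how much of the gap is allocator cache versus CUDA context.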

 -- Loading model...
Fetching 12 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 22692.36it/s]
Replacing layers...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:03<00:00, 10.58it/s]
Fusing layers...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 516.61it/s]
 -- Warming up...
 -- Generating 4096 tokens, 4096 in context...
 ** Speed (Prefill): 8245.03 tokens/second
 ** Speed (Decode): 59.22 tokens/second
 ** Max Memory (device: 0): 7.48 GB (31.63%)
Device: cuda:0
GPU: NVIDIA GeForce RTX 4090
Model: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
Version: gemm
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)    |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:-----------------|
|            1 |              256 |             256 |            1007.16 |             59.57 | 5.54 GB (23.42%) |
|            1 |              512 |             512 |            5284.45 |             60.71 | 5.64 GB (23.85%) |
|            1 |             1024 |            1024 |            6831.32 |             58.5  | 5.88 GB (24.85%) |
|            1 |             2048 |            2048 |            7676.86 |             58.04 | 6.41 GB (27.12%) |
|            1 |             4096 |            4096 |            8245.03 |             59.22 | 7.48 GB (31.63%) |

casper-hansen · 2024-12-19T10:02:09Z

Yes, the difference is the CPU. I had a Ryzen 9 7950x for this benchmark.

casper-hansen added 5 commits November 24, 2024 11:10

initial refactor · 576078e

generation is coherent · 79d9f03

flash_attn_with_kvcache for decoding · 8915317

cleanup + formatting · 0913bc2

add flash-attn to kernels extras · 902779f
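
For readers unfamiliar with the idea behind the "flash_attn_with_kvcache for decoding" commit, the following is a pure-Python reference sketch of split-KV decoding attention, the technique usually called Flash Decoding. It is not the code from this PR and not the `flash-attn` kernel: the function names are hypothetical, it handles a single query vector and a single head, and it runs the chunks sequentially where a GPU kernel would process them in parallel.

```python
import math
import random


def attend_chunk(q, keys, values, scale):
    """Attend over one KV chunk; return (unnormalized output, max score, sum of exps)."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, m, sum(weights)


def split_kv_attention(q, keys, values, chunk_size):
    """Single-query attention computed per KV chunk, then merged with a log-sum-exp rescale."""
    scale = 1.0 / math.sqrt(len(q))
    partials = [
        attend_chunk(q, keys[i:i + chunk_size], values[i:i + chunk_size], scale)
        for i in range(0, len(keys), chunk_size)
    ]
    global_max = max(m for _, m, _ in partials)
    dim = len(values[0])
    merged = [0.0] * dim
    total = 0.0
    for out, m, denom in partials:
        correction = math.exp(m - global_max)  # rescale the chunk to the global max
        total += correction * denom
        for d in range(dim):
            merged[d] += correction * out[d]
    return [x / total for x in merged]


def reference_attention(q, keys, values):
    """Plain softmax attention over the whole KV cache, used as a check."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(len(values[0]))]


rng = random.Random(0)
head_dim, seq_len = 8, 50
q = [rng.gauss(0, 1) for _ in range(head_dim)]
keys = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]
values = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]

split = split_kv_attention(q, keys, values, chunk_size=16)
full = reference_attention(q, keys, values)
max_error = max(abs(a - b) for a, b in zip(split, full))
print(f"max abs difference: {max_error:.2e}")
```

Because each chunk only needs its own maximum and sum of exponentials, the chunks are independent until the final merge. That independence is what lets a decode step with a long KV cache spread its work across the GPU, which is consistent with the flat decode throughput in the `flash-attn` table above.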

casper-hansen merged commit dfe396a into main Nov 26, 2024

casper-hansen mentioned this pull request Dec 13, 2024

[MAJOR] Update to TinyChat 2.0 mit-han-lab/llm-awq#244

Merged

casper-hansen deleted the flash_attn branch December 30, 2024 21:23
