Snowflake documented a new KV-cache optimization that can yield significant performance improvements. They're already integrating this into vLLM.
Specifically, Snowflake has introduced SwiftKV, a method that targets the computational bottleneck of processing long input prompts during inference. In many enterprise use cases, the number of prompt tokens far exceeds the number of generated tokens. SwiftKV reuses the hidden states from an earlier transformer layer to generate the KV cache for all subsequent layers, a technique they call "SingleInputKV"; this skips the redundant prefill computation in later layers, where outputs tend to stabilize. A complementary technique, "AcrossKV", compresses the KV-cache memory footprint and can be used alongside SingleInputKV.
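To make the SingleInputKV idea concrete, here is a minimal sketch of my own (not Snowflake's or vLLM's actual implementation; `DecoderLayer`, `kv_from`, and `swift_layer` are hypothetical names): once the prompt has passed through the first `swift_layer` layers, that single hidden state is projected into the KV cache of every remaining layer, so those layers' prompt-time attention and MLP compute is skipped.

```python
# Illustrative sketch of the SingleInputKV idea, assuming a toy decoder
# with per-layer K/V projections. Not the SwiftKV or vLLM code.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # attention / MLP omitted; only what the sketch needs

    def kv_from(self, hidden: torch.Tensor):
        # Project hidden states into this layer's K/V cache entries.
        return self.k_proj(hidden), self.v_proj(hidden)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Stand-in for full attention + MLP; returns updated hidden states.
        return hidden

def prefill_with_single_input_kv(layers, prompt_hidden, swift_layer: int):
    """Run the prompt through the first `swift_layer` layers only, then
    reuse that single hidden state to fill the KV cache of every
    remaining layer, skipping their prompt-time compute."""
    kv_cache = []
    h = prompt_hidden
    for i, layer in enumerate(layers):
        if i < swift_layer:
            kv_cache.append(layer.kv_from(h))   # normal per-layer KV
            h = layer(h)                        # full layer compute
        else:
            kv_cache.append(layer.kv_from(h))   # KV from the layer-`swift_layer` input
            # no forward pass: later-layer prefill compute is skipped
    return kv_cache

if __name__ == "__main__":
    d_model, n_layers = 64, 8
    layers = nn.ModuleList(DecoderLayer(d_model) for _ in range(n_layers))
    prompt = torch.randn(1, 128, d_model)       # (batch, prompt_len, d_model)
    cache = prefill_with_single_input_kv(layers, prompt, swift_layer=4)
    print(len(cache), cache[-1][0].shape)        # 8 layers of K/V, later ones from the layer-4 input
```

During decoding, generated tokens would still pass through all layers but attend to these pre-filled caches, which is why the savings show up mainly on long prompts.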
Importantly, Snowflake's benchmarks indicate that these optimizations cost minimal accuracy, around one point on average according to their blog post, so the performance gains come without significant compromises to output quality. Their tests with Llama 3.1 models on H100 GPUs show substantial throughput gains (up to 2x) and latency reductions, particularly for long-input scenarios.
(Just a cool new paper I thought might be of interest here.)
blog post: https://www.snowflake.com/en/engineering-blog/swiftkv-llm-compute-reduction/
full paper: https://arxiv.org/abs/2410.03960