About 10% speed up: tweak diagonal shuffle #4
I found it in the asm code. Sorry for having made noise. https://github.com/floodyberry/chacha-opt/blob/master/app/extensions/chacha/chacha_ssse3-64.inc#L586-L590
Thank you, this is very nice! The latency improvement is indeed real and significant, and it looks like a couple of ChaCha implementations did use this trick before, though I had never noticed it.
sneves/blake2-avx2#4 About 10% speed up: tweak diagonal shuffle
Novice question: Why is the data dependency on |
The original source for these optimizations is sneves/blake2-avx2#4. Libsodium committed them at jedisct1/libsodium@80206ad.
You have the row round
Because If you draw a DAG of all the computations involved here, this optimization reduces the maximum depth of the graph, assuming infinite parallelism is available. The longest path from input to output values in this graph is the critical path, and it determines the minimum possible latency of the computation (subject to other CPU parallelism constraints, and so on). And yes, it makes perfect sense to do the same with BLAKE2s and SSE. I did this on the WireGuard implementation a while ago.
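The critical-path argument above can be sketched with a toy latency model. This is only an illustration: each state row is reduced to the cycle at which it becomes ready, message words are assumed to be available in advance, and the latency constants (`ADD`, `XOR_ROT`, `SHUF`) and the helper names are hypothetical, not measurements or code from any of the implementations mentioned.

```python
# Toy latency model: each state row is represented only by the cycle at which
# it becomes ready. All latencies are illustrative assumptions, not measurements.
ADD = 1      # assumed latency of a vector add
XOR_ROT = 2  # assumed latency of an xor followed by a rotate
SHUF = 1     # assumed latency of one lane shuffle


def g(a, b, c, d):
    """Ready times of a, b, c, d after one G / quarter-round."""
    for _ in range(2):
        a = max(a, b) + ADD      # a = a + b
        d = max(d, a) + XOR_ROT  # d = (d ^ a) >>> r
        c = max(c, d) + ADD      # c = c + d
        b = max(b, c) + XOR_ROT  # b = (b ^ c) >>> r
    return a, b, c, d


def shuffle(times, rows):
    """Add the shuffle latency to the rows named in `rows` (e.g. "bcd")."""
    return tuple(t + SHUF if name in rows else t
                 for name, t in zip("abcd", times))


def critical_path(shuffled_rows, rounds=1):
    """Cycle at which the last row is ready after `rounds` full rounds."""
    t = (0, 0, 0, 0)
    for _ in range(rounds):
        t = g(*t)                      # column step
        t = shuffle(t, shuffled_rows)  # diagonalize
        t = g(*t)                      # diagonal step
        t = shuffle(t, shuffled_rows)  # undiagonalize
    return max(t)


if __name__ == "__main__":
    print("shuffle b, c, d:", critical_path("bcd", rounds=12))
    print("shuffle a, c, d:", critical_path("acd", rounds=12))
```

In this model `b` is the last row written by each G and the first one read by the next, so any shuffle applied to `b` extends the longest path. The rows `a`, `c` and `d` are finished earlier, which lets their shuffles overlap with the tail of G.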
oconnor663/blake2_simd@e26796e, contributed by Sean Gulley, implements the SSE4.1 version of this optimization for BLAKE2s.
Hi, I found this commit by mistake, but it caught my interest. You mentioned that this optimization can be used for the ChaCha algorithm, as it clearly involves the same computation. It would be cool if you could review my test ChaCha implementation. I will try to upload it as a repository this weekend; hopefully I will find some time to finish it.
* chacha20: Add a `backend::avx2::StateWord` helper union

  This removes a bunch of instructions for accessing the 128-bit lanes.

* chacha20: Rename backend state words to match RFC 7539

* chacha20: Optimise diagonalization in SSE2 and AVX2 backends

  The `b` state word is on the hot path, so we pivot the diagonalization to move the shuffles onto the other state words. See the code comment, or sneves/blake2-avx2#4 for additional details.
Late but maybe still useful... There are 3 main reasons why B is special:
Shuffling A instead of B addresses all these issues:
This applies to all BLAKE variants, Salsa and ChaCha, except that ChaCha and Salsa don't have message injection, so they suffer less latency to start with and benefit less from these changes. The bit rotation optimization issue is moot with AVX512 or the upcoming AVX10, thanks to the availability of the VPROR instructions (VPRORD/VPRORQ).
Hi, thank you for this awesome Blake2 implementation! While reading the source, I came up with an idea for an improvement.
This PR changes the diagonalization routines to shift `a`, `c` and `d` instead of `b`, `c` and `d`. Since the data dependency on `b` is critical, the change improves performance. I confirmed that the code passes all self-check tests.

On my Skylake laptop, Blake2b cpb (cycles per byte) improved from ~3.0 to ~2.7. I'm running Linux on VMware, so the measurements were unstable.

BTW, the tweak is applicable to other Blake2 / ChaCha implementations. I guess someone has already done the same thing, but I couldn't find any clue...
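As a minimal sketch of why the two diagonalizations are interchangeable, the scalar Python model below uses the ChaCha quarter round (which has no message words) on a state held as four rows of four 32-bit words. The function names are hypothetical and the lane rotations stand in for SIMD shuffles; this is not the code from the PR.

```python
MASK = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def quarter_round_rows(a, b, c, d):
    """ChaCha quarter round applied to all four columns at once."""
    for r1, r2 in ((16, 12), (8, 7)):
        a = [(x + y) & MASK for x, y in zip(a, b)]
        d = [rotl32(x ^ y, r1) for x, y in zip(d, a)]
        c = [(x + y) & MASK for x, y in zip(c, d)]
        b = [rotl32(x ^ y, r2) for x, y in zip(b, c)]
    return a, b, c, d


def rot(row, n):
    """Rotate a 4-word row left by n lanes (models a SIMD lane shuffle)."""
    n %= 4
    return row[n:] + row[:n]


def double_round_shuffle_bcd(a, b, c, d):
    """Conventional diagonalization: a stays put; b, c, d are shuffled."""
    a, b, c, d = quarter_round_rows(a, b, c, d)
    b, c, d = rot(b, 1), rot(c, 2), rot(d, 3)
    a, b, c, d = quarter_round_rows(a, b, c, d)
    b, c, d = rot(b, 3), rot(c, 2), rot(d, 1)
    return a, b, c, d


def double_round_shuffle_acd(a, b, c, d):
    """Tweaked diagonalization: b stays put; a, c, d are shuffled."""
    a, b, c, d = quarter_round_rows(a, b, c, d)
    a, c, d = rot(a, 3), rot(c, 1), rot(d, 2)
    a, b, c, d = quarter_round_rows(a, b, c, d)
    a, c, d = rot(a, 1), rot(c, 3), rot(d, 2)
    return a, b, c, d


def run(double_round, rounds=10):
    """Apply `rounds` double rounds to an arbitrary fixed test state."""
    state = [[(0x9E3779B9 * (4 * r + i + 1)) & MASK for i in range(4)]
             for r in range(4)]
    for _ in range(rounds):
        state = list(double_round(*state))
    return state


if __name__ == "__main__":
    print(run(double_round_shuffle_bcd) == run(double_round_shuffle_acd))
```

Both variants feed the same four diagonals to the quarter round, only in a different column order, and the inverse shuffles restore the original layout. For BLAKE2 the message words fed to each column would have to follow the same reordering, which this sketch does not model.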