
poly1305: AVX2 detection #97

Merged
merged 4 commits into master from poly1305/avx2-detection on Dec 9, 2020
Conversation

tarcieri
Member

@tarcieri tarcieri commented Dec 8, 2020

Automatically detects the availability of AVX2 based on CPUID, and falls back to the "soft" implementation if unavailable.
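The detection-plus-fallback pattern can be sketched as follows. This is a minimal stand-in, not the crate's actual API: the PR uses the cpuid-bool crate for the CPUID check, while this sketch substitutes std's is_x86_feature_detected! macro, and the Backend and select_backend names are hypothetical.

```rust
// Minimal sketch of CPUID-based backend selection. `Backend` and
// `select_backend` are illustrative names, not the crate's API.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Backend {
    Avx2,
    Soft,
}

pub fn select_backend() -> Backend {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Runtime CPUID check; the branch can be optimized away
        // entirely when AVX2 is enabled at compile time.
        if is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
    }
    Backend::Soft
}

fn main() {
    // On a CPU (or target architecture) without AVX2 this falls
    // back to the "soft" backend.
    println!("selected: {:?}", select_backend());
}
```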

@tarcieri tarcieri requested review from newpavlov and str4d on December 8, 2020 17:59
@codecov-io

codecov-io commented Dec 8, 2020

Codecov Report

Merging #97 (3c5e9db) into master (0c84ace) will increase coverage by 19.94%.
The diff coverage is 70.45%.


@@             Coverage Diff             @@
##           master      #97       +/-   ##
===========================================
+ Coverage   29.91%   49.86%   +19.94%     
===========================================
  Files          12       13        +1     
  Lines         976     1083      +107     
===========================================
+ Hits          292      540      +248     
+ Misses        684      543      -141     
Impacted Files Coverage Δ
poly1305/src/backend/soft.rs 0.00% <0.00%> (-93.51%) ⬇️
poly1305/src/autodetect.rs 38.46% <38.46%> (ø)
poly1305/src/backend/avx2/helpers.rs 65.61% <84.21%> (+65.61%) ⬆️
poly1305/src/backend/avx2.rs 92.59% <100.00%> (+92.59%) ⬆️
poly1305/src/lib.rs 88.88% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch 3 times, most recently from ae1e54c to d0a4784 on December 8, 2020 18:21
@tarcieri
Member Author

tarcieri commented Dec 8, 2020

Performance seems slightly reduced (maybe ~5% or so).

I'm guessing we should add #[target_feature(enable = "avx2")] annotations throughout the AVX2 code.
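The reason the annotation matters: without it, each intrinsic-using function is compiled for the baseline target, so the optimizer can't keep values in YMM registers across calls. A sketch of the pattern, with add_lanes as a hypothetical helper rather than anything from the crate:

```rust
// Sketch of the `target_feature` pattern: compile one function with
// AVX2 codegen enabled, and gate every call behind a CPUID check.
// `add_lanes` is an illustrative helper, not the crate's real code.

#[cfg(target_arch = "x86_64")]
mod avx2_backend {
    use std::arch::x86_64::*;

    // Compiled as if -Ctarget-feature=+avx2 were set, but only for
    // this function. It must be `unsafe` because calling it on a
    // CPU without AVX2 is undefined behavior; callers check CPUID.
    #[target_feature(enable = "avx2")]
    pub unsafe fn add_lanes(a: [i64; 4], b: [i64; 4]) -> [i64; 4] {
        let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
        let mut out = [0i64; 4];
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i,
                            _mm256_add_epi64(va, vb));
        out
    }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            let sum = unsafe { avx2_backend::add_lanes([1, 2, 3, 4], [10, 20, 30, 40]) };
            assert_eq!(sum, [11, 22, 33, 44]);
        }
    }
}
```

The catch, as the benchmarks below show, is that `#[target_feature]` also acts as an inlining barrier between differently-attributed functions, so sprinkling it everywhere isn't automatically a win.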

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch 3 times, most recently from c789536 to c7d61c7 on December 8, 2020 18:31
@tarcieri
Member Author

tarcieri commented Dec 8, 2020

So this branch definitely seems to be a little bit slower than master, and annoyingly that's the case even when compiling with -Ctarget-feature=+avx2.

Running RUSTFLAGS="-Ctarget-cpu=haswell -Ctarget-feature=+avx2" cargo +nightly bench on master I get:

test bench1_10    ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100   ... bench:          37 ns/iter (+/- 5) = 2702 MB/s
test bench3_1000  ... bench:         337 ns/iter (+/- 36) = 2967 MB/s
test bench3_10000 ... bench:       3,156 ns/iter (+/- 327) = 3168 MB/s

On this branch with RUSTFLAGS=-Ctarget-cpu=haswell:

test bench1_10    ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100   ... bench:          37 ns/iter (+/- 11) = 2702 MB/s
test bench3_1000  ... bench:         358 ns/iter (+/- 38) = 2793 MB/s
test bench3_10000 ... bench:       3,386 ns/iter (+/- 330) = 2953 MB/s

Curiously, those are approximately the same numbers I get with RUSTFLAGS="-Ctarget-cpu=haswell -Ctarget-feature=+avx2"

test bench1_10    ... bench:           8 ns/iter (+/- 1) = 1250 MB/s
test bench2_100   ... bench:          39 ns/iter (+/- 6) = 2564 MB/s
test bench3_1000  ... bench:         360 ns/iter (+/- 47) = 2777 MB/s
test bench3_10000 ... bench:       3,427 ns/iter (+/- 467) = 2918 MB/s

I tried liberally applying #[target_feature(enable = "avx2")] and #[inline(always)], and if anything it made things worse:

test bench1_10    ... bench:           9 ns/iter (+/- 1) = 1111 MB/s
test bench2_100   ... bench:          41 ns/iter (+/- 3) = 2439 MB/s
test bench3_1000  ... bench:         395 ns/iter (+/- 37) = 2531 MB/s
test bench3_10000 ... bench:       3,770 ns/iter (+/- 427) = 2652 MB/s

So it appears to be something tied to the newly added code, rather than compiling without target_feature=+avx2 in RUSTFLAGS. I'll try poking at it and see if I can find a solution.

It might be related to the use of an enum, but I didn't observe any performance difference between an enum and a union before, so that could be a red herring.

I can try opening a second PR that uses ManuallyDrop to use a union instead of an enum and see if that impacts performance, but ManuallyDrop fields in unions won't be stable until Rust 1.49.
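The enum-vs-union tradeoff being weighed here: an enum stores a discriminant that drop glue and every match must branch on, while a union overlaps both states and relies on an out-of-band token to say which field is live. A sketch, with placeholder SoftState/Avx2State fields that are not the crate's real layout (since these placeholders are Copy, the ManuallyDrop wrapper is shown only to illustrate the pattern needed for non-Copy fields on Rust 1.49+):

```rust
use std::mem::ManuallyDrop;

// Placeholder state types for illustration; the real backends hold
// Poly1305 key/accumulator state.
#[derive(Clone, Copy)]
struct SoftState([u64; 5]);

#[derive(Clone, Copy)]
struct Avx2State([u64; 4]);

// No discriminant is stored in the value itself; a detection token
// set once at startup records which field is live. Before Rust
// 1.49, a non-`Copy` field had to be wrapped in `ManuallyDrop` to
// be allowed in a union at all.
union State {
    soft: ManuallyDrop<SoftState>,
    avx2: ManuallyDrop<Avx2State>,
}

fn main() {
    let st = State { soft: ManuallyDrop::new(SoftState([0; 5])) };
    // Reading a union field is unsafe: the caller must know which
    // field was last written.
    let inner = unsafe { st.soft };
    assert_eq!(inner.0, [0u64; 5]);
}
```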

@tarcieri
Member Author

tarcieri commented Dec 8, 2020

After switching to a union + ManuallyDrop + cpuid-bool, the performance is what's expected from target-feature=+avx2 (since that configuration allows optimizing away the autodetection code):

test bench1_10    ... bench:           8 ns/iter (+/- 1) = 1250 MB/s
test bench2_100   ... bench:          37 ns/iter (+/- 5) = 2702 MB/s
test bench3_1000  ... bench:         333 ns/iter (+/- 37) = 3003 MB/s
test bench3_10000 ... bench:       3,218 ns/iter (+/- 431) = 3107 MB/s

But also, we get the same performance with autodetection enabled!

test bench1_10    ... bench:           7 ns/iter (+/- 3) = 1428 MB/s
test bench2_100   ... bench:          37 ns/iter (+/- 3) = 2702 MB/s
test bench3_1000  ... bench:         340 ns/iter (+/- 35) = 2941 MB/s
test bench3_10000 ... bench:       3,242 ns/iter (+/- 496) = 3084 MB/s

So it seems the enum is slow. Given that, I can either investigate making avx2::State (at least temporarily) Copy, or we can hold off until Rust 1.49 is released with support for ManuallyDrop fields in unions.

@tarcieri tarcieri closed this Dec 8, 2020
@tarcieri tarcieri reopened this Dec 8, 2020
@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch 2 times, most recently from 6b3f78c to 87a8708 on December 8, 2020 21:20
@tarcieri
Member Author

tarcieri commented Dec 8, 2020

Slapping Copy on all of the State structs turned out to be so simple I don't know what I was even worried about! There's no need to use ManuallyDrop in lieu of Copy here, and since they're internal types it doesn't matter that they're Copy.
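Why Copy helps, sketched with illustrative types (not the crate's actual definitions): once every variant is Copy, the enum itself can be Copy, it has no drop glue, and the compiler never has to branch on the discriminant just to destroy or move a value.

```rust
// Illustrative types only. SIMD types like `__m256i` are already
// `Copy`, so wrappers around them can simply derive it.
#[derive(Clone, Copy)]
struct SoftState([u64; 5]);

#[derive(Clone, Copy)]
struct Avx2State([u64; 8]);

// With all variants `Copy`, the enum is `Copy` too: no drop glue,
// no hidden discriminant checks on destruction.
#[derive(Clone, Copy)]
enum State {
    Soft(SoftState),
    Avx2(Avx2State),
}

fn main() {
    let a = State::Soft(SoftState([1; 5]));
    let b = a; // a copy, not a move: `a` remains usable
    match (a, b) {
        (State::Soft(x), State::Soft(y)) => assert_eq!(x.0, y.0),
        _ => unreachable!(),
    }
}
```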

Performance with autodetection enabled now matches what it previously was with explicit target_feature:

test bench1_10    ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100   ... bench:          37 ns/iter (+/- 5) = 2702 MB/s
test bench3_1000  ... bench:         343 ns/iter (+/- 42) = 2915 MB/s
test bench3_10000 ... bench:       3,206 ns/iter (+/- 405) = 3119 MB/s

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from 87a8708 to c160d1c on December 8, 2020 21:23
@tarcieri
Member Author

tarcieri commented Dec 8, 2020

It seems like the target_feature annotations might be necessary after all.

I was previously compiling with RUSTFLAGS="-Ctarget-cpu=haswell", but I think that may be activating the avx2 target feature.

If I compile without that, I think LLVM may be emulating AVX2 or something?(!)

$ cargo +nightly bench
test bench1_10    ... bench:         168 ns/iter (+/- 67) = 59 MB/s
test bench2_100   ... bench:       1,207 ns/iter (+/- 153) = 82 MB/s
test bench3_1000  ... bench:      11,082 ns/iter (+/- 1,811) = 90 MB/s
test bench3_10000 ... bench:     107,971 ns/iter (+/- 17,203) = 92 MB/s

Note that this is significantly slower than the "soft" backend:

$ cargo +nightly bench --features force-soft
test bench1_10    ... bench:          16 ns/iter (+/- 2) = 625 MB/s
test bench2_100   ... bench:         109 ns/iter (+/- 14) = 917 MB/s
test bench3_1000  ... bench:         909 ns/iter (+/- 144) = 1100 MB/s
test bench3_10000 ... bench:       8,946 ns/iter (+/- 1,284) = 1117 MB/s

So it seems the target_feature(enable = "avx2") annotations may be necessary after all?

@tarcieri tarcieri closed this Dec 8, 2020
@tarcieri tarcieri reopened this Dec 8, 2020
@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from 713da83 to 8dee7e9 on December 8, 2020 21:53
@tarcieri
Member Author

tarcieri commented Dec 8, 2020

Wow, that's wild.

I tried applying target_feature(enable = "avx2") or inline(always) throughout the backend in 8dee7e9, and that looked like it solved the problem locally: the tests were passing and I was getting the same sort of benchmark numbers I was with the target_feature explicitly enabled:

$ cargo +nightly bench
test bench1_10    ... bench:           8 ns/iter (+/- 3) = 1250 MB/s
test bench2_100   ... bench:          39 ns/iter (+/- 4) = 2564 MB/s
test bench3_1000  ... bench:         347 ns/iter (+/- 51) = 2881 MB/s
test bench3_10000 ... bench:       3,391 ns/iter (+/- 517) = 2948 MB/s

However, it seems to be failing the tests in CI!

https://github.com/RustCrypto/universal-hashes/pull/97/checks?check_run_id=1520303853#step:6:34

---- fuzz::crash_0 stdout ----
thread 'fuzz::crash_0' panicked at 'assertion failed: `(left == right)`
  left: `(1, [3, 6, 145, 53, 167, 0, 109, 39, 232, 34, 230, 215, 47, 111, 139, 154])`,
 right: `(1, [100, 63, 40, 236, 209, 133, 125, 79, 131, 206, 166, 142, 102, 9, 105, 59])`', poly1305/src/fuzz.rs:31:9

Whatever broke it is something I changed in 8dee7e9, and yet it works locally for me.

Automatically detects the availability of AVX2 based on CPUID, and
falls back to the "soft" implementation if unavailable.
@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch 5 times, most recently from dee84bf to c4a31a7 on December 8, 2020 23:14
@tarcieri
Member Author

tarcieri commented Dec 9, 2020

I pushed up a debug commit (db6ded4) to test whether or not cpuid-bool is detecting AVX2 inside the GitHub Actions worker. It is:

https://github.com/RustCrypto/universal-hashes/pull/97/checks?check_run_id=1525715319

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from db6ded4 to 976c4f5 on December 9, 2020 17:54
@newpavlov
Member

Well, currently I can only propose inserting a number of println!s to track the computation state and comparing the output with local results... We already had an LLVM bug which caused miscompilation of the aes crate, so the probability of a compiler bug is not zero.

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

@newpavlov yep, just pushed up a commit with some debugging: a160473

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch 6 times, most recently from bf21b8e to 5de11d4 on December 9, 2020 18:25
@tarcieri
Member Author

tarcieri commented Dec 9, 2020

Okay, I finally have it reproduced locally.

It's passing without --release and failing with --release. I wasn't including --release locally so it was passing for me.

If I add RUSTFLAGS="-Ctarget-cpu=haswell", it passes with --release.

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

This is almost certainly a rustc bug.

I pushed up 7485010 which includes ample information about my investigation.

Among other things, I was able to insert a dbg! statement in a location that makes the tests succeed (heisenbug!)

I also managed to narrow down where I first observed values being miscomputed.

Will try to report this upstream.

@newpavlov
Member

As a temporary hack, how about moving part of the affected code into a separate function marked with #[inline(never)]?
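The shape of that workaround, sketched on a made-up computation (mul_mod_p and process are illustrative names, and the arithmetic is a placeholder for the real per-block Poly1305 math): hoist the closure body into a named function the optimizer can't inline back into the caller, so the suspect code is compiled in isolation.

```rust
// `#[inline(never)]` forces a real call, acting as an optimization
// barrier around the code suspected of being miscompiled.
#[inline(never)]
fn mul_mod_p(acc: u64, m: u64) -> u64 {
    // Stand-in for the real per-block computation.
    acc.wrapping_mul(m).wrapping_add(1)
}

fn process(blocks: &[u64]) -> u64 {
    // Before: the body lived inline in the lambda passed to `fold`.
    // After: the same fold, but calling the named, never-inlined fn.
    blocks.iter().fold(0u64, |acc, &m| mul_mod_p(acc, m))
}

fn main() {
    assert_eq!(process(&[]), 0);
    assert_eq!(process(&[0]), 1);
    assert_eq!(process(&[1, 1]), 2);
}
```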

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

Yeah, I can try extracting the lambda into a named function and playing with inlining options on it.

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from 7485010 to eeab81a on December 9, 2020 20:39
@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from eeab81a to b5abdc1 on December 9, 2020 20:42
@tarcieri
Member Author

tarcieri commented Dec 9, 2020

Alright, managed to work around the problem 🎉

The miscompilation was occurring inside of some lambdas.

I extracted the lambdas into named functions in b5abdc1, and the tests are now passing.

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

Benchmarks look great too:

$ cargo +nightly bench
test bench1_10    ... bench:           7 ns/iter (+/- 1) = 1428 MB/s
test bench2_100   ... bench:          31 ns/iter (+/- 3) = 3225 MB/s
test bench3_1000  ... bench:         281 ns/iter (+/- 62) = 3558 MB/s
test bench3_10000 ... bench:       2,724 ns/iter (+/- 216) = 3671 MB/s

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

Let me try to address the zeroize comments using the approach @newpavlov suggested and then I'll merge

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

@newpavlov double check me on 3c5e9db, but I think that's what you had in mind for zeroization

@newpavlov
Member

Yes, it's what I had in mind. It could be a bit more compact to cast into an array:

const SIZE: usize = core::mem::size_of::<State>();
let inner_array = unsafe {
    &mut *(self as *mut State as *mut [u8; SIZE])
};

BTW you don't have to use self.inner; since the token is a ZST, you can zeroize the State type directly.
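A self-contained version of that byte-cast pattern might look like the following. The State here is a stand-in with a made-up limbs field, not the crate's actual layout, and volatile writes plus a compiler fence substitute for the zeroize crate's machinery:

```rust
// Hedged sketch of the byte-cast zeroization pattern from the
// comment above, on a stand-in `State` type.
struct State {
    limbs: [u64; 4], // hypothetical field for illustration
}

impl State {
    fn zeroize(&mut self) {
        const SIZE: usize = std::mem::size_of::<State>();
        // View the whole state as raw bytes and wipe them.
        let bytes = unsafe { &mut *(self as *mut State as *mut [u8; SIZE]) };
        for b in bytes.iter_mut() {
            // Volatile stores keep the compiler from eliding the
            // wipe as a dead write.
            unsafe { std::ptr::write_volatile(b, 0) };
        }
        // Prevent the wipe from being reordered past later code.
        std::sync::atomic::compiler_fence(std::sync::atomic::Ordering::SeqCst);
    }
}

fn main() {
    let mut st = State { limbs: [0xdead_beef; 4] };
    st.zeroize();
    assert_eq!(st.limbs, [0u64; 4]);
}
```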

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from 3c5e9db to 20dd65a on December 9, 2020 21:37
@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from 20dd65a to 442c32c on December 9, 2020 21:38
@tarcieri
Member Author

tarcieri commented Dec 9, 2020

@newpavlov cool, just updated it with your suggestion

@tarcieri tarcieri merged commit cd5508c into master Dec 9, 2020
@tarcieri tarcieri deleted the poly1305/avx2-detection branch December 9, 2020 21:55