
poly1305: AVX2 detection #97

Merged
merged 4 commits into master from poly1305/avx2-detection on Dec 9, 2020
Conversation

tarcieri
Member

@tarcieri tarcieri commented Dec 8, 2020

Automatically detects the availability of AVX2 based on CPUID, and falls back to the "soft" implementation if unavailable.
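The detection-plus-fallback pattern can be sketched as follows. This is a minimal stand-in, not the crate's actual API: the PR uses the cpuid-bool crate for the CPUID check, while this sketch substitutes std's is_x86_feature_detected! macro, and the Backend and select_backend names are hypothetical.

```rust
// Minimal sketch of CPUID-based backend selection. `Backend` and
// `select_backend` are illustrative names, not the crate's API.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Backend {
    Avx2,
    Soft,
}

pub fn select_backend() -> Backend {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Runtime CPUID check; the branch can be optimized away
        // entirely when AVX2 is enabled at compile time.
        if is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
    }
    Backend::Soft
}

fn main() {
    // On a CPU (or target architecture) without AVX2 this falls
    // back to the "soft" backend.
    println!("selected: {:?}", select_backend());
}
```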

@tarcieri tarcieri requested review from newpavlov and str4d on December 8, 2020 17:59
@codecov-io

codecov-io commented Dec 8, 2020

Codecov Report

Merging #97 (3c5e9db) into master (0c84ace) will increase coverage by 19.94%.
The diff coverage is 70.45%.


@@             Coverage Diff             @@
##           master      #97       +/-   ##
===========================================
+ Coverage   29.91%   49.86%   +19.94%     
===========================================
  Files          12       13        +1     
  Lines         976     1083      +107     
===========================================
+ Hits          292      540      +248     
+ Misses        684      543      -141     
Impacted Files Coverage Δ
poly1305/src/backend/soft.rs 0.00% <0.00%> (-93.51%) ⬇️
poly1305/src/autodetect.rs 38.46% <38.46%> (ø)
poly1305/src/backend/avx2/helpers.rs 65.61% <84.21%> (+65.61%) ⬆️
poly1305/src/backend/avx2.rs 92.59% <100.00%> (+92.59%) ⬆️
poly1305/src/lib.rs 88.88% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch 3 times, most recently from ae1e54c to d0a4784 on December 8, 2020 18:21
@tarcieri
Member Author

tarcieri commented Dec 8, 2020

Performance seems slightly reduced (maybe ~5% or so).

I'm guessing we should add #[target_feature(enable = "avx2")] annotations throughout the AVX2 code.
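The reason the annotation matters: without it, each intrinsic-using function is compiled for the baseline target, so the optimizer can't keep values in YMM registers across calls. A sketch of the pattern, with add_lanes as a hypothetical helper rather than anything from the crate:

```rust
// Sketch of the `target_feature` pattern: compile one function with
// AVX2 codegen enabled, and gate every call behind a CPUID check.
// `add_lanes` is an illustrative helper, not the crate's real code.

#[cfg(target_arch = "x86_64")]
mod avx2_backend {
    use std::arch::x86_64::*;

    // Compiled as if -Ctarget-feature=+avx2 were set, but only for
    // this function. It must be `unsafe` because calling it on a
    // CPU without AVX2 is undefined behavior; callers check CPUID.
    #[target_feature(enable = "avx2")]
    pub unsafe fn add_lanes(a: [i64; 4], b: [i64; 4]) -> [i64; 4] {
        let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
        let mut out = [0i64; 4];
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i,
                            _mm256_add_epi64(va, vb));
        out
    }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            let sum = unsafe { avx2_backend::add_lanes([1, 2, 3, 4], [10, 20, 30, 40]) };
            assert_eq!(sum, [11, 22, 33, 44]);
        }
    }
}
```

The catch, as the benchmarks below show, is that `#[target_feature]` also acts as an inlining barrier between differently-attributed functions, so sprinkling it everywhere isn't automatically a win.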

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch 3 times, most recently from c789536 to c7d61c7 on December 8, 2020 18:31
@tarcieri
Member Author

tarcieri commented Dec 8, 2020

So this branch definitely seems to be a little bit slower than master, and annoyingly that's the case even when compiling with -Ctarget-feature=+avx2.

Running RUSTFLAGS="-Ctarget-cpu=haswell -Ctarget-feature=+avx2" cargo +nightly bench on master I get:

test bench1_10    ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100   ... bench:          37 ns/iter (+/- 5) = 2702 MB/s
test bench3_1000  ... bench:         337 ns/iter (+/- 36) = 2967 MB/s
test bench3_10000 ... bench:       3,156 ns/iter (+/- 327) = 3168 MB/s

On this branch with RUSTFLAGS=-Ctarget-cpu=haswell:

test bench1_10    ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100   ... bench:          37 ns/iter (+/- 11) = 2702 MB/s
test bench3_1000  ... bench:         358 ns/iter (+/- 38) = 2793 MB/s
test bench3_10000 ... bench:       3,386 ns/iter (+/- 330) = 2953 MB/s

Curiously, those are approximately the same numbers I get with RUSTFLAGS="-Ctarget-cpu=haswell -Ctarget-feature=+avx2"

test bench1_10    ... bench:           8 ns/iter (+/- 1) = 1250 MB/s
test bench2_100   ... bench:          39 ns/iter (+/- 6) = 2564 MB/s
test bench3_1000  ... bench:         360 ns/iter (+/- 47) = 2777 MB/s
test bench3_10000 ... bench:       3,427 ns/iter (+/- 467) = 2918 MB/s

I tried liberally applying #[target_feature(enable = "avx2")] and #[inline(always)], and if anything it made things worse:

test bench1_10    ... bench:           9 ns/iter (+/- 1) = 1111 MB/s
test bench2_100   ... bench:          41 ns/iter (+/- 3) = 2439 MB/s
test bench3_1000  ... bench:         395 ns/iter (+/- 37) = 2531 MB/s
test bench3_10000 ... bench:       3,770 ns/iter (+/- 427) = 2652 MB/s

So it appears to be something tied to the newly added code, rather than compiling without target_feature=+avx2 in RUSTFLAGS. I'll try poking at it and see if I can find a solution.

It might be related to the use of an enum, but I didn't observe any performance difference between an enum and a union before, so that could be a red herring.

I can try opening a second PR that uses ManuallyDrop to use a union instead of an enum and see if that impacts performance, but ManuallyDrop fields in unions won't be stable until Rust 1.49.
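The enum-vs-union tradeoff being weighed here: an enum stores a discriminant that drop glue and every match must branch on, while a union overlaps both states and relies on an out-of-band token to say which field is live. A sketch, with placeholder SoftState/Avx2State fields that are not the crate's real layout (since these placeholders are Copy, the ManuallyDrop wrapper is shown only to illustrate the pattern needed for non-Copy fields on Rust 1.49+):

```rust
use std::mem::ManuallyDrop;

// Placeholder state types for illustration; the real backends hold
// Poly1305 key/accumulator state.
#[derive(Clone, Copy)]
struct SoftState([u64; 5]);

#[derive(Clone, Copy)]
struct Avx2State([u64; 4]);

// No discriminant is stored in the value itself; a detection token
// set once at startup records which field is live. Before Rust
// 1.49, a non-`Copy` field had to be wrapped in `ManuallyDrop` to
// be allowed in a union at all.
union State {
    soft: ManuallyDrop<SoftState>,
    avx2: ManuallyDrop<Avx2State>,
}

fn main() {
    let st = State { soft: ManuallyDrop::new(SoftState([0; 5])) };
    // Reading a union field is unsafe: the caller must know which
    // field was last written.
    let inner = unsafe { st.soft };
    assert_eq!(inner.0, [0u64; 5]);
}
```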

@tarcieri
Member Author

tarcieri commented Dec 8, 2020

After switching to a union + ManuallyDrop + cpuid-bool, the performance is what's expected from target-feature=+avx2 (since that configuration allows optimizing away the autodetection code):

test bench1_10    ... bench:           8 ns/iter (+/- 1) = 1250 MB/s
test bench2_100   ... bench:          37 ns/iter (+/- 5) = 2702 MB/s
test bench3_1000  ... bench:         333 ns/iter (+/- 37) = 3003 MB/s
test bench3_10000 ... bench:       3,218 ns/iter (+/- 431) = 3107 MB/s

But also, we get the same performance with autodetection enabled!

test bench1_10    ... bench:           7 ns/iter (+/- 3) = 1428 MB/s
test bench2_100   ... bench:          37 ns/iter (+/- 3) = 2702 MB/s
test bench3_1000  ... bench:         340 ns/iter (+/- 35) = 2941 MB/s
test bench3_10000 ... bench:       3,242 ns/iter (+/- 496) = 3084 MB/s

So it seems the enum is slow. Given that, I can either investigate making avx2::State (at least temporarily) Copy, or we can hold off until Rust 1.49 is released with support for ManuallyDrop fields in unions.

@tarcieri tarcieri closed this Dec 8, 2020
@tarcieri tarcieri reopened this Dec 8, 2020
@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch 2 times, most recently from 6b3f78c to 87a8708 on December 8, 2020 21:20
@tarcieri
Member Author

tarcieri commented Dec 8, 2020

Slapping Copy on all of the State structs turned out to be so simple I don't know what I was even worried about! There's no need to use ManuallyDrop in lieu of Copy here, and since they're internal types it doesn't matter that they're Copy.
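Why Copy helps, sketched with illustrative types (not the crate's actual definitions): once every variant is Copy, the enum itself can be Copy, it has no drop glue, and the compiler never has to branch on the discriminant just to destroy or move a value.

```rust
// Illustrative types only. SIMD types like `__m256i` are already
// `Copy`, so wrappers around them can simply derive it.
#[derive(Clone, Copy)]
struct SoftState([u64; 5]);

#[derive(Clone, Copy)]
struct Avx2State([u64; 8]);

// With all variants `Copy`, the enum is `Copy` too: no drop glue,
// no hidden discriminant checks on destruction.
#[derive(Clone, Copy)]
enum State {
    Soft(SoftState),
    Avx2(Avx2State),
}

fn main() {
    let a = State::Soft(SoftState([1; 5]));
    let b = a; // a copy, not a move: `a` remains usable
    match (a, b) {
        (State::Soft(x), State::Soft(y)) => assert_eq!(x.0, y.0),
        _ => unreachable!(),
    }
}
```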

Performance with autodetection enabled now matches what it previously was with explicit target_feature:

test bench1_10    ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100   ... bench:          37 ns/iter (+/- 5) = 2702 MB/s
test bench3_1000  ... bench:         343 ns/iter (+/- 42) = 2915 MB/s
test bench3_10000 ... bench:       3,206 ns/iter (+/- 405) = 3119 MB/s

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from 87a8708 to c160d1c on December 8, 2020 21:23
@tarcieri
Member Author

tarcieri commented Dec 8, 2020

It seems like the target_feature annotations might be necessary after all.

I was previously compiling with RUSTFLAGS="-Ctarget-cpu=haswell", but I think that may be activating the avx2 target feature.

If I compile without that, I think LLVM may be emulating AVX2 or something?(!)

$ cargo +nightly bench
test bench1_10    ... bench:         168 ns/iter (+/- 67) = 59 MB/s
test bench2_100   ... bench:       1,207 ns/iter (+/- 153) = 82 MB/s
test bench3_1000  ... bench:      11,082 ns/iter (+/- 1,811) = 90 MB/s
test bench3_10000 ... bench:     107,971 ns/iter (+/- 17,203) = 92 MB/s

Note that this is significantly slower than the "soft" backend:

$ cargo +nightly bench --features force-soft
test bench1_10    ... bench:          16 ns/iter (+/- 2) = 625 MB/s
test bench2_100   ... bench:         109 ns/iter (+/- 14) = 917 MB/s
test bench3_1000  ... bench:         909 ns/iter (+/- 144) = 1100 MB/s
test bench3_10000 ... bench:       8,946 ns/iter (+/- 1,284) = 1117 MB/s

So it seems the target_feature(enable = "avx2") annotations may be necessary after all?

@tarcieri tarcieri closed this Dec 8, 2020
@tarcieri tarcieri reopened this Dec 8, 2020
@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from 713da83 to 8dee7e9 on December 8, 2020 21:53
@tarcieri
Member Author

tarcieri commented Dec 8, 2020

Wow, that's wild.

I tried applying target_feature(enable = "avx2") or inline(always) throughout the backend in 8dee7e9, and that looked like it solved the problem locally: the tests were passing and I was getting the same sort of benchmark numbers I was with the target_feature explicitly enabled:

$ cargo +nightly bench
test bench1_10    ... bench:           8 ns/iter (+/- 3) = 1250 MB/s
test bench2_100   ... bench:          39 ns/iter (+/- 4) = 2564 MB/s
test bench3_1000  ... bench:         347 ns/iter (+/- 51) = 2881 MB/s
test bench3_10000 ... bench:       3,391 ns/iter (+/- 517) = 2948 MB/s

However, it seems to be failing the tests in CI!

https://github.com/RustCrypto/universal-hashes/pull/97/checks?check_run_id=1520303853#step:6:34

---- fuzz::crash_0 stdout ----
thread 'fuzz::crash_0' panicked at 'assertion failed: `(left == right)`
  left: `(1, [3, 6, 145, 53, 167, 0, 109, 39, 232, 34, 230, 215, 47, 111, 139, 154])`,
 right: `(1, [100, 63, 40, 236, 209, 133, 125, 79, 131, 206, 166, 142, 102, 9, 105, 59])`', poly1305/src/fuzz.rs:31:9

Whatever broke it is something I changed in 8dee7e9, and yet it works locally for me.

Automatically detects the availability of AVX2 based on CPUID, and
falls back to the "soft" implementation if unavailable.
@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch 5 times, most recently from dee84bf to c4a31a7 on December 8, 2020 23:14
@tarcieri
Member Author

tarcieri commented Dec 9, 2020

I pushed up a debug commit (db6ded4) to test whether or not cpuid-bool is detecting AVX2 inside the GitHub Actions worker. It is:

https://github.com/RustCrypto/universal-hashes/pull/97/checks?check_run_id=1525715319

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from db6ded4 to 976c4f5 on December 9, 2020 17:54
@newpavlov
Member

Well, currently I can only propose inserting a number of println!s to track the computation state and comparing the output with local results... We already had an LLVM bug which caused miscompilation of the aes crate, so the probability of a compiler bug is not zero.

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

@newpavlov yep, just pushed up a commit with some debugging: a160473

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch 6 times, most recently from bf21b8e to 5de11d4 on December 9, 2020 18:25
@tarcieri
Member Author

tarcieri commented Dec 9, 2020

Okay, I finally have it reproduced locally.

It's passing without --release and failing with --release. I wasn't including --release locally so it was passing for me.

If I add RUSTFLAGS="-Ctarget-cpu=haswell", it passes with --release.

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

This is almost certainly a rustc bug.

I pushed up 7485010 which includes ample information about my investigation.

Among other things, I was able to insert a dbg! statement in a location that makes the tests succeed (heisenbug!)

I also managed to narrow down where I first observed values being miscomputed.

Will try to report this upstream.

@newpavlov
Member

As a temporary hack, how about moving part of the affected code into a separate function marked with #[inline(never)]?
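The shape of that workaround, sketched on a made-up computation (mul_mod_p and process are illustrative names, and the arithmetic is a placeholder for the real per-block Poly1305 math): hoist the closure body into a named function the optimizer can't inline back into the caller, so the suspect code is compiled in isolation.

```rust
// `#[inline(never)]` forces a real call, acting as an optimization
// barrier around the code suspected of being miscompiled.
#[inline(never)]
fn mul_mod_p(acc: u64, m: u64) -> u64 {
    // Stand-in for the real per-block computation.
    acc.wrapping_mul(m).wrapping_add(1)
}

fn process(blocks: &[u64]) -> u64 {
    // Before: the body lived inline in the lambda passed to `fold`.
    // After: the same fold, but calling the named, never-inlined fn.
    blocks.iter().fold(0u64, |acc, &m| mul_mod_p(acc, m))
}

fn main() {
    assert_eq!(process(&[]), 0);
    assert_eq!(process(&[0]), 1);
    assert_eq!(process(&[1, 1]), 2);
}
```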

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

Yeah, I can try extracting the lambda into a named function and playing with inlining options on it.

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from 7485010 to eeab81a on December 9, 2020 20:39
@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from eeab81a to b5abdc1 on December 9, 2020 20:42
@tarcieri
Member Author

tarcieri commented Dec 9, 2020

Alright, managed to work around the problem 🎉

The miscompilation was occurring inside of some lambdas.

I extracted the lambdas into named functions in b5abdc1, and the tests are now passing.

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

Benchmarks look great too:

$ cargo +nightly bench
test bench1_10    ... bench:           7 ns/iter (+/- 1) = 1428 MB/s
test bench2_100   ... bench:          31 ns/iter (+/- 3) = 3225 MB/s
test bench3_1000  ... bench:         281 ns/iter (+/- 62) = 3558 MB/s
test bench3_10000 ... bench:       2,724 ns/iter (+/- 216) = 3671 MB/s

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

Let me try to address the zeroize comments using the approach @newpavlov suggested and then I'll merge

@tarcieri
Member Author

tarcieri commented Dec 9, 2020

@newpavlov double check me on 3c5e9db, but I think that's what you had in mind for zeroization

@newpavlov
Member

Yes, it's what I had in mind. It could be a bit more compact to cast into an array:

const SIZE: usize = core::mem::size_of::<State>();
let inner_array = unsafe {
    &mut *(self as *mut State as *mut [u8; SIZE])
};

BTW you don't have to use self.inner; since the token is a ZST, you can zeroize the State type directly.
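A self-contained version of that byte-cast pattern might look like the following. The State here is a stand-in with a made-up limbs field, not the crate's actual layout, and volatile writes plus a compiler fence substitute for the zeroize crate's machinery:

```rust
// Hedged sketch of the byte-cast zeroization pattern from the
// comment above, on a stand-in `State` type.
struct State {
    limbs: [u64; 4], // hypothetical field for illustration
}

impl State {
    fn zeroize(&mut self) {
        const SIZE: usize = std::mem::size_of::<State>();
        // View the whole state as raw bytes and wipe them.
        let bytes = unsafe { &mut *(self as *mut State as *mut [u8; SIZE]) };
        for b in bytes.iter_mut() {
            // Volatile stores keep the compiler from eliding the
            // wipe as a dead write.
            unsafe { std::ptr::write_volatile(b, 0) };
        }
        // Prevent the wipe from being reordered past later code.
        std::sync::atomic::compiler_fence(std::sync::atomic::Ordering::SeqCst);
    }
}

fn main() {
    let mut st = State { limbs: [0xdead_beef; 4] };
    st.zeroize();
    assert_eq!(st.limbs, [0u64; 4]);
}
```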

@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from 3c5e9db to 20dd65a on December 9, 2020 21:37
@tarcieri tarcieri force-pushed the poly1305/avx2-detection branch from 20dd65a to 442c32c on December 9, 2020 21:38
@tarcieri
Member Author

tarcieri commented Dec 9, 2020

@newpavlov cool, just updated it with your suggestion

@tarcieri tarcieri merged commit cd5508c into master Dec 9, 2020
@tarcieri tarcieri deleted the poly1305/avx2-detection branch December 9, 2020 21:55