Description
Started on an update to https://github.com/Marwes/combine after it had been dormant for a while. When I ran the benchmarks to check that my changes hadn't regressed performance, I noticed that performance had regressed by ~28% (~116% with incremental compilation!) since the last time I ran the benchmarks (somewhere around September 2016).
I ran the benchmarks against an old version of the library so that it could be compiled with older rustcs, but the regression is the same in the master branch as well.
cargo bench --bench http
against https://github.com/Marwes/combine/tree/v2.3.2
cargo-0.18.0-nightly (a73a665 2017-02-14)
test http_requests_large ... bench: 439,961 ns/iter (+/- 30,684)
test http_requests_small ... bench: 87,508 ns/iter (+/- 5,173)
rustc 1.19.0-nightly (554c685 2017-06-14)
test http_requests_large ... bench: 475,989 ns/iter (+/- 10,477)
test http_requests_small ... bench: 95,175 ns/iter (+/- 23,751)
rustc 1.22.0-nightly (3681220 2017-09-06)
test http_requests_large ... bench: 494,088 ns/iter (+/- 27,462)
test http_requests_small ... bench: 102,798 ns/iter (+/- 67,446)
rustc 1.25.0-nightly (6828cf9 2018-01-06) (CARGO_INCREMENTAL=0)
test http_requests_large ... bench: 551,065 ns/iter (+/- 420,621)
test http_requests_small ... bench: 112,375 ns/iter (+/- 2,098)
rustc 1.25.0-nightly (6828cf9 2018-01-06) (CARGO_INCREMENTAL=1)
test http_requests_large ... bench: 1,001,847 ns/iter (+/- 40,639)
test http_requests_small ... bench: 188,091 ns/iter (+/- 1,958)
I'd like to bisect this further, but the two tools I found for this do not appear to work in this case. Is there any other tool that can be used?
https://github.com/kamalmarhubi/rust-bisect (Outdated)
https://github.com/Mark-Simulacrum/bisect-rust/tree/master/src (Only last 90 days)
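In the meantime, a date-based bisection can be scripted by hand: binary-search the ordered list of nightly dates, installing each candidate with rustup and re-running the benchmark. A rough sketch of the search itself (`regressed_on` is a hypothetical stand-in for "install the nightly and compare the `cargo bench` numbers against the baseline"; here it is mocked):

```rust
// Binary search for the first nightly on which the regression is present.
// Assumes the nightlies are sorted by date and the regression, once
// introduced, stays present in every later nightly.
fn first_bad<'a>(
    nightlies: &'a [&'a str],
    regressed_on: impl Fn(&str) -> bool,
) -> Option<&'a str> {
    let (mut lo, mut hi) = (0usize, nightlies.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if regressed_on(nightlies[mid]) {
            hi = mid; // regression present: first bad nightly is at or before mid
        } else {
            lo = mid + 1; // still fast: first bad nightly is after mid
        }
    }
    nightlies.get(lo).copied()
}

fn main() {
    let nightlies = ["2017-04-24", "2017-04-25", "2017-04-26", "2017-04-27"];
    // Pretend the regression landed in nightly-2017-04-26.
    let bad = first_bad(&nightlies, |d| d >= "2017-04-26");
    println!("first bad nightly: {}", bad.unwrap());
}
```

With roughly ninety nightlies between a known-good and a known-bad date, this needs only about seven toolchain installs.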
eddyb commented on Jan 19, 2018
cc @michaelwoerister @arielb1 @nagisa
michaelwoerister commented on Jan 19, 2018
It is expected that incrementally compiled code performs worse at runtime. That's the trade-off you make. I would not use it for benchmarks.
The recent addition of ThinLTO is likely to also affect runtime performance of specific benchmarks.
What happens if you do
RUSTFLAGS=-Ccodegen-units=1 CARGO_INCREMENTAL=0 cargo bench
?

cc @alexcrichton
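(For reference, and assuming a reasonably recent Cargo, much of this can also be pinned in the benchmark profile rather than passed through the environment; the fragment below is a hypothetical illustration, not something from this issue. Incremental compilation would still need to be disabled via CARGO_INCREMENTAL=0.)

```toml
# Hypothetical Cargo.toml fragment pinning the settings discussed above
[profile.bench]
codegen-units = 1   # same effect as RUSTFLAGS=-Ccodegen-units=1
lto = true          # request full LTO for benchmark builds
```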
Marwes commented on Jan 19, 2018
(These are on a different machine so numbers won't be comparable to the ones above)
env RUSTFLAGS=-Ccodegen-units=1 CARGO_INCREMENTAL=0
rustc 1.17.0-nightly (62eb605 2017-02-15)
rustc 1.25.0-nightly (0f9c784 2018-01-17)
Without specifying Ccodegen-units=1
rustc 1.25.0-nightly (0f9c784 2018-01-17)
rustc 1.25.0-nightly (0f9c784 2018-01-17) (codegen-units=1 -Z thinlto=no lto=true)
So a single codegen unit seems to help a bit, but it is still slower than it could be. I thought about codegen units briefly when testing but never tried it, as the cargo docs seem to imply that they are already set to 1 :/ https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections
Yeah, that was just my bad; I forgot I had it enabled when running the benchmarks... (it would still be nice if it weren't a 2x slowdown, but I get that it might not be possible to improve much)
Marwes commented on Jan 19, 2018
Yeah, I figured ThinLTO might be one of the culprits, but I don't believe there is a way to use full LTO at the moment (#47521). Depending on how much ThinLTO affects the runtimes, that might still be a problem.
michaelwoerister commented on Jan 22, 2018
247 vs 325 microseconds is indeed quite the slowdown. It would be interesting to see where all the additional time goes. Since this is just one data point, it's hard to even guess. Running the code in a profiler would be most insightful, I think.
Also, @eddyb, hasn't there been some kind of regression in the benchmarking framework that could also play into this?
Yes, that has been out of date for a few weeks now, I think. cc @alexcrichton & @rust-lang/cargo
At some point we'll probably have incremental ThinLTO, at which point runtime performance should be quite a bit better. However, I'm not sure if that will ever be the default since it will cause longer compile times than regular incremental compilation. Incremental compilation strives to give you code that is "fast enough" and otherwise clearly prioritizes short compile times.
Marwes commented on Jan 22, 2018
I tried gleaning something from the profiler output but didn't spot anything, as the runtime is spread out across a lot of different places and inlining tears through the code completely. I may give it another shot, but I don't expect much.
It is unlikely to have anything to do with the benchmark framework. I am comparing these numbers to https://github.com/Geal/nom_benchmarks/blob/master/http/nom-http/src/main.rs, which also uses it, and that benchmark hasn't regressed. (I actually use https://github.com/bluss/bencher/ but switching to the builtin one shows no change in runtime.)
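To take the framework out of the equation entirely, the comparison can also be cross-checked with a hand-rolled timing loop. A minimal sketch (`parse_request` is a placeholder workload standing in for the combine-based HTTP parser, not the actual code):

```rust
use std::time::Instant;

// Placeholder workload: count the bytes up to the first b'\n'.
// In the real comparison this would be the combine HTTP parser.
fn parse_request(input: &[u8]) -> usize {
    input.iter().take_while(|&&b| b != b'\n').count()
}

fn main() {
    let input = b"GET /index.html HTTP/1.1\nHost: example.com\n";
    let iters: u128 = 10_000;
    let start = Instant::now();
    let mut sink = 0usize; // accumulate results so the loop isn't optimized away
    for _ in 0..iters {
        sink = sink.wrapping_add(parse_request(input));
    }
    let elapsed = start.elapsed();
    println!("{} ns/iter (sink={})", elapsed.as_nanos() / iters, sink);
}
```

This is far cruder than a real bench harness (no warm-up, no outlier rejection), but two compilers producing a 2x difference should still show up clearly.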
Added another result to #47561 (comment). Forcing full LTO helps a bit but is not the sole problem.
rustc 1.25.0-nightly (0f9c784 2018-01-17) (codegen-units=1 -Z thinlto=no lto=true)
kennytm commented on Jan 22, 2018
I've checked this from 1.13.0 to the current nightly, and it is obvious that the regression was introduced in 1.19.0 and 1.20.0. The speed is recovering in more recent versions (with CGU=1 and ThinLTO=off) but still hasn't come back to the 1.19.0 level.
Graphs
Raw data
arielb1 commented on Jan 22, 2018
@kennytm
Cool. Could you try to bisect to the specific commit? I might then take a go at investigating it.
kennytm commented on Jan 22, 2018
@arielb1 Probably not to a specific commit, but I'm now doing a date-based bisection which should point to a narrow-enough commit range.
kennytm commented on Jan 22, 2018
From the timing of nightlies between 2017-04-24 and 2017-07-17:
- nightly-2017-04-26 (+43µs): 63c7721...2b4c911 (likely #40123 (LLVM 4.0 Upgrade))
- nightly-2017-05-14 (+26µs): e17a122...826d8f3 (likely #41920 (remove the #[inline] attribute from drop_in_place))
- nightly-2017-06-21 (+36µs): 0414594...4450779 (???)
- nightly-2017-06-23 (+22µs): 622e7e6...ab5bec2 (likely #42771 (mark calls in the unwind path as !noinline))
- nightly-2017-07-07 (-48µs): 3610a70...696412d (likely #42899 (Switch to rust-lang-nursery/compiler-builtins) or #42816 (rustc: Implement stack probes for x86))

Graphs
Raw data
arielb1 commented on Jan 23, 2018
Not much we can do about that without a fairly deep investigation. We're about to update to LLVM 5.0 or 6.0, so let's hope that fixes the regression.
This looks like #42313 and #42727 - the first try at allocator integration had some perf issues - a missing #[inline] attribute, which was fixed by #42727. This sort of makes sense, but is weird. I might look into it.