-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Surprising optimization differences between variants of the same code #48200
Comments
According to godbolt, 1.11.0 is the last version that generated two compare/branches for the first rust code, like the others, although it still did generate slightly different code for the last two: example::foo:
push rbp
mov rbp, rsp
xor eax, eax
cmp qword ptr [rsi], 1
jne .LBB0_4
cmp qword ptr [rdi], 1
jne .LBB0_3
mov rax, qword ptr [rdi + 8]
add rax, qword ptr [rsi + 8]
pop rbp
ret
.LBB0_3:
xor eax, eax
.LBB0_4:
pop rbp
ret
example::bar:
push rbp
mov rbp, rsp
xor eax, eax
cmp qword ptr [rdi], 1
jne .LBB1_3
xor eax, eax
cmp qword ptr [rsi], 1
jne .LBB1_3
mov rax, qword ptr [rsi + 8]
add rax, qword ptr [rdi + 8]
.LBB1_3:
pop rbp
ret Note there is no duplicate return code there. |
Timing the various versions with a simple benchmark could be useful. |
Benched with: #![feature(test)]
extern crate rand;
extern crate test;
use rand::Rng;
use test::Bencher;
#[inline(never)]
pub fn foo(a: Option<usize>, b: Option<usize>) -> usize {
if let (Some(a), Some(b)) = (a, b) {
a + b
} else {
0
}
}
#[inline(never)]
pub fn bar(a: Option<usize>, b: Option<usize>) -> usize {
if a.is_some() && b.is_some() {
a.unwrap() + b.unwrap()
} else {
0
}
}
#[inline(never)]
pub fn baz(a: Option<usize>, b: Option<usize>) -> usize {
if let Some(a) = a {
if let Some(b) = b {
return a + b;
}
}
0
}
#[inline(never)]
fn bench(b: &mut Bencher, f: fn(Option<usize>, Option<usize>) -> usize) {
let seed: &[_] = &[1, 2, 3, 4];
let mut rng: rand::StdRng = rand::SeedableRng::from_seed(seed);
let data : Vec<_> = rng.gen_iter::<(Option<usize>, Option<usize>)>().take(1000).collect();
b.iter(|| {
data.iter().fold(0, |acc, &(a, b)| acc + f(a, b))
});
}
#[bench]
fn bench_foo(b: &mut Bencher) {
bench(b, foo)
}
#[bench]
fn bench_bar(b: &mut Bencher) {
bench(b, bar)
}
#[bench]
fn bench_baz(b: &mut Bencher) {
bench(b, baz)
} Multiple attempts, even though the error margins are high, all put
Of note, removing the Micro-benchmarks are hard. I'm inclined to think the current version above is the fairer. |
With #49420, the LLVM IR for the 3 functions become: Click to expand LLVM IR.; issue_48200::foo
; Function Attrs: norecurse nounwind readnone uwtable
define i64 @_ZN11issue_482003foo17h7f3d764b4e2d0d3cE(i64 %a.0, i64 %a.1, i64 %b.0, i64 %b.1) unnamed_addr #0 {
start:
%cond = icmp eq i64 %a.0, 1
%cond1 = icmp eq i64 %b.0, 1
%or.cond = and i1 %cond, %cond1
%0 = add i64 %b.1, %a.1
%_0.0 = select i1 %or.cond, i64 %0, i64 0
ret i64 %_0.0
}
; issue_48200::bar
; Function Attrs: uwtable
define i64 @_ZN11issue_482003bar17he38cd8ae2cebf23eE(i64, i64, i64, i64) unnamed_addr #1 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
%switch.i = icmp eq i64 %0, 1
%switch.i2 = icmp eq i64 %2, 1
%or.cond = and i1 %switch.i, %switch.i2
%4 = add i64 %3, %1
%_0.0 = select i1 %or.cond, i64 %4, i64 0
ret i64 %_0.0
}
; issue_48200::baz
; Function Attrs: norecurse nounwind readnone uwtable
define i64 @_ZN11issue_482003baz17h789590496be40da2E(i64, i64, i64, i64) unnamed_addr #0 {
start:
%cond = icmp eq i64 %0, 1
%cond1 = icmp eq i64 %2, 1
%or.cond = and i1 %cond, %cond1
%4 = add i64 %3, %1
%_0.0 = select i1 %or.cond, i64 %4, i64 0
ret i64 %_0.0
} As you can see, the three functions are now identical. For reference, this is the produced assembly with it: Click to expand ASM. .section __TEXT,__text,regular,pure_instructions
.globl __ZN11issue_482003foo17h7f3d764b4e2d0d3cE
.p2align 4, 0x90
__ZN11issue_482003foo17h7f3d764b4e2d0d3cE:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
xorq $1, %rdi
xorq $1, %rdx
addq %rcx, %rsi
xorl %eax, %eax
orq %rdi, %rdx
cmoveq %rsi, %rax
popq %rbp
retq
.cfi_endproc
.globl __ZN11issue_482003bar17he38cd8ae2cebf23eE
.p2align 4, 0x90
__ZN11issue_482003bar17he38cd8ae2cebf23eE:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
xorq $1, %rdi
xorq $1, %rdx
addq %rcx, %rsi
xorl %eax, %eax
orq %rdi, %rdx
cmoveq %rsi, %rax
popq %rbp
retq
.cfi_endproc
.globl __ZN11issue_482003baz17h789590496be40da2E
.p2align 4, 0x90
__ZN11issue_482003baz17h789590496be40da2E:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
xorq $1, %rdi
xorq $1, %rdx
addq %rcx, %rsi
xorl %eax, %eax
orq %rdi, %rdx
cmoveq %rsi, %rax
popq %rbp
retq
.cfi_endproc
.subsections_via_symbols |
Looks like all three are identical in nightly, so closing: https://play.rust-lang.org/?gist=4785ad9a9540c82a45ed4c1a42ee2450&version=nightly&mode=release&edition=2015 In fact, some are so identical that the merge-identical-methods LLVM pass got rid of one 🙂 playground::foo:
jmp playground::baz@PLT |
O_o why not both? Also, does rust actually care about function addresses? Because if not, it could use aliases rather than dummy functions that jump. |
Also note that this might have only changed because those Option are now passed by value (two registers each), rather than as pseudo-refs (one pointer each, presumably to somewhere on the stack). |
Looking at the LLVM, it seems two are |
I was looking at the assembly output of the following code, to see if rust would do a short-circuit or not:
And got the following result:
(thanks godbolt)
Which, come to think of it, might make sense, although I'm not sure reading the data and making the addition before checking the second Option tag is better than avoiding the branch this all allows to avoid. So, let's assume this is actually better than doing two compare/branch at the beginning, at the very least, the compiled code should be the same as for either of the following:
Both compile to the code below, which is different than the original one:
BTW, note how it generates two branch targets with the same return code instead of reusing the first one for the second branch.
The text was updated successfully, but these errors were encountered: