Surprising optimization differences between variants of the same code #48200

glandium · 2018-02-14T07:18:19Z

I was looking at the assembly output of the following code, to see if rust would do a short-circuit or not:

pub fn foo(a: Option<usize>, b: Option<usize>) -> usize {
    if let (Some(a), Some(b)) = (a, b) {
        a + b
    } else {
        0
    }
}

And got the following result:

example::foo:
  push rbp
  mov rbp, rsp
  cmp qword ptr [rdi], 1
  jne .LBB0_1
  mov rcx, qword ptr [rdi + 8]
  add rcx, qword ptr [rsi + 8]
  xor eax, eax
  cmp qword ptr [rsi], 1
  cmove rax, rcx
  pop rbp
  ret
.LBB0_1:
  xor eax, eax
  pop rbp
  ret

(thanks godbolt)

Which, come to think of it, might make sense, although I'm not sure reading the data and making the addition before checking the second Option tag is better than avoiding the branch this all allows to avoid. So, let's assume this is actually better than doing two compare/branch at the beginning, at the very least, the compiled code should be the same as for either of the following:

pub fn bar(a: Option<usize>, b: Option<usize>) -> usize {
    if a.is_some() && b.is_some() {
        a.unwrap() + b.unwrap()
    } else {
        0
    }
}

pub fn baz(a: Option<usize>, b: Option<usize>) -> usize {
    if let Some(a) = a {
        if let Some(b) = b {
            return a + b;
        }
    }
    0
}

Both compile to the code below, which is different than the original one:

  push rbp
  mov rbp, rsp
  cmp qword ptr [rdi], 1
  jne .LBB1_1
  cmp qword ptr [rsi], 1
  jne .LBB1_3
  mov rax, qword ptr [rsi + 8]
  add rax, qword ptr [rdi + 8]
  pop rbp
  ret
.LBB1_1:
  xor eax, eax
  pop rbp
  ret
.LBB1_3:
  xor eax, eax
  pop rbp
  ret

BTW, note how it generates two branch targets with the same return code instead of reusing the first one for the second branch.

glandium · 2018-02-14T07:24:31Z

According to godbolt, 1.11.0 is the last version that generated two compare/branches for the first rust code, like the others, although it still did generate slightly different code for the last two:

example::foo:
  push rbp
  mov rbp, rsp
  xor eax, eax
  cmp qword ptr [rsi], 1
  jne .LBB0_4
  cmp qword ptr [rdi], 1
  jne .LBB0_3
  mov rax, qword ptr [rdi + 8]
  add rax, qword ptr [rsi + 8]
  pop rbp
  ret
.LBB0_3:
  xor eax, eax
.LBB0_4:
  pop rbp
  ret

example::bar:
  push rbp
  mov rbp, rsp
  xor eax, eax
  cmp qword ptr [rdi], 1
  jne .LBB1_3
  xor eax, eax
  cmp qword ptr [rsi], 1
  jne .LBB1_3
  mov rax, qword ptr [rsi + 8]
  add rax, qword ptr [rdi + 8]
.LBB1_3:
  pop rbp
  ret

Note there is no duplicate return code there.

leonardo-m · 2018-02-14T09:09:20Z

Timing the various versions with a simple benchmark could be useful.

glandium · 2018-02-14T23:39:31Z

Benched with:

#![feature(test)]

extern crate rand;
extern crate test;

use rand::Rng;
use test::Bencher;

#[inline(never)]
pub fn foo(a: Option<usize>, b: Option<usize>) -> usize {
    if let (Some(a), Some(b)) = (a, b) {
        a + b
    } else {
        0
    }
}

#[inline(never)]
pub fn bar(a: Option<usize>, b: Option<usize>) -> usize {
    if a.is_some() && b.is_some() {
        a.unwrap() + b.unwrap()
    } else {
        0
    }
}

#[inline(never)]
pub fn baz(a: Option<usize>, b: Option<usize>) -> usize {
    if let Some(a) = a {
        if let Some(b) = b {
            return a + b;
        }
    }
    0
}

#[inline(never)]
fn bench(b: &mut Bencher, f: fn(Option<usize>, Option<usize>) -> usize) {
    let seed: &[_] = &[1, 2, 3, 4];
    let mut rng: rand::StdRng = rand::SeedableRng::from_seed(seed);
    let data : Vec<_> = rng.gen_iter::<(Option<usize>, Option<usize>)>().take(1000).collect();
    b.iter(|| {
        data.iter().fold(0, |acc, &(a, b)| acc + f(a, b))
    });
}

#[bench]
fn bench_foo(b: &mut Bencher) {
    bench(b, foo)
}

#[bench]
fn bench_bar(b: &mut Bencher) {
    bench(b, bar)
}

#[bench]
fn bench_baz(b: &mut Bencher) {
    bench(b, baz)
}

Multiple attempts, even though the error margins are high, all put bench_foo as the slowest on my machine (i7-7500U), e.g.

test bench_bar ... bench:       2,459 ns/iter (+/- 343)
test bench_baz ... bench:       2,427 ns/iter (+/- 208)
test bench_foo ... bench:       2,550 ns/iter (+/- 148)

Of note, removing the #[inline(never)] on bench changes the results and make bench_foo as fast as bench_baz, but now bench_bar is slower, although the code for the functions themselves hasn't changed. Also note that I originally wrote bench with generics instead of function pointer, and no #[inline(never)] at all, and that was giving totally different results too.

Micro-benchmarks are hard.

I'm inclined to think the current version above is the fairer.

nox · 2018-04-14T16:51:56Z

With #49420, the LLVM IR for the 3 functions become:

Click to expand LLVM IR.

; issue_48200::foo
; Function Attrs: norecurse nounwind readnone uwtable
define i64 @_ZN11issue_482003foo17h7f3d764b4e2d0d3cE(i64 %a.0, i64 %a.1, i64 %b.0, i64 %b.1) unnamed_addr #0 {
start:
  %cond = icmp eq i64 %a.0, 1
  %cond1 = icmp eq i64 %b.0, 1
  %or.cond = and i1 %cond, %cond1
  %0 = add i64 %b.1, %a.1
  %_0.0 = select i1 %or.cond, i64 %0, i64 0
  ret i64 %_0.0
}

; issue_48200::bar
; Function Attrs: uwtable
define i64 @_ZN11issue_482003bar17he38cd8ae2cebf23eE(i64, i64, i64, i64) unnamed_addr #1 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %switch.i = icmp eq i64 %0, 1
  %switch.i2 = icmp eq i64 %2, 1
  %or.cond = and i1 %switch.i, %switch.i2
  %4 = add i64 %3, %1
  %_0.0 = select i1 %or.cond, i64 %4, i64 0
  ret i64 %_0.0
}

; issue_48200::baz
; Function Attrs: norecurse nounwind readnone uwtable
define i64 @_ZN11issue_482003baz17h789590496be40da2E(i64, i64, i64, i64) unnamed_addr #0 {
start:
  %cond = icmp eq i64 %0, 1
  %cond1 = icmp eq i64 %2, 1
  %or.cond = and i1 %cond, %cond1
  %4 = add i64 %3, %1
  %_0.0 = select i1 %or.cond, i64 %4, i64 0
  ret i64 %_0.0
}

As you can see, the three functions are now identical.

For reference, this is the produced assembly with it:

Click to expand ASM.

	.section	__TEXT,__text,regular,pure_instructions
	.globl	__ZN11issue_482003foo17h7f3d764b4e2d0d3cE
	.p2align	4, 0x90
__ZN11issue_482003foo17h7f3d764b4e2d0d3cE:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	xorq	$1, %rdi
	xorq	$1, %rdx
	addq	%rcx, %rsi
	xorl	%eax, %eax
	orq	%rdi, %rdx
	cmoveq	%rsi, %rax
	popq	%rbp
	retq
	.cfi_endproc

	.globl	__ZN11issue_482003bar17he38cd8ae2cebf23eE
	.p2align	4, 0x90
__ZN11issue_482003bar17he38cd8ae2cebf23eE:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	xorq	$1, %rdi
	xorq	$1, %rdx
	addq	%rcx, %rsi
	xorl	%eax, %eax
	orq	%rdi, %rdx
	cmoveq	%rsi, %rax
	popq	%rbp
	retq
	.cfi_endproc

	.globl	__ZN11issue_482003baz17h789590496be40da2E
	.p2align	4, 0x90
__ZN11issue_482003baz17h789590496be40da2E:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	xorq	$1, %rdi
	xorq	$1, %rdx
	addq	%rcx, %rsi
	xorl	%eax, %eax
	orq	%rdi, %rdx
	cmoveq	%rsi, %rax
	popq	%rbp
	retq
	.cfi_endproc


.subsections_via_symbols

scottmcm · 2018-07-04T10:29:02Z

Looks like all three are identical in nightly, so closing: https://play.rust-lang.org/?gist=4785ad9a9540c82a45ed4c1a42ee2450&version=nightly&mode=release&edition=2015

In fact, some are so identical that the merge-identical-methods LLVM pass got rid of one 🙂

playground::foo:
	jmp	playground::baz@PLT

glandium · 2018-07-04T10:53:03Z

In fact, some are so identical that the merge-identical-methods LLVM pass got rid of one

O_o why not both? Also, does rust actually care about function addresses? Because if not, it could use aliases rather than dummy functions that jump.

glandium · 2018-07-04T10:56:59Z

Also note that this might have only changed because those Option are now passed by value (two registers each), rather than as pseudo-refs (one pointer each, presumably to somewhere on the stack).

scottmcm · 2018-07-04T11:03:34Z

why not both?

Looking at the LLVM, it seems two are nounwind but the other still has a personality.

scottmcm closed this as completed Jul 4, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Surprising optimization differences between variants of the same code #48200

Surprising optimization differences between variants of the same code #48200

glandium commented Feb 14, 2018 •

edited

Loading

glandium commented Feb 14, 2018

leonardo-m commented Feb 14, 2018

glandium commented Feb 14, 2018

nox commented Apr 14, 2018 •

edited

Loading

scottmcm commented Jul 4, 2018

glandium commented Jul 4, 2018

glandium commented Jul 4, 2018

scottmcm commented Jul 4, 2018

Surprising optimization differences between variants of the same code #48200

Surprising optimization differences between variants of the same code #48200

Comments

glandium commented Feb 14, 2018 • edited Loading

glandium commented Feb 14, 2018

leonardo-m commented Feb 14, 2018

glandium commented Feb 14, 2018

nox commented Apr 14, 2018 • edited Loading

scottmcm commented Jul 4, 2018

glandium commented Jul 4, 2018

glandium commented Jul 4, 2018

scottmcm commented Jul 4, 2018

glandium commented Feb 14, 2018 •

edited

Loading

nox commented Apr 14, 2018 •

edited

Loading