Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surprising optimization differences between variants of the same code #48200

Closed
glandium opened this issue Feb 14, 2018 · 8 comments
Closed

Surprising optimization differences between variants of the same code #48200

glandium opened this issue Feb 14, 2018 · 8 comments
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@glandium
Copy link
Contributor

glandium commented Feb 14, 2018

I was looking at the assembly output of the following code, to see if rust would do a short-circuit or not:

pub fn foo(a: Option<usize>, b: Option<usize>) -> usize {
    if let (Some(a), Some(b)) = (a, b) {
        a + b
    } else {
        0
    }
}

And got the following result:

example::foo:
  push rbp
  mov rbp, rsp
  cmp qword ptr [rdi], 1
  jne .LBB0_1
  mov rcx, qword ptr [rdi + 8]
  add rcx, qword ptr [rsi + 8]
  xor eax, eax
  cmp qword ptr [rsi], 1
  cmove rax, rcx
  pop rbp
  ret
.LBB0_1:
  xor eax, eax
  pop rbp
  ret

(thanks godbolt)

Which, come to think of it, might make sense, although I'm not sure reading the data and making the addition before checking the second Option tag is better than avoiding the branch this all allows to avoid. So, let's assume this is actually better than doing two compare/branch at the beginning, at the very least, the compiled code should be the same as for either of the following:

pub fn bar(a: Option<usize>, b: Option<usize>) -> usize {
    if a.is_some() && b.is_some() {
        a.unwrap() + b.unwrap()
    } else {
        0
    }
}

pub fn baz(a: Option<usize>, b: Option<usize>) -> usize {
    if let Some(a) = a {
        if let Some(b) = b {
            return a + b;
        }
    }
    0
}

Both compile to the code below, which is different than the original one:

  push rbp
  mov rbp, rsp
  cmp qword ptr [rdi], 1
  jne .LBB1_1
  cmp qword ptr [rsi], 1
  jne .LBB1_3
  mov rax, qword ptr [rsi + 8]
  add rax, qword ptr [rdi + 8]
  pop rbp
  ret
.LBB1_1:
  xor eax, eax
  pop rbp
  ret
.LBB1_3:
  xor eax, eax
  pop rbp
  ret

BTW, note how it generates two branch targets with the same return code instead of reusing the first one for the second branch.

@kennytm kennytm added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Feb 14, 2018
@Centril Centril added A-codegen Area: Code generation T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Feb 14, 2018
@kennytm kennytm added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Feb 14, 2018
@glandium
Copy link
Contributor Author

According to godbolt, 1.11.0 is the last version that generated two compare/branches for the first rust code, like the others, although it still did generate slightly different code for the last two:

example::foo:
  push rbp
  mov rbp, rsp
  xor eax, eax
  cmp qword ptr [rsi], 1
  jne .LBB0_4
  cmp qword ptr [rdi], 1
  jne .LBB0_3
  mov rax, qword ptr [rdi + 8]
  add rax, qword ptr [rsi + 8]
  pop rbp
  ret
.LBB0_3:
  xor eax, eax
.LBB0_4:
  pop rbp
  ret

example::bar:
  push rbp
  mov rbp, rsp
  xor eax, eax
  cmp qword ptr [rdi], 1
  jne .LBB1_3
  xor eax, eax
  cmp qword ptr [rsi], 1
  jne .LBB1_3
  mov rax, qword ptr [rsi + 8]
  add rax, qword ptr [rdi + 8]
.LBB1_3:
  pop rbp
  ret

Note there is no duplicate return code there.

@leonardo-m
Copy link

Timing the various versions with a simple benchmark could be useful.

@glandium
Copy link
Contributor Author

Benched with:

#![feature(test)]

extern crate rand;
extern crate test;

use rand::Rng;
use test::Bencher;

#[inline(never)]
pub fn foo(a: Option<usize>, b: Option<usize>) -> usize {
    if let (Some(a), Some(b)) = (a, b) {
        a + b
    } else {
        0
    }
}

#[inline(never)]
pub fn bar(a: Option<usize>, b: Option<usize>) -> usize {
    if a.is_some() && b.is_some() {
        a.unwrap() + b.unwrap()
    } else {
        0
    }
}

#[inline(never)]
pub fn baz(a: Option<usize>, b: Option<usize>) -> usize {
    if let Some(a) = a {
        if let Some(b) = b {
            return a + b;
        }
    }
    0
}

#[inline(never)]
fn bench(b: &mut Bencher, f: fn(Option<usize>, Option<usize>) -> usize) {
    let seed: &[_] = &[1, 2, 3, 4];
    let mut rng: rand::StdRng = rand::SeedableRng::from_seed(seed);
    let data : Vec<_> = rng.gen_iter::<(Option<usize>, Option<usize>)>().take(1000).collect();
    b.iter(|| {
        data.iter().fold(0, |acc, &(a, b)| acc + f(a, b))
    });
}

#[bench]
fn bench_foo(b: &mut Bencher) {
    bench(b, foo)
}

#[bench]
fn bench_bar(b: &mut Bencher) {
    bench(b, bar)
}

#[bench]
fn bench_baz(b: &mut Bencher) {
    bench(b, baz)
}

Multiple attempts, even though the error margins are high, all put bench_foo as the slowest on my machine (i7-7500U), e.g.

test bench_bar ... bench:       2,459 ns/iter (+/- 343)
test bench_baz ... bench:       2,427 ns/iter (+/- 208)
test bench_foo ... bench:       2,550 ns/iter (+/- 148)

Of note, removing the #[inline(never)] on bench changes the results and make bench_foo as fast as bench_baz, but now bench_bar is slower, although the code for the functions themselves hasn't changed. Also note that I originally wrote bench with generics instead of function pointer, and no #[inline(never)] at all, and that was giving totally different results too.

Micro-benchmarks are hard.

I'm inclined to think the current version above is the fairer.

@nox
Copy link
Contributor

nox commented Apr 14, 2018

With #49420, the LLVM IR for the 3 functions become:

Click to expand LLVM IR.
; issue_48200::foo
; Function Attrs: norecurse nounwind readnone uwtable
define i64 @_ZN11issue_482003foo17h7f3d764b4e2d0d3cE(i64 %a.0, i64 %a.1, i64 %b.0, i64 %b.1) unnamed_addr #0 {
start:
  %cond = icmp eq i64 %a.0, 1
  %cond1 = icmp eq i64 %b.0, 1
  %or.cond = and i1 %cond, %cond1
  %0 = add i64 %b.1, %a.1
  %_0.0 = select i1 %or.cond, i64 %0, i64 0
  ret i64 %_0.0
}

; issue_48200::bar
; Function Attrs: uwtable
define i64 @_ZN11issue_482003bar17he38cd8ae2cebf23eE(i64, i64, i64, i64) unnamed_addr #1 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %switch.i = icmp eq i64 %0, 1
  %switch.i2 = icmp eq i64 %2, 1
  %or.cond = and i1 %switch.i, %switch.i2
  %4 = add i64 %3, %1
  %_0.0 = select i1 %or.cond, i64 %4, i64 0
  ret i64 %_0.0
}

; issue_48200::baz
; Function Attrs: norecurse nounwind readnone uwtable
define i64 @_ZN11issue_482003baz17h789590496be40da2E(i64, i64, i64, i64) unnamed_addr #0 {
start:
  %cond = icmp eq i64 %0, 1
  %cond1 = icmp eq i64 %2, 1
  %or.cond = and i1 %cond, %cond1
  %4 = add i64 %3, %1
  %_0.0 = select i1 %or.cond, i64 %4, i64 0
  ret i64 %_0.0
}

As you can see, the three functions are now identical.

For reference, this is the produced assembly with it:

Click to expand ASM.
	.section	__TEXT,__text,regular,pure_instructions
	.globl	__ZN11issue_482003foo17h7f3d764b4e2d0d3cE
	.p2align	4, 0x90
__ZN11issue_482003foo17h7f3d764b4e2d0d3cE:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	xorq	$1, %rdi
	xorq	$1, %rdx
	addq	%rcx, %rsi
	xorl	%eax, %eax
	orq	%rdi, %rdx
	cmoveq	%rsi, %rax
	popq	%rbp
	retq
	.cfi_endproc

	.globl	__ZN11issue_482003bar17he38cd8ae2cebf23eE
	.p2align	4, 0x90
__ZN11issue_482003bar17he38cd8ae2cebf23eE:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	xorq	$1, %rdi
	xorq	$1, %rdx
	addq	%rcx, %rsi
	xorl	%eax, %eax
	orq	%rdi, %rdx
	cmoveq	%rsi, %rax
	popq	%rbp
	retq
	.cfi_endproc

	.globl	__ZN11issue_482003baz17h789590496be40da2E
	.p2align	4, 0x90
__ZN11issue_482003baz17h789590496be40da2E:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	xorq	$1, %rdi
	xorq	$1, %rdx
	addq	%rcx, %rsi
	xorl	%eax, %eax
	orq	%rdi, %rdx
	cmoveq	%rsi, %rax
	popq	%rbp
	retq
	.cfi_endproc


.subsections_via_symbols

@scottmcm
Copy link
Member

scottmcm commented Jul 4, 2018

Looks like all three are identical in nightly, so closing: https://play.rust-lang.org/?gist=4785ad9a9540c82a45ed4c1a42ee2450&version=nightly&mode=release&edition=2015

In fact, some are so identical that the merge-identical-methods LLVM pass got rid of one 🙂

playground::foo:
	jmp	playground::baz@PLT

@scottmcm scottmcm closed this as completed Jul 4, 2018
@glandium
Copy link
Contributor Author

glandium commented Jul 4, 2018

In fact, some are so identical that the merge-identical-methods LLVM pass got rid of one

O_o why not both? Also, does rust actually care about function addresses? Because if not, it could use aliases rather than dummy functions that jump.

@glandium
Copy link
Contributor Author

glandium commented Jul 4, 2018

Also note that this might have only changed because those Option are now passed by value (two registers each), rather than as pseudo-refs (one pointer each, presumably to somewhere on the stack).

@scottmcm
Copy link
Member

scottmcm commented Jul 4, 2018

why not both?

Looking at the LLVM, it seems two are nounwind but the other still has a personality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants