This repository was archived by the owner on May 17, 2020. It is now read-only.

Allow users to disable fp contract optimizations ? #91

Open
mwarusz opened this issue Jul 10, 2019 · 1 comment

mwarusz commented Jul 10, 2019

Recently I ran into a behavior that surprised me, demonstrated by the MWE below:

using CuArrays, CUDAnative, GPUifyLoops

function kernel(rho, T)
  P = rho[1] * T[1]

  if (abs(P - P) > 1e-16)
    @cuprintf("diff = %.16e\n", P - P)
  end
  nothing
end

rho = CuArray([1e-1])
T = CuArray([300.0])
@launch CUDA() kernel(rho, T, threads=1, blocks=1)

with the output

diff = 1.6653345369377348e-15

Basically, if my understanding of the generated PTX is correct, P - P is computed as fma(rho[1], T[1], -P), which is probably not the smartest move by the compiler. The fused operation uses the exact, unrounded product while P has already been rounded, so the result is the rounding error of the multiplication rather than zero. However, clang with LLVM-6.0.1 also does this for CUDA C, so I guess that's expected. The issue goes away if I disable fp contraction. In clang there is an option for that called -ffp-contract. Maybe adding a similar option in GPUifyLoops would be helpful for debugging?
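The effect described above can be reproduced on the CPU without a GPU. This is a minimal sketch, assuming only Julia's Base `fma`; the names `separate` and `fused` are illustrative and do not appear in the original MWE or in the generated PTX:

```julia
# Same input values as in the MWE.
rho = 0.1
T = 300.0

# P is the product rounded to Float64.
P = rho * T

# Without contraction: the rounded product is subtracted from itself.
separate = P - P

# With contraction: fma uses the exact, unrounded product rho*T
# and rounds only once, after adding -P.
fused = fma(rho, T, -P)

println("separate = ", separate)
println("fused    = ", fused)
```

With IEEE 754 double precision, `separate` should be exactly zero, while `fused` should be the rounding error of the multiplication, which is the same value the kernel prints.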

For convenience, the generated PTX can be found here:

https://gist.github.com/mwarusz/5ab4ac99b02e77b54178cd95c9820d7b

vchuravy (Owner) commented

Thanks for bringing this up. The goal in #55 was indeed to match Clang (we were hunting down a performance gap).

I agree that the fact that we use contract unconditionally is probably not what we want in the long-term. Julia in general tries to provide localised control to the user (compare @fastmath).
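As an illustration of what localised control can look like in plain Julia, here is a rough sketch using Base `muladd`, which permits (but does not require) the compiler to fuse the multiply and add at that one call site. The function names `strict_residual` and `relaxed_residual` are hypothetical, and this is not an existing or proposed GPUifyLoops API:

```julia
# Strict version: ordinary * and - are each rounded separately,
# so the residual is exactly zero.
function strict_residual(rho, T)
    P = rho * T
    return P - P
end

# Opt-in version: muladd allows this one expression to be contracted
# into an fma, so the residual may be the multiplication's rounding error.
function relaxed_residual(rho, T)
    P = rho * T
    return muladd(rho, T, -P)
end

println(strict_residual(0.1, 300.0))
println(relaxed_residual(0.1, 300.0))
```

Whether the second call actually fuses depends on the target and the compiler, so its result can be either zero or the fma residual. The point is only that the relaxation is requested per expression rather than applied to the whole kernel.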

But yeah:

	mul.f64 	%fd4, %fd1, %fd3;
	neg.f64 	%fd5, %fd4;
	fma.rn.f64 	%fd2, %fd1, %fd3, %fd5;
	abs.f64 	%fd6, %fd2;

is kinda funny. The only explanation I have is that the FMA units are the fastest thing.
