Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use immediately invoked lambdas in size and range error checks #2255

Merged
merged 10 commits into from
Dec 21, 2020

Conversation

SteveBronder
Copy link
Collaborator

@SteveBronder SteveBronder commented Dec 11, 2020

Summary

See issue #2249 for technical info. This uses immediately invoked lambdas in size and range error checks. This makes the actual code for the error checks smaller which should help the compiler when attempting to inline these.

This also adds some useful compiler attribute macros. One to call out in particular is STAN_REMOVE_RANGE_AND_SIZE_CHECKS which is the user passes in at compile time will turn off all range and size checks

Tests

Refactor so no new tests

Side Effects

We should generally do this pattern in all the error checks so in a future PR would be good to do that

Release notes

Use immediately invoked lambdas in size and range error checks

Checklist

@stan-buildbot
Copy link
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.63 3.55 1.02 2.36% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.95 -5.81% slower
eight_schools/eight_schools.stan 0.13 0.11 1.1 9.43% faster
gp_regr/gp_regr.stan 0.16 0.15 1.0 0.49% faster
irt_2pl/irt_2pl.stan 5.83 5.96 0.98 -2.16% slower
performance.compilation 88.11 86.33 1.02 2.02% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.41 8.41 1.0 0.0% slower
pkpd/one_comp_mm_elim_abs.stan 29.83 30.01 0.99 -0.63% slower
sir/sir.stan 141.79 132.08 1.07 6.84% faster
gp_regr/gen_gp_data.stan 0.05 0.05 1.02 1.78% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.91 2.92 1.0 -0.17% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.39 0.97 -3.25% slower
arK/arK.stan 2.47 1.78 1.39 28.05% faster
arma/arma.stan 0.58 0.74 0.79 -26.53% slower
garch/garch.stan 0.61 0.53 1.14 12.02% faster
Mean result: 1.02979338971

Jenkins Console Log
Blue Ocean
Commit hash: e23b147


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@bbbales2
Copy link
Member

I'm not in for the STAN_NO_RANGE_AND_SIZE_CHECK macros. I see the use for experimenting, but it just seems like another code branch to maintain.

The COLD_PATH stuff... Wow what a speedup. Looks worth it

@SteveBronder
Copy link
Collaborator Author

In totally fine with removing the no check macro. It was mostly just an idea

Copy link
Contributor

@t4c1 t4c1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand I like STAN_NO_RANGE_AND_SIZE_CHECK. I believe we already have some functions in which significant amount of time is spent in checks. Also this is not much work to maintain. It is just an option for early return, without any special cases. Although we may be able to find a better name for this macro. Maybe STAN_RETURN_IF_NO_CHECKS?

@wds15
Copy link
Contributor

wds15 commented Dec 14, 2020

What about STAN_NDEBUG ?

@SteveBronder
Copy link
Collaborator Author

What about STAN_NDEBUG ?

Oh wait, why are we reinventing the wheel here why don't we just use NDEBUG??

@wds15
Copy link
Contributor

wds15 commented Dec 14, 2020

NDEBUG is fine for me as well...

@bbbales2
Copy link
Member

On the other hand I like STAN_NO_RANGE_AND_SIZE_CHECK

It just seems more complicated. Like how is this exposed to the surface? The cold path stuff is for free to the user.

I think this would definitely make sense for the O1/O2/O3 discussion. Like, turn off all nan/inf (the expensive ones) at O3 or something -- but that's just a different thing.

@SteveBronder
Copy link
Collaborator Author

It's exposed to the surface by the user defining -DNDEBUG so that at compile time it knows these checks are no ops. It's commonly used in things like the standard library to turn off debug information.

https://en.m.wikibooks.org/wiki/C_Programming/assert.h

@bbbales2
Copy link
Member

Without a need for this outside of benchmarking Math, or without a plan to expose them at the interfaces, they're really not doing anything and I don't see why we'd add them. We can always add them once we've got a good reason. The idea doesn't go away if we don't put them in now.

I wouldn't call this debug information. It is the intended behavior of the Math library to do these checks. The COLD_PATH stuff is different if I'm understanding correctly. Big performance difference and anyone using gcc gets it for free.

@wds15
Copy link
Contributor

wds15 commented Dec 14, 2020

While I do like the idea to be able to kill those checks for the sake of speed, I do concur here that this is actually a separate thing to sort out. These checks have triggered many debates - heated ones - and it‘s probably better to do that in a dedicated PR.

(I am in favor for having the knob to turn this off, but still let‘s do one by one)

@rok-cesnovar
Copy link
Member

I agree this should be a separate thing and should not be NDEBUG. These checks are not really debug information.

@SteveBronder
Copy link
Collaborator Author

Without a need for this outside of benchmarking Math, or without a plan to expose them at the interfaces,

I'm not sure I'm following, what do you mean when you say we don't have a plan to expose these at the user interface? Or maybe a better question is, what does that look like to you? In my mindplace the user can just add -DNDEBUG to their Makevars in R and cmdstan users would put that in make/local

they're really not doing anything and I don't see why we'd add them. We can always add them once we've got a good reason. The idea doesn't go away if we don't put them in now.

It does do something, which is that it makes the functions do nothing! Since this makes these functions no-ops the compiler completely ignores them

I wouldn't call this debug information. It is the intended behavior of the Math library to do these checks.

Yes and the std library treats it the same way where the intended behavior of assert() is to stop the program when something goes wrong. Then if you have NDEBUG on it changes the behavior to not stop the program. This only effects things if the user ops into it (aka passing -DNDEBUG at compile time)

The COLD_PATH stuff is different if I'm understanding correctly. Big performance difference and anyone using gcc gets it for free.

There's two things happening here.

  1. COLD_PATH puts the error code on a code path that the CPU will never try to branch predict / preload anything related to the error path. __GNUC__ is actually defined for both clang and gcc so both compiler users will get this benefit.

  2. When the user passes -DNDEBUG to the compiler these ops are completely turned off so that the compiler sees they are no-ops and doesn't bother even adding them to the binary.

NDEBUG is only meant to be used for "release" versions of a program. So the user would only turn this on once they've run this program once and made sure all their sizes /ranges are correct.

@bbbales2
Copy link
Member

So the user would only turn this on once they've run this program once and made sure all their sizes /ranges are correct.

Yeah, the use case makes sense.

Dropping checks has come up before (I've heard @seantalts talk about this, and @bgoodri may have as well, though he can correct me). We could probably optionally turn off nan/infs/positive definite and bounds checks separately.

It's a popular idea just needs talked to the interfaces, because that's where it will get used.

I'm not sure I'm following, what do you mean when you say we don't have a plan to expose these at the user interface?

The interfaces mean how will people use it from cmdstan, cmdstanr, cmdstanpy, rstan, or pystan. Things like the threading on/off flags haven't been a walk in the park with how make and the precompiled headers work (though I think Rok has some sort of solution there now) even though there is a clear use case. rstanarm and brms and whatnot could probably use it too.

@stan-buildbot
Copy link
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.75 3.4 1.1 9.45% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.99 -1.2% slower
eight_schools/eight_schools.stan 0.11 0.11 1.0 -0.49% slower
gp_regr/gp_regr.stan 0.15 0.16 0.99 -1.07% slower
irt_2pl/irt_2pl.stan 5.4 5.6 0.96 -3.8% slower
performance.compilation 88.56 85.97 1.03 2.92% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.37 8.36 1.0 0.03% faster
pkpd/one_comp_mm_elim_abs.stan 30.45 29.7 1.03 2.48% faster
sir/sir.stan 141.96 133.19 1.07 6.18% faster
gp_regr/gen_gp_data.stan 0.04 0.04 1.0 -0.25% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.92 2.91 1.0 0.26% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.39 0.96 -3.65% slower
arK/arK.stan 2.47 1.78 1.39 27.86% faster
arma/arma.stan 0.58 0.74 0.79 -26.49% slower
garch/garch.stan 0.6 0.54 1.12 10.72% faster
Mean result: 1.02824665276

Jenkins Console Log
Blue Ocean
Commit hash: 1e10fc5


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@SteveBronder
Copy link
Collaborator Author

Dropping checks has come up before (I've heard @seantalts talk about this, and @bgoodri may have as well, though he can correct me). We could probably optionally turn off nan/infs/positive definite and bounds checks separately.

imo I'd rather not turn these off since the turning off non/inf/ number bounds checks is usually associated with things like -Ofast and imo is a bit more of a problem. We could potentially turn those off but it's a much riskier thing than turning of the size and range checks.

The interfaces mean how will people use it from cmdstan, cmdstanr, cmdstanpy, rstan, or pystan. Things like the threading on/off flags haven't been a walk in the park with how make and the precompiled headers work (though I think Rok has some sort of solution there now) even though there is a clear use case. rstanarm and brms and whatnot could probably use it too.

Why wouldn't the user just add -DNDEBUG to their Makevars in R and cmdstan users would put that in make/local? @rok-cesnovar would this mess the the pre-complied headers? My only guess is that the precompiled header would not have NDEBUG defined so worst case this wouldn't do anything

@rok-cesnovar
Copy link
Member

rok-cesnovar commented Dec 15, 2020

Why wouldn't the user just add -DNDEBUG to their Makevars in R and cmdstan users would put that in make/local?

One fear I have for this is that some more advanced users may already have this flag and expect to still get the "normal" Stan behavior. Though that can be avoided by prominently displaying this in the release notes. I still think we should avoid NDEBUG.

I am a bit less worried if this only concerns indexing bounds checks and not element-wise checks for nan/inf. For those I would be worried.

would this mess the the precomplied headers?

I dont think so, worst case the user needs to rebuild cmdstan after setting those. That is advised in general to anyone that touches CXXFLAGS directly. We have "quality-of-life" improvements when using STAN_THREADS, STAN_OPENCL and STAN_MPI (no need to rebuild manually) because that is what we expect users to use and that is what we promote or will promote. Anyone setting arbitrary C++ flags that we dont document or publicize/promote is "on their own". We have zero to little knowledge of how Stan behaves with various g++/clang++ flags and we also probably dont have the resources to try/test most of them.

For rstan/pystan I have no idea how/if they handle prebuilding main.o and CRTP and all that. I think we should not cause more issue for those tightly-coupled interfaces by suggesting users to try NDEBUG there. If someone is this concerned with speed they should be using cmdstan in the first place, imo.

@t4c1
Copy link
Contributor

t4c1 commented Dec 15, 2020

I still think adding an option to turn off checks is a good idea, but I think it should not use NDEBUG macro. Also it might be even better to enable finer control by having two macros - one for bounds checks and one for value checks. However, turning off checks is a separate topic and should probably be in a separate PR.

@SteveBronder
Copy link
Collaborator Author

I removed the macro to remove checks right now, but we should have something like that in the future

@stan-buildbot
Copy link
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.45 3.41 1.01 1.14% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.94 -5.97% slower
eight_schools/eight_schools.stan 0.11 0.12 0.96 -4.28% slower
gp_regr/gp_regr.stan 0.15 0.15 1.0 0.39% faster
irt_2pl/irt_2pl.stan 5.39 5.19 1.04 3.69% faster
performance.compilation 87.13 86.25 1.01 1.01% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.38 8.77 0.96 -4.61% slower
pkpd/one_comp_mm_elim_abs.stan 31.13 29.34 1.06 5.75% faster
sir/sir.stan 144.3 129.7 1.11 10.12% faster
gp_regr/gen_gp_data.stan 0.05 0.04 1.04 3.38% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.93 3.06 0.96 -4.2% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.39 0.44 0.88 -13.38% slower
arK/arK.stan 2.46 1.84 1.34 25.15% faster
arma/arma.stan 0.59 0.62 0.94 -5.97% slower
garch/garch.stan 0.61 0.6 1.03 2.57% faster
Mean result: 1.01858649628

Jenkins Console Log
Blue Ocean
Commit hash: a57c88a


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@SteveBronder
Copy link
Collaborator Author

Is this good to go?

@t4c1
Copy link
Contributor

t4c1 commented Dec 18, 2020

No. You have not addressed everything from my last review.

@SteveBronder
Copy link
Collaborator Author

Whups sorry missed that, removed the extra compiler attributes

@stan-buildbot
Copy link
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.37 3.37 1.0 -0.06% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.98 -1.67% slower
eight_schools/eight_schools.stan 0.11 0.11 0.98 -1.95% slower
gp_regr/gp_regr.stan 0.16 0.16 1.0 0.29% faster
irt_2pl/irt_2pl.stan 5.42 5.17 1.05 4.59% faster
performance.compilation 89.07 88.42 1.01 0.73% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.4 8.64 0.97 -2.86% slower
pkpd/one_comp_mm_elim_abs.stan 29.94 29.71 1.01 0.78% faster
sir/sir.stan 137.32 133.27 1.03 2.95% faster
gp_regr/gen_gp_data.stan 0.04 0.05 0.92 -9.09% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.91 3.05 0.95 -4.83% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.4 0.39 1.02 2.03% faster
arK/arK.stan 2.5 2.52 0.99 -0.86% slower
arma/arma.stan 0.88 0.61 1.44 30.66% faster
garch/garch.stan 0.53 0.68 0.79 -27.03% slower
Mean result: 1.00966011654

Jenkins Console Log
Blue Ocean
Commit hash: f9f0dc0


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@t4c1 t4c1 merged commit effd263 into develop Dec 21, 2020
@rok-cesnovar rok-cesnovar deleted the feature/cold-path-range-size-errors branch December 21, 2020 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants