Feature faster reduce_sum #2162

Merged
merged 49 commits into from
Dec 3, 2020
Conversation

@wds15 (Contributor) commented Oct 23, 2020

Summary

The current reduce_sum parallel AD implementation suffers from performance bottlenecks whenever many shared arguments are used. The issue is that the input operands must be copied per work chunk as scheduled by the Intel TBB, which potentially creates a large burden on the AD system. This PR limits the number of deep copies of the shared input operands to the number of threads used. The merits of this approach have been benchmarked and shown here.

The idea of this PR is to introduce scoped chainable stacks, which allow functions to be executed while recording their AD tape entries onto a separate AD tape. This allows writing the AD tape of a running function into an instance which is not necessarily the AD tape instance of the respective thread.

Tests

All reduce_sum tests still run just fine. Additional components introduced still need tests.

Side Effects

Makes things faster.

Release notes

Speed up reduce_sum parallelism.

Checklist

  • Math issue speedup reduce_sum #2165

  • Copyright holder: Sebastian Weber

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built-in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.15 3.26 0.96 -3.81% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.0 -0.13% slower
eight_schools/eight_schools.stan 0.12 0.11 1.1 8.69% faster
gp_regr/gp_regr.stan 0.18 0.18 0.98 -2.47% slower
irt_2pl/irt_2pl.stan 5.77 5.7 1.01 1.15% faster
performance.compilation 90.95 88.79 1.02 2.38% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.43 8.44 1.0 -0.1% slower
pkpd/one_comp_mm_elim_abs.stan 30.34 29.19 1.04 3.79% faster
sir/sir.stan 131.6 128.65 1.02 2.24% faster
gp_regr/gen_gp_data.stan 0.04 0.05 0.9 -10.58% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.96 2.97 1.0 -0.2% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.4 0.95 -5.33% slower
arK/arK.stan 1.79 1.81 0.99 -0.98% slower
arma/arma.stan 0.61 0.61 1.0 -0.15% slower
garch/garch.stan 0.7 0.69 1.01 1.15% faster
Mean result: 0.998839050598

Jenkins Console Log
Blue Ocean
Commit hash: 9f6da2c


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.15 3.07 1.03 2.51% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.01 0.96% faster
eight_schools/eight_schools.stan 0.12 0.12 1.02 1.94% faster
gp_regr/gp_regr.stan 0.18 0.18 1.03 3.37% faster
irt_2pl/irt_2pl.stan 5.66 5.74 0.99 -1.38% slower
performance.compilation 91.14 88.2 1.03 3.22% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.59 8.46 1.02 1.54% faster
pkpd/one_comp_mm_elim_abs.stan 29.4 30.21 0.97 -2.75% slower
sir/sir.stan 129.5 130.83 0.99 -1.03% slower
gp_regr/gen_gp_data.stan 0.04 0.04 1.0 0.19% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.96 2.96 1.0 -0.05% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.39 0.4 0.98 -1.57% slower
arK/arK.stan 1.8 1.81 0.99 -0.63% slower
arma/arma.stan 0.64 0.63 1.01 1.39% faster
garch/garch.stan 0.67 0.67 1.0 0.26% faster
Mean result: 1.0056496204

Jenkins Console Log
Blue Ocean
Commit hash: 9cf713c



@wds15 (Contributor, Author) commented Nov 10, 2020

@andrjohns it would be great if you could verify the merits of this PR on your end, on your machine (which is Linux, right?).

For that, an end-to-end test is probably the best thing. I prepared this little R script which runs an updated brms threading vignette twice with different cmdstan installations. So please configure two cmdstan installations (one with plain 2.25.0 and one with 2.25.0 cmdstan together with this math branch). It should then be straightforward for you to verify the speedups from this PR.

Once we have that secured I will do the simplification mentioned above... and I still owe you some tests for the scoped chainable stack.

Does that work as a plan?

Here is the benchmark:

compare_R.txt

@wds15 wds15 marked this pull request as ready for review November 10, 2020 20:53
@wds15 (Contributor, Author) commented Nov 10, 2020

@andrjohns, @bbbales2 or @SteveBronder this should be good to go now. Tests for the new scoped thing are in. If you think there are more creative ways to test it, we can add them, of course.

Before it goes in, it would be great if my promising benchmarks could be confirmed by someone with the brms threading example and/or with the microbenchmark I wrote here:
reduce_sum_benchmark_cpp.txt

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.09 3.57 0.87 -15.34% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.92 -8.79% slower
eight_schools/eight_schools.stan 0.12 0.12 1.03 2.77% faster
gp_regr/gp_regr.stan 0.17 0.16 1.02 2.33% faster
irt_2pl/irt_2pl.stan 5.71 5.7 1.0 0.14% faster
performance.compilation 86.51 85.82 1.01 0.8% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.48 8.55 0.99 -0.79% slower
pkpd/one_comp_mm_elim_abs.stan 29.39 29.57 0.99 -0.6% slower
sir/sir.stan 138.68 132.78 1.04 4.25% faster
gp_regr/gen_gp_data.stan 0.04 0.04 1.0 -0.33% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.95 3.01 0.98 -1.83% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.38 1.0 0.38% faster
arK/arK.stan 1.8 1.79 1.0 0.33% faster
arma/arma.stan 0.61 0.6 1.02 1.54% faster
garch/garch.stan 0.75 0.59 1.28 21.88% faster
Mean result: 1.01068130199

Jenkins Console Log
Blue Ocean
Commit hash: d48c275



@wds15 (Contributor, Author) commented Nov 23, 2020

Hmm... I was curious and ran the microbenchmark on a Linux machine. The result looks good:

[runtime comparison plot]

So things improve a lot for the shared case.

However, I also saw occasional failures of the nesting_gradient test. I am not yet sure if this is due to the machine or a real problem. Until this is resolved, this should not be merged (despite the nice speed gain, a nested pattern needs to be supported).

@wds15 (Contributor, Author) commented Nov 24, 2020

Removing one forward and simplifying the code a little bit gives me back a fully stable nesting_gradient test on Linux.

The performance is still just fine from what I can see.

@andrjohns , @SteveBronder or @bbbales2 up for a review?

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.63 3.57 1.02 1.75% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.97 -3.3% slower
eight_schools/eight_schools.stan 0.11 0.11 1.0 0.44% faster
gp_regr/gp_regr.stan 0.17 0.17 0.98 -2.29% slower
irt_2pl/irt_2pl.stan 5.66 5.68 1.0 -0.5% slower
performance.compilation 86.82 86.07 1.01 0.86% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.44 8.47 1.0 -0.32% slower
pkpd/one_comp_mm_elim_abs.stan 31.49 29.67 1.06 5.79% faster
sir/sir.stan 133.33 136.09 0.98 -2.07% slower
gp_regr/gen_gp_data.stan 0.04 0.04 1.0 0.17% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.96 2.96 1.0 -0.13% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.37 0.37 1.0 -0.02% slower
arK/arK.stan 2.47 2.47 1.0 -0.06% slower
arma/arma.stan 0.59 0.59 1.0 0.42% faster
garch/garch.stan 0.74 0.74 1.0 0.09% faster
Mean result: 1.00094339248

Jenkins Console Log
Blue Ocean
Commit hash: c08489b



@rok-cesnovar (Member):

@andrjohns would you mind looking at this one?

@andrjohns (Collaborator):

Yep, will take a look at this tomorrow (sorry for the delay so far!)

@andrjohns (Collaborator) previously approved these changes Dec 2, 2020 and left a comment:

I didn't have a lot to add here, it was all pretty elegant in its simplicity. Just a couple of queries, feel free to ignore and merge if not relevant.

This should also generalise pretty easily to the current parallel_map design, since I can just use the indexing function in the scoped deep copy

* @return Result of evaluated functor
*/
template <typename F, typename... Args>
auto execute(F&& f, Args&&... args) {
Collaborator:

I believe you'll want auto&& or decltype(auto) when returning forwarding references

Contributor:

Usually you want decltype(auto) in such cases. Although in this case forwarding is not necessary at all: forwarding a functor before calling it does not do anything. So the function can just as well accept the functor argument by const reference.

Contributor (Author):

ok, then let's go with decltype(auto) ... but then you are saying we do not need forwarding? I am not an expert on that matter (yet). When I look at the TBB doc they have these variants in a similar context:

        // Supported since C++11
        template<typename F> auto execute(F& f) -> decltype(f());
        template<typename F> auto execute(const F& f) -> decltype(f());
        template<typename F> void enqueue(F&& f);

What, @t4c1 , would you suggest to implement in our case?

Contributor:

Sorry, my response was really not exact. For this function I would change the signature like this:

Suggested change
auto execute(F&& f, Args&&... args) {
auto execute(const F& f, Args&&... args) {

And remove the std::forward call. Since this is not actually ever returning references (unless f is returning a reference, but I think that is never actually needed) the return type can remain auto.

Contributor (Author):

Hmm... I tried the const thing, but then I can't do everything I want this thing to be able to handle. I wrote a test which fails to compile with the const thing where the functor called changes internal state of the functor. In this case the operator() is non-const and as such the const F& does not compile.

From comparing with how other libraries handle this a decltype(auto) as return type sounds reasonable to me.

... and the forward thing: I thought this is necessary to avoid calls to copy-constructors for cases when the operator()() functor creates internally some object which is then returned by value... no? Or does that mean we want to move here?

Contributor:

I did not know modifying the internal state of the functor is a realistic use case. Accepting the functor as F&& f is fine then.

About forwarding: it only matters if the functor overloads both operator()() const& and operator()() && with different behavior. Otherwise neither forward nor move affects how this works.

Member:

I did not know modifying internal state of the functor is a realistic use case.

Yeah, same. What's the use case?

// not point
// back to main autodiff stack

if (!local_args_tuple_scope_.args_tuple_holder_) {
Collaborator:

Just a quick sanity check to make sure that I understand this right. Because this is getting called in each 'chunk' of work, the conditional is checking whether the deep copy has already happened for the given thread that the chunk belongs to.

How does it ensure that there is a different copy for each thread? Is that what the std::mutex is doing in scoped_chainablestack?

Contributor (Author):

The conditional checks whether the local copy of the shared arguments has already been made for the particular reducer instance. The TBB may use more of these reducers than there are threads, but their number should be on the order of the number of threads we have. Note that the TBB schedules work chunks using the reducer instances; these work chunks (or tasks) are handed to the scheduler, which executes them in whatever thread is available. Since reducers are reused for multiple ranges, it can happen that one reducer runs in different threads.

How does it ensure that there is a different copy for each thread?

We do not need to check for threads here at all anymore, since everything is referred to from the reducer instances (but only one instance of a reducer can ever be active in a given thread, of course).

The mutex in scoped_chainablestack prevents execute from being called simultaneously on a particular instance from different threads. If that ever happens, the program terminates, as it really must never occur.

Collaborator:

Great, makes sense, thanks.

@wds15 (Contributor, Author) commented Dec 2, 2020

@andrjohns thanks for your review so far. I hope my answers make sense to you. I have added one more test to ensure the scoped_chainablestack does what I want it to (allow the functor to change its internal state).

@t4c1 I hope you are good with things as well.

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.47 3.43 1.01 1.09% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.0 0.0% slower
eight_schools/eight_schools.stan 0.12 0.11 1.01 0.85% faster
gp_regr/gp_regr.stan 0.16 0.17 0.99 -0.62% slower
irt_2pl/irt_2pl.stan 5.74 5.79 0.99 -0.83% slower
performance.compilation 87.25 85.7 1.02 1.78% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.43 8.46 1.0 -0.28% slower
pkpd/one_comp_mm_elim_abs.stan 29.0 29.18 0.99 -0.6% slower
sir/sir.stan 128.84 130.69 0.99 -1.44% slower
gp_regr/gen_gp_data.stan 0.04 0.05 0.98 -2.22% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.92 2.93 1.0 -0.42% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.39 0.38 1.03 3.05% faster
arK/arK.stan 1.77 1.77 1.0 0.17% faster
arma/arma.stan 0.73 0.73 0.99 -0.69% slower
garch/garch.stan 0.61 0.61 1.0 -0.15% slower
Mean result: 0.999946939435

Jenkins Console Log
Blue Ocean
Commit hash: 16521ed



@andrjohns (Collaborator) left a comment:

This all looks good to me, and it looks like you addressed @t4c1's forwarding comments

@wds15 wds15 merged commit e98bfbd into develop Dec 3, 2020
@rok-cesnovar rok-cesnovar mentioned this pull request Mar 23, 2021