Feature faster reduce_sum #2162

Merged
merged 49 commits into from
Dec 3, 2020
Conversation

@wds15 (Contributor) commented Oct 23, 2020

Summary

The current reduce_sum parallel AD implementation suffers from performance bottlenecks whenever many shared arguments are used. The issue is that the input operands must be copied per work chunk as scheduled by the Intel TBB, which potentially creates a large burden on the AD system. This PR limits the number of deep copies of the shared input operands to the number of threads used. The merits of this approach have been benchmarked and shown here.

The idea of this PR is to introduce scoped chainable stacks, which allow functions to be executed while recording their AD tape entries onto a separate AD tape. This allows writing the AD tape of a running function into an instance which is not necessarily the AD tape instance of the respective thread.

Tests

All reduce_sum tests still run just fine. Additional components introduced still need tests.

Side Effects

Makes things faster.

Release notes

Speed up reduce_sum parallelism.

Checklist

  • Math issue speedup reduce_sum #2165

  • Copyright holder: Sebastian Weber

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built-in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.15 3.26 0.96 -3.81% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.0 -0.13% slower
eight_schools/eight_schools.stan 0.12 0.11 1.1 8.69% faster
gp_regr/gp_regr.stan 0.18 0.18 0.98 -2.47% slower
irt_2pl/irt_2pl.stan 5.77 5.7 1.01 1.15% faster
performance.compilation 90.95 88.79 1.02 2.38% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.43 8.44 1.0 -0.1% slower
pkpd/one_comp_mm_elim_abs.stan 30.34 29.19 1.04 3.79% faster
sir/sir.stan 131.6 128.65 1.02 2.24% faster
gp_regr/gen_gp_data.stan 0.04 0.05 0.9 -10.58% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.96 2.97 1.0 -0.2% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.4 0.95 -5.33% slower
arK/arK.stan 1.79 1.81 0.99 -0.98% slower
arma/arma.stan 0.61 0.61 1.0 -0.15% slower
garch/garch.stan 0.7 0.69 1.01 1.15% faster
Mean result: 0.998839050598

Jenkins Console Log
Blue Ocean
Commit hash: 9f6da2c


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.15 3.07 1.03 2.51% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.01 0.96% faster
eight_schools/eight_schools.stan 0.12 0.12 1.02 1.94% faster
gp_regr/gp_regr.stan 0.18 0.18 1.03 3.37% faster
irt_2pl/irt_2pl.stan 5.66 5.74 0.99 -1.38% slower
performance.compilation 91.14 88.2 1.03 3.22% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.59 8.46 1.02 1.54% faster
pkpd/one_comp_mm_elim_abs.stan 29.4 30.21 0.97 -2.75% slower
sir/sir.stan 129.5 130.83 0.99 -1.03% slower
gp_regr/gen_gp_data.stan 0.04 0.04 1.0 0.19% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.96 2.96 1.0 -0.05% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.39 0.4 0.98 -1.57% slower
arK/arK.stan 1.8 1.81 0.99 -0.63% slower
arma/arma.stan 0.64 0.63 1.01 1.39% faster
garch/garch.stan 0.67 0.67 1.0 0.26% faster
Mean result: 1.0056496204

Jenkins Console Log
Blue Ocean
Commit hash: 9cf713c



@wds15 (Contributor, Author) commented Nov 10, 2020

@andrjohns it would be great if you could verify the merits of this PR on your end, on your machine (which is Linux, right?).

For that, an end-to-end test is probably the best thing. I prepared this little R script which runs an updated brms threading vignette twice with different cmdstan installations. So please configure two cmdstan installations (one with plain 2.25.0 and one with 2.25.0 cmdstan together with this math branch). It should then be straightforward for you to verify the speedups from this PR.

Once we have that secured I will do the simplification mentioned above... and I still owe you some tests for the scoped chainable stack.

Does that work as a plan?

Here is the benchmark:

compare_R.txt

@wds15 wds15 marked this pull request as ready for review November 10, 2020 20:53
@wds15 (Contributor, Author) commented Nov 10, 2020

@andrjohns, @bbbales2 or @SteveBronder this should be good to go now. Tests for the new scoped thing are in. If you think there are more creative ways to test it, we can add them, of course.

Before it goes in, it would be great if my promising benchmarks could be confirmed by someone with the brms threading example and/or with the microbenchmark I wrote here:
reduce_sum_benchmark_cpp.txt

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.09 3.57 0.87 -15.34% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.92 -8.79% slower
eight_schools/eight_schools.stan 0.12 0.12 1.03 2.77% faster
gp_regr/gp_regr.stan 0.17 0.16 1.02 2.33% faster
irt_2pl/irt_2pl.stan 5.71 5.7 1.0 0.14% faster
performance.compilation 86.51 85.82 1.01 0.8% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.48 8.55 0.99 -0.79% slower
pkpd/one_comp_mm_elim_abs.stan 29.39 29.57 0.99 -0.6% slower
sir/sir.stan 138.68 132.78 1.04 4.25% faster
gp_regr/gen_gp_data.stan 0.04 0.04 1.0 -0.33% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.95 3.01 0.98 -1.83% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.38 1.0 0.38% faster
arK/arK.stan 1.8 1.79 1.0 0.33% faster
arma/arma.stan 0.61 0.6 1.02 1.54% faster
garch/garch.stan 0.75 0.59 1.28 21.88% faster
Mean result: 1.01068130199

Jenkins Console Log
Blue Ocean
Commit hash: d48c275



@wds15 (Contributor, Author) commented Nov 23, 2020

Hmm... I was curious and ran the microbenchmark on a Linux machine. The result looks good:

[runtime comparison plot]

So things improve a lot for the shared case.

However, I also saw occasional failures of the nesting_gradient test. I am not yet sure if this is due to the machine or a real problem. Until this is resolved, this should not be merged (despite the nice speed gain, a nested pattern needs to be supported).

@wds15 (Contributor, Author) commented Nov 24, 2020

Removing one forward and simplifying the code a little bit gives me back a fully stable nesting_gradient test on Linux.

The performance is still just fine from what I can see.

@andrjohns , @SteveBronder or @bbbales2 up for a review?

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.63 3.57 1.02 1.75% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.97 -3.3% slower
eight_schools/eight_schools.stan 0.11 0.11 1.0 0.44% faster
gp_regr/gp_regr.stan 0.17 0.17 0.98 -2.29% slower
irt_2pl/irt_2pl.stan 5.66 5.68 1.0 -0.5% slower
performance.compilation 86.82 86.07 1.01 0.86% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.44 8.47 1.0 -0.32% slower
pkpd/one_comp_mm_elim_abs.stan 31.49 29.67 1.06 5.79% faster
sir/sir.stan 133.33 136.09 0.98 -2.07% slower
gp_regr/gen_gp_data.stan 0.04 0.04 1.0 0.17% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.96 2.96 1.0 -0.13% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.37 0.37 1.0 -0.02% slower
arK/arK.stan 2.47 2.47 1.0 -0.06% slower
arma/arma.stan 0.59 0.59 1.0 0.42% faster
garch/garch.stan 0.74 0.74 1.0 0.09% faster
Mean result: 1.00094339248

Jenkins Console Log
Blue Ocean
Commit hash: c08489b



@rok-cesnovar (Member):

@andrjohns would you mind looking at this one?

@andrjohns (Collaborator):

Yep, will take a look at this tomorrow (sorry for the delay so far!)

@andrjohns (Collaborator) previously approved these changes Dec 2, 2020 and left a comment:

I didn't have a lot to add here, it was all pretty elegant in its simplicity. Just a couple of queries, feel free to ignore and merge if not relevant.

This should also generalise pretty easily to the current parallel_map design, since I can just use the indexing function in the scoped deep copy

* @return Result of evaluated functor
*/
template <typename F, typename... Args>
auto execute(F&& f, Args&&... args) {
Collaborator:

I believe you'll want auto&& or decltype(auto) when returning forwarding references

Contributor:

Usually you want decltype(auto) in such cases. Although in this case forwarding is not necessary at all: forwarding a functor before calling it does not do anything. So the function can just as well accept the functor argument by const reference.

Contributor (Author):

ok, then let's go with decltype(auto) ... but then you are saying we do not need forwarding? I am not an expert on that matter (yet). When I look at the TBB doc they have these variants in a similar context:

        // Supported since C++11
        template<typename F> auto execute(F& f) -> decltype(f());
        template<typename F> auto execute(const F& f) -> decltype(f());
        template<typename F> void enqueue(F&& f);

What, @t4c1 , would you suggest to implement in our case?

Contributor:

Sorry, my response was really not exact. For this function I would change the signature like this:

Suggested change
auto execute(F&& f, Args&&... args) {
auto execute(const F& f, Args&&... args) {

And remove the std::forward call. Since this is not actually ever returning references (unless f is returning a reference, but I think that is never actually needed) the return type can remain auto.

Contributor (Author):

Hmm... I tried the const thing, but then I can't do everything I want this thing to be able to handle. I wrote a test which fails to compile with the const thing where the functor called changes internal state of the functor. In this case the operator() is non-const and as such the const F& does not compile.

From comparing with how other libraries handle this a decltype(auto) as return type sounds reasonable to me.

... and the forward thing: I thought this is necessary to avoid calls to copy-constructors for cases when the operator()() functor creates internally some object which is then returned by value... no? Or does that mean we want to move here?

Contributor:

I did not know modifying the internal state of the functor is a realistic use case. Accepting the functor as F&& f is fine then.

About forwarding: it only matters if the functor overloads both operator()() const& and operator()() && with different behavior. Otherwise neither forward nor move affects how this works.

Member:

I did not know modifying internal state of the functor is a realistic use case.

Yeah, same. What's the use case?

// not point
// back to main autodiff stack

if (!local_args_tuple_scope_.args_tuple_holder_) {
Collaborator:

Just a quick sanity check to make sure that I understand this right. Because this is getting called in each 'chunk' of work, the conditional is checking whether the deep copy has already happened for the given thread that the chunk belongs to.

How does it ensure that there is a different copy for each thread? Is that what the std::mutex is doing in scoped_chainablestack?

Contributor (Author):

The conditional checks whether the local copy of the shared arguments has already been made for the particular reducer instance. The TBB may use more of these reducers than there are threads, but their number should be on the order of the number of threads we have. Note that the TBB schedules work chunks using the reducer instances; these work chunks (or tasks) are handed to the scheduler, which executes them in whatever thread is available. Since reducers are reused for multiple ranges, it can happen that one reducer runs in different threads.

How does it ensure that there is a different copy for each thread?

We do not need to check for threads here at all anymore, since everything is referred to from the reducer instances (but only one instance of a reducer can ever be active in a given thread, of course).

The mutex in scoped_chainablestack prevents execute from being called simultaneously on a particular instance from different threads. If that ever happens, the program terminates, as it really must never occur.

Collaborator:

Great, makes sense, thanks.

@wds15 (Contributor, Author) commented Dec 2, 2020

@andrjohns thanks for your review so far. I hope my answers make sense to you. I have added one more test to ensure the scoped_chainablestack does what I want it to (allow the functor to change its internal state).

@t4c1 I hope you are good with things as well.

@stan-buildbot (Contributor)


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.47 3.43 1.01 1.09% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.0 0.0% slower
eight_schools/eight_schools.stan 0.12 0.11 1.01 0.85% faster
gp_regr/gp_regr.stan 0.16 0.17 0.99 -0.62% slower
irt_2pl/irt_2pl.stan 5.74 5.79 0.99 -0.83% slower
performance.compilation 87.25 85.7 1.02 1.78% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.43 8.46 1.0 -0.28% slower
pkpd/one_comp_mm_elim_abs.stan 29.0 29.18 0.99 -0.6% slower
sir/sir.stan 128.84 130.69 0.99 -1.44% slower
gp_regr/gen_gp_data.stan 0.04 0.05 0.98 -2.22% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.92 2.93 1.0 -0.42% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.39 0.38 1.03 3.05% faster
arK/arK.stan 1.77 1.77 1.0 0.17% faster
arma/arma.stan 0.73 0.73 0.99 -0.69% slower
garch/garch.stan 0.61 0.61 1.0 -0.15% slower
Mean result: 0.999946939435

Jenkins Console Log
Blue Ocean
Commit hash: 16521ed



@andrjohns (Collaborator) left a comment:

This all looks good to me, and it looks like you addressed @t4c1's forwarding comments

@wds15 wds15 merged commit e98bfbd into develop Dec 3, 2020
@rok-cesnovar rok-cesnovar mentioned this pull request Mar 23, 2021