Feature faster reduce_sum #2162
Conversation
Co-authored-by: Steve Bronder <[email protected]>
…v/math into feature-fast-parallel-ad
Jenkins Console Log. Machine information: Mac OS X 10.11.6 (build 15G22010)
@andrjohns it would be great if you could verify on your end the merits of this PR on your machine (which is a Linux, right?). For that an end-to-end test is probably the best thing. I prepared this little R script which runs an updated brms threading vignette twice with different cmdstan installations. So please configure two cmdstan installations (one with 2.25.0 and one using the 2.25.0 cmdstan together with this math branch). It should then be straightforward for you to verify the speedups from this PR. Once we have that secured I will do the simplification as mentioned above... and I still owe you some tests for the scoped chainable stack. Does that work as a plan? Here is the benchmark:
@andrjohns, @bbbales2 or @SteveBronder this should be good to go now. Tests for the new scoped thing are in. If you think there are more creative ways to test it, we can add them, of course. Before it goes in, it would be great if my promising benchmarks could be reconfirmed by someone with the brms threading example and/or with the microbenchmark I wrote here:
…dev/math into feature-fast-parallel-ad-2
Hmm... I was curious and ran the microbenchmark on a Linux machine. The result looks good: things improve a lot for the shared case. However, I also saw occasional failures of the nesting_gradient test. I am not yet sure if this is due to the machine or a real thing. Until this is resolved this should not be merged (despite the nice speed gain, a nested pattern needs to be supported).
Removing one… The performance is still just fine from what I can see. @andrjohns, @SteveBronder or @bbbales2, up for a review?
@andrjohns would you mind looking at this one? |
Yep, will take a look at this tomorrow (sorry for the delay so far!) |
I didn't have a lot to add here, it was all pretty elegant in its simplicity. Just a couple of queries, feel free to ignore and merge if not relevant.
This should also generalise pretty easily to the current parallel_map design, since I can just use the indexing function in the scoped deep copy.
 * @return Result of evaluated functor
 */
template <typename F, typename... Args>
auto execute(F&& f, Args&&... args) {
I believe you'll want auto&& or decltype(auto) when returning forwarding references.
Usually you want decltype(auto) in such cases. Although in this case forwarding is not necessary at all: forwarding a functor before calling it does not do anything. So the function can just as well accept the functor argument by const reference.
ok, then let's go with decltype(auto)... but then you are saying we do not need forwarding? I am not an expert on that matter (yet). When I look at the TBB docs they have these variants in a similar context:
// Supported since C++11
template<typename F> auto execute(F& f) -> decltype(f());
template<typename F> auto execute(const F& f) -> decltype(f());
template<typename F> void enqueue(F&& f);
What, @t4c1, would you suggest we implement in our case?
Sorry, my response was really not exact. For this function I would change the signature like this:
- auto execute(F&& f, Args&&... args) {
+ auto execute(const F& f, Args&&... args) {
And remove the std::forward call. Since this is not actually ever returning references (unless f returns a reference, but I think that is never actually needed), the return type can remain auto.
Hmm... I tried the const thing, but then I can't do everything I want this thing to be able to handle. I wrote a test which fails to compile with the const thing: the called functor changes its internal state, so its operator() is non-const, and as such the const F& does not compile.

From comparing with how other libraries handle this, a decltype(auto) as return type sounds reasonable to me.

... and the forward thing: I thought this is necessary to avoid calls to copy constructors for cases when the functor's operator()() internally creates some object which is then returned by value... no? Or does that mean we want to move here?
I did not know modifying internal state of the functor is a realistic use case. Accepting the functor as F&& f is fine then.

About forwarding: only if the functor overloads both operator()() const& and operator()() && with different behavior. Otherwise neither forward nor move affects how this works.
I did not know modifying internal state of the functor is a realistic use case.
Yeah, same. What's the use case?
// not point
// back to main autodiff stack

if (!local_args_tuple_scope_.args_tuple_holder_) {
Just a quick sanity check to make sure that I understand this right. Because this is getting called in each 'chunk' of work, the conditional is checking whether the deep copy has already happened for the given thread that the chunk belongs to.
How does it ensure that there is a different copy for each thread? Is that what the std::mutex is doing in scoped_chainablestack?
The conditional checks if the local copy of the shared arguments has already been made for the particular reducer instance. The TBB may use more of these workers than there are threads, but the number of workers should be on the order of how many threads we have. Note that the TBB schedules work chunks by using the reducer instances; these work chunks (or tasks) are handed to the scheduler, which executes them in whatever thread. Since reducers are used multiple times for multiple ranges, it can happen that one reducer gets to run in different threads.
How does it ensure that there is a different copy for each thread?
We do not need to check for threads here at all anymore, since everything is referred to from the reducer instances (but only one instance of the reducers can ever be active in a given thread, of course).
The mutex in scoped_chainablestack will prevent the execute of a particular instance from being called simultaneously from different threads. Whenever that happens the program will terminate, as it really must never happen.
Great, makes sense thanks
@andrjohns thanks for your review so far. I hope my answers make sense to you. I have added one more test to ensure the scoped_chainablestack does what I want it to (allow the functor to change its internal state). @t4c1 I hope you are good with things as well.
This all looks good to me, and it looks like you addressed @t4c1's forwarding comments
Summary
The current reduce_sum parallel AD implementation suffers from performance bottlenecks whenever many shared arguments are used. The issue is that the input operands must be copied per work chunk as scheduled by the Intel TBB, which potentially creates a large burden on the AD system. This PR limits the number of deep copies of the shared input operands to the number of threads used. The merits of this approach have been benchmarked and shown here.

The idea of this PR is to introduce scoped chainable stacks, which allow functions to be executed while recording their AD tape bits and pieces onto a separate AD tape. Thus, the AD tape of a running function can be written into an instance which is not necessarily the AD tape instance of the respective thread.
Tests
All reduce_sum tests still run just fine. Additional components introduced still need tests.

Side Effects
Makes things faster.
Release notes
Speed up reduce_sum parallelism.

Checklist
Math issue speedup reduce_sum #2165
Copyright holder: Sebastian Weber
The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
- the basic tests are passing (./runTests.py test/unit, make test-headers, make test-math-dependencies, make doxygen, make cpplint)
- the code is written in idiomatic C++ and changes are documented in the doxygen
- the new changes are tested