Reorganize threads in generated OpenCL kernels #2392

t4c1 · 2021-02-24T11:32:37Z

Summary

Reorganizes how work is distributed between threads in generated kernels that use colwise reductions (this includes all distributions).

This improves performance and reduces the amount of data that needs to be copied between the GPU and the host, significantly reducing running time on GPUs.
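
For readers unfamiliar with the pattern, a column-wise reduction can be split across work groups so that each group reduces a slice of rows for every column, and only the small per-group partial results need to travel back to the host. The sketch below simulates that idea on the CPU in plain C++. It is a generic illustration under that assumption, not the scheme implemented in this PR, and the names `Matrix`, `colwise_partial_sums`, `finish_on_host` and `n_groups` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Column-major matrix: element (r, c) is stored at data[r + c * rows].
struct Matrix {
  std::size_t rows;
  std::size_t cols;
  std::vector<double> data;
};

// Hypothetical helper that simulates n_groups work groups (n_groups >= 1).
// Each group sums a contiguous slice of rows for every column, so the
// result is an n_groups x cols matrix of partial sums.
Matrix colwise_partial_sums(const Matrix& m, std::size_t n_groups) {
  Matrix partial{n_groups, m.cols,
                 std::vector<double>(n_groups * m.cols, 0.0)};
  // Round up so that every row is assigned to some group.
  const std::size_t rows_per_group = (m.rows + n_groups - 1) / n_groups;
  for (std::size_t g = 0; g < n_groups; ++g) {
    const std::size_t begin = std::min(g * rows_per_group, m.rows);
    const std::size_t end = std::min(begin + rows_per_group, m.rows);
    for (std::size_t c = 0; c < m.cols; ++c) {
      for (std::size_t r = begin; r < end; ++r) {
        partial.data[g + c * n_groups] += m.data[r + c * m.rows];
      }
    }
  }
  return partial;
}

// Final step, imagined as running on the host: add up the partial sums
// of each column to obtain one value per column.
std::vector<double> finish_on_host(const Matrix& partial) {
  std::vector<double> sums(partial.cols, 0.0);
  for (std::size_t c = 0; c < partial.cols; ++c) {
    for (std::size_t g = 0; g < partial.rows; ++g) {
      sums[c] += partial.data[g + c * partial.rows];
    }
  }
  return sums;
}
```

In this sketch the data handed to the final step has `n_groups * cols` elements instead of `rows * cols`, which is where the saving in transferred data would come from when `n_groups` is much smaller than `rows`.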

Tests

This is just a refactor. Existing tests cover all these changes.

Side Effects

Performance of running OpenCL on CPUs is decreased. However, since significant speedups can be achieved even on integrated GPUs, this should not be an issue.

Release notes

OpenCL: Reorganized how work is distributed between threads in generated kernels that use colwise reductions (including all distributions), significantly improving GPU performance.

Checklist

Math issue #(issue number)
Copyright holder: Tadej Ciglarič

The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
- unit tests pass (to run, use: ./runTests.py test/unit)
- header checks pass (make test-headers)
- dependencies checks pass (make test-math-dependencies)
- docs build (make doxygen)
- code passes the built-in C++ standards checks (make cpplint)
the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested

bbbales2 · 2021-02-26T17:22:28Z

It looks like the expression here (except the version in this branch), which takes two broadcast'd types as input, results in an expression that has -1 cols. On trying to evaluate this into a matrix_cl<double> there is an error.

This is the function call that triggers the error: https://github.com/stan-dev/math/blob/develop/test/unit/math/opencl/rev/dirichlet_lpdf_test.cpp#L156

I didn't figure out where the problem was. One thing I noticed: the code here returns -1, which doesn't seem right.
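
The failure mode described above can be illustrated with a small, self-contained sketch. It assumes a convention in which a broadcast operand reports -1 for the dimension it is broadcast along, meaning "take this size from the other operand". The constant `kDynamic` and the functions `combine_dim` and `require_known` are hypothetical names chosen for illustration and are not taken from the Stan Math source.

```cpp
#include <stdexcept>

// Assumed sentinel: the size is not fixed by this operand.
constexpr int kDynamic = -1;

// Combine one dimension of the two operands of a binary expression.
// If both operands are broadcast along this dimension, the result is
// still kDynamic, because neither side supplies a concrete size.
int combine_dim(int a, int b) {
  if (a == kDynamic) {
    return b;  // may itself be kDynamic
  }
  if (b == kDynamic) {
    return a;
  }
  if (a != b) {
    throw std::invalid_argument("operand sizes do not match");
  }
  return a;
}

// Evaluating into a concrete matrix needs a known size, so a dimension
// that is still kDynamic at this point is reported as an error.
int require_known(int dim) {
  if (dim == kDynamic) {
    throw std::domain_error("dimension is still dynamic");
  }
  return dim;
}
```

Under these assumptions, `combine_dim(kDynamic, 5)` yields 5, while `combine_dim(kDynamic, kDynamic)` stays at -1 and `require_known` rejects it, which mirrors the reported symptom of an expression with -1 cols failing on evaluation.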

t4c1 · 2021-02-26T18:04:43Z

Thanks, I already found the issue. Now I just need to fix it in a not too hacky way.

stan-buildbot · 2021-03-02T13:01:42Z

Name	Old Result	New Result	Ratio	Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan	3.47	3.5	0.99	-0.83% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.93	-7.87% slower
eight_schools/eight_schools.stan	0.11	0.11	1.01	0.99% faster
gp_regr/gp_regr.stan	0.16	0.16	1.02	1.69% faster
irt_2pl/irt_2pl.stan	5.24	5.13	1.02	2.12% faster
performance.compilation	90.31	88.74	1.02	1.74% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	8.69	8.63	1.01	0.62% faster
pkpd/one_comp_mm_elim_abs.stan	30.08	29.94	1.0	0.47% faster
sir/sir.stan	127.08	131.34	0.97	-3.35% slower
gp_regr/gen_gp_data.stan	0.05	0.04	1.1	8.9% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.98	3.04	0.98	-1.88% slower
pkpd/sim_one_comp_mm_elim_abs.stan	0.38	0.38	1.02	1.71% faster
arK/arK.stan	1.78	1.79	1.0	-0.43% slower
arma/arma.stan	0.73	0.74	0.98	-2.01% slower
garch/garch.stan	0.57	0.56	1.0	0.45% faster
Mean result: 1.00273275632

Commit hash: 9341add

Machine information

ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

rok-cesnovar

Great!

Reorganizes how work is distributed between threads in generated OpenCL kernels that use colwise reductions.

cd8bf95

t4c1 changed the title ~~Reorganizes threads in generated OpenCL kernels~~ Reorganize threads in generated OpenCL kernels Feb 24, 2021

stan-buildbot and others added 5 commits February 24, 2021 11:33

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.04.1 (tags/RELEASE_600/final)

c204638

fixed calculation of number of rows with zero cols

da63c75

Merge commit 'b8bafbd4e1e171ebd74557bcf574a284a058a56a' into HEAD

5cce080

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.04.1 (tags/RELEASE_600/final)

245547d

bugfix

e45eb48

t4c1 and others added 5 commits March 1, 2021 10:47

bugfixed dirichlet_lpdf and improved OpenCL testing util

c7b9903

Merge commit '796c0f0966e03dfc92cc7bb68ca670770e3fe528' into HEAD

f060c8c

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.04.1 (tags/RELEASE_600/final)

182f6e1

fixed multiple translation units

c13adfa

added include

9341add

t4c1 requested a review from rok-cesnovar March 2, 2021 14:36

rok-cesnovar approved these changes Mar 2, 2021

rok-cesnovar merged commit 5533797 into stan-dev:develop Mar 2, 2021

rok-cesnovar deleted the opencl_reorganize branch March 2, 2021 20:16
