Reorganize threads in generated OpenCL kernels #2392

Merged
merged 11 commits into from
Mar 2, 2021

Contributor

@t4c1 t4c1 commented Feb 24, 2021

Summary

Reorganizes how work is distributed between threads in generated kernels that use colwise reductions (this includes all distributions).

This improves performance and reduces the amount of data that needs to be copied between the GPU and the host, significantly reducing running time on GPUs.
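To illustrate why the way rows are split between work groups affects how much data has to come back to the host, here is a minimal CPU-side sketch in C++. It is not the kernel generator code from this PR. The function names `colwise_partial_sums` and `finish_on_host`, the column-major layout, and the contiguous row split are all assumptions made for illustration. Each simulated work group writes one partial sum per column, so the intermediate result has `n_groups * cols` entries.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model of a column-wise sum split across work groups.
// `a` is a column-major matrix with `rows` x `cols` entries.
// Each simulated work group reduces a contiguous block of rows and
// writes one partial sum per column at index c * n_groups + g.
std::vector<double> colwise_partial_sums(const std::vector<double>& a,
                                         std::size_t rows, std::size_t cols,
                                         std::size_t n_groups) {
  std::vector<double> partials(n_groups * cols, 0.0);
  if (n_groups == 0) {
    return partials;
  }
  // Round up so that every row is covered by some group.
  const std::size_t rows_per_group = (rows + n_groups - 1) / n_groups;
  for (std::size_t g = 0; g < n_groups; ++g) {
    const std::size_t begin = g * rows_per_group;
    const std::size_t end = std::min(rows, begin + rows_per_group);
    for (std::size_t c = 0; c < cols; ++c) {
      for (std::size_t r = begin; r < end; ++r) {
        partials[c * n_groups + g] += a[c * rows + r];
      }
    }
  }
  return partials;
}

// Final step that would run after the partial results are copied back:
// add up the partial sums belonging to each column.
std::vector<double> finish_on_host(const std::vector<double>& partials,
                                   std::size_t n_groups, std::size_t cols) {
  std::vector<double> sums(cols, 0.0);
  for (std::size_t c = 0; c < cols; ++c) {
    for (std::size_t g = 0; g < n_groups; ++g) {
      sums[c] += partials[c * n_groups + g];
    }
  }
  return sums;
}
```

In this model, the number of values transferred is proportional to the number of work groups that share a column, so an organization that lets fewer groups cover each column leaves less to copy and less to finish on the host.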

Tests

This is just a refactor. Existing tests cover all these changes.

Side Effects

Performance of running OpenCL on CPUs is decreased. However, as significant speedups can be achieved even on integrated GPUs, this should not be an issue.

Release notes

OpenCL: Reorganized how work is distributed between threads in generated kernels that use colwise reductions (including all distributions), significantly improving GPU performance.

Checklist

  • Math issue #(issue number)

  • Copyright holder: Tadej Ciglarič

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@t4c1 t4c1 changed the title from "Reorganizes threads in generated OpenCL kernels" to "Reorganize threads in generated OpenCL kernels" Feb 24, 2021
@bbbales2
Member

It looks like the expression here (except the version in this branch), which takes two broadcast'd types as input, results in an expression that has -1 cols. Trying to evaluate this into a matrix_cl<double> produces an error.

This is the function call that triggers the error: https://github.com/stan-dev/math/blob/develop/test/unit/math/opencl/rev/dirichlet_lpdf_test.cpp#L156

I didn't figure out where the problem was, but one thing I noticed is that the code here returns -1, which doesn't seem right.

@t4c1
Contributor Author

t4c1 commented Feb 26, 2021

Thanks, I already found the issue. Now I just need to fix it in a not too hacky way.

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 3.47 | 3.5 | 0.99 | -0.83% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.93 | -7.87% slower |
| eight_schools/eight_schools.stan | 0.11 | 0.11 | 1.01 | 0.99% faster |
| gp_regr/gp_regr.stan | 0.16 | 0.16 | 1.02 | 1.69% faster |
| irt_2pl/irt_2pl.stan | 5.24 | 5.13 | 1.02 | 2.12% faster |
| performance.compilation | 90.31 | 88.74 | 1.02 | 1.74% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.69 | 8.63 | 1.01 | 0.62% faster |
| pkpd/one_comp_mm_elim_abs.stan | 30.08 | 29.94 | 1.0 | 0.47% faster |
| sir/sir.stan | 127.08 | 131.34 | 0.97 | -3.35% slower |
| gp_regr/gen_gp_data.stan | 0.05 | 0.04 | 1.1 | 8.9% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.98 | 3.04 | 0.98 | -1.88% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.38 | 0.38 | 1.02 | 1.71% faster |
| arK/arK.stan | 1.78 | 1.79 | 1.0 | -0.43% slower |
| arma/arma.stan | 0.73 | 0.74 | 0.98 | -2.01% slower |
| garch/garch.stan | 0.57 | 0.56 | 1.0 | 0.45% faster |

Mean result: 1.00273275632
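
For readers who want to reproduce the last two columns, the following C++ sketch applies the formula given in the table header, 1 - new / old, expressed as a percentage. The `Change` struct and the `performance_change` function are hypothetical names, and reading the Ratio column as old / new is an assumption inferred from the rows (for example sir.stan: 127.08 / 131.34 is roughly 0.97), not something the benchmark bot documents here.

```cpp
// Hypothetical helper, not part of the benchmark tooling.
struct Change {
  double ratio;    // assumed to be old / new; values above 1 mean faster
  double percent;  // (1 - new / old) * 100; positive means faster
};

// Compares an old and a new timing using the formula from the table header.
Change performance_change(double old_time, double new_time) {
  Change c;
  c.ratio = old_time / new_time;
  c.percent = (1.0 - new_time / old_time) * 100.0;
  return c;
}
```

The timings shown in the table are rounded to two decimals, so recomputing from them will not match the reported percentages exactly.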

Jenkins Console Log
Blue Ocean
Commit hash: 9341add


Machine information
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@t4c1 t4c1 requested a review from rok-cesnovar March 2, 2021 14:36
Member

@rok-cesnovar rok-cesnovar left a comment

Great!

@rok-cesnovar rok-cesnovar merged commit 5533797 into stan-dev:develop Mar 2, 2021
@rok-cesnovar rok-cesnovar deleted the opencl_reorganize branch March 2, 2021 20:16
5 participants