
Only call cudaGetDeviceProperties once. #135

Merged · merged 1 commit into LLNL:develop on Jun 2, 2021

Conversation

@corbett5 (Member) commented Jun 1, 2021

While debugging I added some timers to `encode3launch` and got the following results for a single-precision field of size 528×520×512 on device.

Previously

Time between entering `encode3launch` and `cudaMemset`: 0.617 ms
Encode elapsed time: 0.00064 (s)
# encode3 rate: 817.93 (GB / sec)

100 calls to `zfp_compress`: 0.158 s
100 calls to `zfp_decompress`: 0.228 s

Now

Time between entering `encode3launch` and `cudaMemset`: 0.000 ms
Encode elapsed time: 0.00064 (s)
# encode3 rate: 818.09 (GB / sec)

100 calls to `zfp_compress`: 0.077 s
100 calls to `zfp_decompress`: 0.166 s

I guess this would break things if, in between calls to zfp, the user set the active device and the new active device had a different max grid size, but I doubt anyone is doing that.
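
For reference, a minimal sketch of the caching approach, assuming a static cache guarded by a first-call check (the function and variable names below are illustrative, not the actual zfp code):

```c
#include <cuda_runtime.h>

/* Illustrative only: cache the max grid size so that the expensive
 * cudaGetDeviceProperties call is paid once, not on every launch.
 * This is also where the caveat above bites: if the user later
 * switches to a device with a different max grid size, the cached
 * value goes stale. */
static int
get_max_grid_size(void)
{
  static int cached_max_grid_size = 0; /* 0 = not yet queried */
  if (!cached_max_grid_size) {
    int device;
    struct cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device); /* the ~0.6 ms call */
    cached_max_grid_size = prop.maxGridSize[0];
  }
  return cached_max_grid_size;
}
```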

@lindstro (Member) commented Jun 1, 2021

Might it make sense to query this information in `zfp_stream_set_execution`? It wouldn't directly address the issue you describe, but at least the user would then have a workaround.

@codecov-commenter commented

Codecov Report

Merging #135 (2652ea1) into develop (fcae077) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff            @@
##           develop     #135   +/-   ##
========================================
  Coverage    94.84%   94.84%           
========================================
  Files          268      268           
  Lines        15216    15216           
========================================
  Hits         14431    14431           
  Misses         785      785           


@corbett5 (Member, Author) commented Jun 2, 2021

> Might it make sense to query this information in `zfp_stream_set_execution`? It wouldn't directly address the issue you describe, but at least the user would then have a workaround.

That makes sense to me, but you would have to add the max grid size to the `zfp_stream` struct. You would also have to `#include <cuda.h>` in `zfp.c`, and given the way the build is set up, I think all the CUDA stuff is meant to be isolated from mainline zfp. Definitely a better solution, but more invasive.

@lindstro (Member) commented Jun 2, 2021

True, though `zfp.c` includes `template/cudacompress.c`, which in turn includes `cuda_zfp/cuZFP.h`, so things aren't that isolated anyway. Maybe we could add some generic `init()` function for each backend that gets called by `zfp_stream_set_execution()`? Moving this info into `zfp_stream` is not great, since that struct is used for other purposes as well (e.g., for the compressed-array classes). I can't think of a better solution, though, unless we add a common pointer from `zfp_stream` to execution-specific parameters that is null for the default serial execution mode. Any other ideas would be welcome.
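
Something with roughly this shape, perhaps (purely a sketch: `cuda_init` is a hypothetical backend hook, and the actual `zfp_stream_set_execution` differs):

```c
/* Hypothetical shape of the per-backend init() idea; not zfp's actual
 * implementation. zfp_stream_set_execution() would dispatch to a
 * one-time backend hook that does setup such as the device query. */
int
zfp_stream_set_execution(zfp_stream* stream, zfp_exec_policy policy)
{
  if (policy == zfp_exec_cuda) {
#ifdef ZFP_WITH_CUDA
    if (!cuda_init(stream)) /* e.g., query cudaGetDeviceProperties here */
      return 0;
#else
    return 0; /* CUDA support not compiled in */
#endif
  }
  stream->exec.policy = policy;
  return 1;
}
```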

@corbett5 (Member, Author) commented Jun 2, 2021

You're right, we could put some init into `cuZFP.h`. But unless we put something into `zfp_stream` (it could be a pointer), a user who switches a stream between device and host compression (in whatever order) will wind up calling `cudaGetDeviceProperties` each time, which is what I'm trying to avoid. If the info were associated with the stream, they could get around this by keeping one device stream and one host stream. I think the current solution is pretty good, though; I can't imagine a use case where a user switches between GPU architectures at runtime.

@lindstro (Member) commented Jun 2, 2021

Agreed, something would have to be put into `zfp_stream`. The idea was to use a single pointer instead of adding a lot of baggage directly to the struct that often won't be used, and which would change the layout of the `zfp_stream` struct depending on CUDA availability.

I was thinking that `cudaGetDeviceProperties` would be called once per stream, not once per (de)compression call, which seems like a small penalty to pay. We can revisit this later, however.
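
In other words, something like this (purely a sketch; the type and field names are hypothetical, not zfp's actual layout):

```c
/* Hypothetical sketch of the single-pointer idea: backend-specific
 * state hangs off one opaque pointer, so the struct layout stays the
 * same whether or not CUDA support is compiled in. */
typedef struct {
  int max_grid_size; /* cached result of cudaGetDeviceProperties */
  /* ...other CUDA-specific parameters... */
} zfp_cuda_params;

struct zfp_stream_sketch {
  /* ...existing zfp_stream fields... */
  void* exec_params; /* NULL for serial execution; zfp_cuda_params* for CUDA */
};
```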

@lindstro (Member) commented Jun 2, 2021

LGTM

@lindstro merged commit 9704462 into LLNL:develop on Jun 2, 2021.
@corbett5 deleted the corbett/cuda-device-properties branch on Jun 2, 2021.