Fix a recently introduced race condition in DataLayer #2998
Conversation
@@ -13,6 +13,9 @@ void BasePrefetchingDataLayer<Dtype>::Forward_gpu(
   // Copy the data
   caffe_copy(batch->data_.count(), batch->data_.gpu_data(),
       top[0]->mutable_gpu_data());
+  // Ensure the copy is synchronous wrt the host, so that the next batch isn't
+  // copied in meanwhile.
+  CUDA_CHECK(cudaStreamSynchronize(cudaStreamLegacy));
Thanks @longjon for catching this and the explanation!
However, I am still a little confused about why cudaStreamSynchronize(cudaStreamLegacy) is called here rather than several lines below, after batch->label_ is also copied. To avoid a data race, I think this synchronization should occur right before prefetch_free_.push(batch) (correct me if I'm wrong).
Though I'll admit I overlooked this because I wasn't using labels, I don't actually see a race there. It looks like only the data portion is copied in a non-blocking stream (https://github.com/BVLC/caffe/blob/master/src/caffe/layers/base_data_layer.cpp#L87), so any label access should be synchronized with the default stream.
However, I think it would be clearer and safer to keep all copies synchronous between queues as you suggest, and I don't see any significant performance implications, so I'll update accordingly.
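For reference, a sketch of the shape being discussed: issue all copies, then synchronize once right before the batch goes back to the free queue. This is only an illustration; names such as prefetch_full_, output_labels_, and ReshapeLike are assumptions based on the linked base_data_layer sources, and the committed code may differ.

template <typename Dtype>
void BasePrefetchingDataLayer<Dtype>::Forward_gpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  Batch<Dtype>* batch = prefetch_full_.pop("Data layer prefetch queue empty");
  // Reshape to the loaded data and copy it to the top blob.
  top[0]->ReshapeLike(batch->data_);
  caffe_copy(batch->data_.count(), batch->data_.gpu_data(),
      top[0]->mutable_gpu_data());
  if (this->output_labels_) {
    // Reshape to the loaded labels and copy them as well.
    top[1]->ReshapeLike(batch->label_);
    caffe_copy(batch->label_.count(), batch->label_.gpu_data(),
        top[1]->mutable_gpu_data());
  }
  // Synchronize the legacy default stream only after *all* copies have been
  // issued, so the batch cannot be overwritten while top still reads from it.
  CUDA_CHECK(cudaStreamSynchronize(cudaStreamLegacy));
  // Only now is it safe to hand the batch back to the prefetch thread.
  prefetch_free_.push(batch);
}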
Oh, I see. Thanks!
Thanks @longjon for finding this. Synchronizing on the default stream is done in a couple of places in parallel.cpp. If cudaStreamLegacy needs to be used instead for old CUDA versions, I can change it. P2PSync used to start async copies and call cudaStreamSynchronize once, but after @thatguymike's tuning it calls sync after each copy, so we could have the same code as base_data_layer.cu.
Previously, the prefetch GPU -> top GPU and prefetch CPU -> prefetch GPU copies were launched concurrently in separate streams, allowing the next batch to be copied in before the current one is read. This patch explicitly synchronizes the prefetch -> top copy wrt the host, preventing the CPU -> GPU copy from being launched until it completes.
@cypof, I think what I'm saying about …
Ah OK, let's keep cudaStreamDefault everywhere then. For the docs I don't know; they have the programming guides and samples, which can be helpful in addition to the headers/doxygen.
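A minimal sketch of the two spellings weighed in this thread (an illustration, not code from this PR):

#include <cuda_runtime.h>

int main() {
#if CUDART_VERSION >= 7000
  // cudaStreamLegacy (CUDA 7+) always names the legacy default stream, even
  // when the code is compiled with per-thread default streams.
  cudaStreamSynchronize(cudaStreamLegacy);
#else
  // On older CUDA versions, synchronizing the null stream (0) has the same
  // effect; cudaStreamDefault is numerically 0, so passing it as the stream
  // handle behaves identically.
  cudaStreamSynchronize(0);
#endif
  return 0;
}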
This PR should receive high priority since it affects the correctness of all ongoing training with DataLayer, whether single-GPU or multi-GPU. I hope to merge this as soon as possible (if no one opposes).
+1
LGTM, thanks for tracking this down @longjon!
Fix a recently introduced race condition in DataLayer
This PR fixes a race condition introduced to DataLayer by #2903, which can cause incorrect data to appear on top after a GPU forward. This affects single- or multi-GPU usage with any data source.

DataLayer's forward copies prefetch GPU data -> top GPU data using caffe_copy. Meanwhile, the prefetch thread copies prefetch CPU data -> prefetch GPU data using a non-blocking CUDA stream. caffe_copy is asynchronous wrt the host (when device -> device). That means these two copies can happen in any order, giving you either this batch's data or the next's (or some combination?). If you have two synchronized data sources (e.g., separate images and labels), this can be catastrophic.

Note that the queue pair is no help here; the batch is reinserted into the free queue immediately after the copy is issued, before it's completed.

To reproduce this issue easily, set PREFETCH_COUNT to 1, and put the copy at https://github.com/BVLC/caffe/blob/master/src/caffe/layers/base_data_layer.cu#L14 in a loop that executes, e.g., 1000 times. That shouldn't affect correctness, but it gives the race enough time to occur reliably (on my system, at least).

The fix here explicitly synchronizes the null stream used by caffe_copy. However, I think it requires CUDA 7. @thatguymike or others, what's the right way to do this without switching right away to CUDA 7?

It would be nice if there were some way to test that this doesn't happen again, but that seems difficult...
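To make the interleaving concrete, here is a minimal standalone sketch of the two copies involved and where the synchronization goes. It is not Caffe code; the buffer and stream names are hypothetical.

#include <cstdio>
#include <cuda_runtime.h>

#define CHECK_CUDA(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
    printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

int main() {
  const int n = 1 << 20;
  float *prefetch_gpu, *top_gpu, *next_batch_cpu;
  CHECK_CUDA(cudaMalloc(&prefetch_gpu, n * sizeof(float)));
  CHECK_CUDA(cudaMalloc(&top_gpu, n * sizeof(float)));
  CHECK_CUDA(cudaMallocHost(&next_batch_cpu, n * sizeof(float)));

  // The prefetch thread's stream is non-blocking, so it is NOT implicitly
  // ordered against work queued on the legacy default stream.
  cudaStream_t prefetch_stream;
  CHECK_CUDA(cudaStreamCreateWithFlags(&prefetch_stream, cudaStreamNonBlocking));

  // Forward pass: device -> device copy on the default stream. This call
  // returns before the copy completes (asynchronous wrt the host).
  CHECK_CUDA(cudaMemcpy(top_gpu, prefetch_gpu, n * sizeof(float),
                        cudaMemcpyDeviceToDevice));

  // The fix: without this synchronization, the host -> device copy below can
  // start overwriting prefetch_gpu while the copy above is still reading it.
  CHECK_CUDA(cudaStreamSynchronize(cudaStreamLegacy));  // CUDA 7+; or pass 0

  // Prefetch thread: copy the next batch into the same prefetch buffer.
  CHECK_CUDA(cudaMemcpyAsync(prefetch_gpu, next_batch_cpu, n * sizeof(float),
                             cudaMemcpyHostToDevice, prefetch_stream));

  CHECK_CUDA(cudaDeviceSynchronize());
  CHECK_CUDA(cudaStreamDestroy(prefetch_stream));
  CHECK_CUDA(cudaFreeHost(next_batch_cpu));
  CHECK_CUDA(cudaFree(prefetch_gpu));
  CHECK_CUDA(cudaFree(top_gpu));
  return 0;
}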
Please note: Caffe as such, with few exceptions, uses the default stream with no explicit synchronization. Layer calls are asynchronous wrt the host. (That's why there's, e.g., #2077.)
caffe_copy (as cudaMemcpy) is asynchronous wrt the host when device -> device. If you create a non-blocking stream, don't expect it to be synchronous wrt any existing Caffe GPU code.
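As a small illustration of that last point (a standalone example for clarity, not Caffe code):

#include <cuda_runtime.h>

int main() {
  cudaStream_t blocking_stream, non_blocking_stream;

  // A stream created this way is a "blocking" stream: work on it and work on
  // the legacy default (null) stream are implicitly ordered against each other.
  cudaStreamCreate(&blocking_stream);

  // A non-blocking stream has no implicit ordering with the null stream, so
  // work issued on it may overlap with existing default-stream GPU code.
  cudaStreamCreateWithFlags(&non_blocking_stream, cudaStreamNonBlocking);

  cudaStreamDestroy(blocking_stream);
  cudaStreamDestroy(non_blocking_stream);
  return 0;
}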