
Commit fe4f19e

xwang233 authored and facebook-github-bot committed on Jul 30, 2020
[CUDA] max_pool2d NCHW performance improvement (pytorch#42182)
Summary:

Fix the regression introduced in pytorch#38953. See https://github.com/xwang233/code-snippet/blob/master/max-pool2d-nchw-perf/max-pool2d.ipynb for detailed before-and-after performance comparisons.

Performance improvement for backward max_pool2d before and after this PR (a negative value means a speed-up):

[image: backward max_pool2d timing comparison] (https://user-images.githubusercontent.com/24860335/88712204-363c8e00-d0ce-11ea-8586-057e09b16103.png)

The forward kernel does not seem to benefit much from a similar change (pytorch@1718f0c), so the forward path is left unchanged.

Pull Request resolved: pytorch#42182
Reviewed By: albanD
Differential Revision: D22829498
Pulled By: ngimel
fbshipit-source-id: 4c81968fe072f4e264e70c70ade4c32d760a3af4
1 parent c18223f commit fe4f19e

File tree

1 file changed: +1 −1 lines changed

aten/src/ATen/native/cuda/DilatedMaxPool2d.cu (+1 −1)
@@ -175,7 +175,7 @@ __global__ void max_pool_backward_nchw(const int nthreads, const scalar_t* top_d
                                        scalar_t* bottom_diff) {
   CUDA_KERNEL_LOOP(index, height*width) {
     int h = index / width;
-    int w = index % width;
+    int w = index - h * width;
     int phstart = p_start(h, pad_h, kernel_h, dilation_h, stride_h);
     int phend = p_end(h, pad_h, pooled_height, stride_h);
     int pwstart = p_start(w, pad_w, kernel_w, dilation_w, stride_w);
