Commit 99e0a87

jjsjann123 authored and facebook-github-bot committed Sep 25, 2020
[nvFuser] Latency improvements for pointwise + reduction fusion (pytorch#45218)
Summary:
A lot of changes are in this update, some highlights:

- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved sync threads placement with shared memory and removed a read-before-write race
- Fixes to FP16 reduction fusions where output would come back as FP32

Pull Request resolved: pytorch#45218

Reviewed By: ezyang

Differential Revision: D23905183

Pulled By: soumith

fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
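The summary mentions a cache from inputs to kernels and launch parameters, with an eviction policy, but does not say how it is keyed or which policy is used. The following is a minimal sketch of one common choice, a bounded least-recently-used (LRU) cache. Everything in it is an assumption for illustration: `LaunchParams`, `LaunchParamCache`, the string key, and the LRU policy itself are hypothetical and are not taken from the nvFuser sources.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical launch configuration; the field names are illustrative only.
struct LaunchParams {
  int grid_x;
  int block_x;
};

// Bounded cache keyed by an input descriptor (assumed here to be a string
// such as "8x128"). The least recently used entry is evicted when full.
class LaunchParamCache {
 public:
  explicit LaunchParamCache(std::size_t capacity) : capacity_(capacity) {
    assert(capacity_ > 0);
  }

  // Returns a pointer to the cached parameters, or nullptr on a miss.
  // A hit moves the entry to the front (most recently used).
  // The pointer stays valid until that entry is evicted.
  const LaunchParams* lookup(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) {
      return nullptr;
    }
    entries_.splice(entries_.begin(), entries_, it->second);
    return &it->second->second;
  }

  // Inserts or overwrites an entry, evicting the oldest one when full.
  void insert(const std::string& key, const LaunchParams& params) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = params;
      entries_.splice(entries_.begin(), entries_, it->second);
      return;
    }
    if (entries_.size() == capacity_) {
      index_.erase(entries_.back().first);
      entries_.pop_back();
    }
    entries_.emplace_front(key, params);
    index_[key] = entries_.begin();
  }

  std::size_t size() const { return entries_.size(); }

 private:
  using Entry = std::pair<std::string, LaunchParams>;
  std::size_t capacity_;
  std::list<Entry> entries_;  // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

With a capacity of 2, inserting a third key evicts whichever of the first two was touched least recently; a repeated lookup with the same key returns the stored parameters without recomputing them.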
1 parent 95df865 commit 99e0a87

84 files changed, +8911 -3188 lines changed

aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h

+1
@@ -42,6 +42,7 @@ namespace at { namespace cuda {
   _(nvrtcGetProgramLog) \
   _(nvrtcGetLoweredName) \
   _(cuModuleLoadData) \
+  _(cuModuleLoadDataEx) \
   _(cuModuleGetFunction) \
   _(cuOccupancyMaxActiveBlocksPerMultiprocessor) \
   _(cuGetErrorString) \
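
The `_(name) \` lines in this hunk look like entries in an X-macro list, where one list of names is expanded several times with different definitions of `_`. The sketch below shows that general pattern under that assumption. The names `FOR_EACH_STUB`, `StubTable`, `stubNames`, and the `fake*` functions are hypothetical stand-ins, not the contents of `ATenNVRTC.h` or the CUDA driver API.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for driver entry points; not the real CUDA API.
inline int fakeLoadData() { return 1; }
inline int fakeLoadDataEx() { return 2; }
inline int fakeGetFunction() { return 3; }

// A single list of names. Each use site passes its own macro as `_`.
// Adding one line here updates every expansion below.
#define FOR_EACH_STUB(_) \
  _(fakeLoadData)        \
  _(fakeLoadDataEx)      \
  _(fakeGetFunction)

// Expansion 1: a struct with one function-pointer member per listed name,
// each initialized to the global function of the same name.
struct StubTable {
#define DECLARE_MEMBER(name) decltype(&::name) name = &::name;
  FOR_EACH_STUB(DECLARE_MEMBER)
#undef DECLARE_MEMBER
};

// Expansion 2: the same list turned into strings, e.g. for symbol lookup.
inline std::vector<std::string> stubNames() {
  std::vector<std::string> names;
#define APPEND_NAME(name) names.push_back(#name);
  FOR_EACH_STUB(APPEND_NAME)
#undef APPEND_NAME
  return names;
}
```

Under this pattern, a one-line addition to the list (as in the hunk above) is enough to give the table a new member and to add the name to any other expansion of the list.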

caffe2/CMakeLists.txt

+4
@@ -506,6 +506,7 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE)
   ${TORCH_SRC_DIR}/csrc/cuda/comm.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/arith.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/compute_at.cpp
+  ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/codegen.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/dispatch.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/expr_evaluator.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/executor.cpp
@@ -515,6 +516,7 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE)
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/fusion.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/graph_fuser.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/index_compute.cpp
+  ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/instrumentation.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/ir_base_nodes.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/ir_cloner.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/ir_graphviz.cpp
@@ -524,7 +526,9 @@ if(NOT INTERN_BUILD_MOBILE OR NOT BUILD_CAFFE2_MOBILE)
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/kernel.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/kernel_cache.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/kernel_ir.cpp
+  ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/kernel_ir_builder.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/lower_index.cpp
+  ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/lower_insert_syncs.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/lower_loops.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/lower_thread_predicate.cpp
   ${TORCH_SRC_DIR}/csrc/jit/codegen/cuda/lower_unroll.cpp
