
Commit 79b8328

vkuzo authored and facebook-github-bot committed on Aug 8, 2020
optimize_for_mobile: bring packed params to root module (pytorch#42740)
Summary:

Pull Request resolved: pytorch#42740

Adds a pass to hoist conv packed params to root module. The benefit is that if there is nothing else in the conv module, subsequent passes will delete it, which will reduce module size. For context, freezing does not handle this because conv packed params is a custom object.

Test Plan:

```
PYTORCH_JIT_LOG_LEVEL=">hoist_conv_packed_params.cpp" python test/test_mobile_optimizer.py TestOptimizer.test_hoist_conv_packed_params
```

Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23005961

fbshipit-source-id: 31ab1f5c42a627cb74629566483cdc91f3770a94
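For readers who want to see the pass end to end, here is a minimal, hedged Python sketch assembled from the test added in this commit. The module ``M``, the calibration input, and the final assertion mirror the test but are illustrative only; it assumes a build with XNNPACK and the qnnpack quantized engine available.

```python
# A minimal sketch, not part of the commit: quantize a tiny model with the
# qnnpack engine, script it, run optimize_for_mobile, and observe that the
# conv submodule is gone because its packed params were hoisted to the root.
import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

class M(nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv1 = nn.Conv2d(1, 1, 1)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.conv1(self.quant(x)))

torch.backends.quantized.engine = 'qnnpack'
m = M()
m.qconfig = torch.quantization.get_default_qconfig('qnnpack')
torch.quantization.prepare(m, inplace=True)
m(torch.randn(4, 1, 4, 4))                    # calibration run
torch.quantization.convert(m, inplace=True)   # conv1 now holds _packed_params

m_optim = optimize_for_mobile(torch.jit.script(m))

# The hoisting pass copied conv1's packed params to a root-module attribute
# (prefixed "_jit_pass_hoist_conv_packed_params"), so freezing could drop conv1.
assert not hasattr(m_optim, "conv1")
```

The new test below exercises both a standalone module and a parent/child hierarchy and makes the equivalent checks with ``FileCheck`` on the optimized graph.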
1 parent d8801f5 commit 79b8328

9 files changed: +268 -6 lines
 

docs/source/mobile_optimizer.rst (+1)

@@ -12,6 +12,7 @@ By default, if optimization blacklist is None or empty, ``optimize_for_mobile``
 - **Insert and Fold prepacked ops** (blacklisting option `MobileOptimizerType::INSERT_FOLD_PREPACK_OPS`): This optimization pass rewrites the graph to replace 2D convolutions and linear ops with their prepacked counterparts. Prepacked ops are stateful: they require some state to be created up front, such as prepacked weights, and they use that state during op execution. XNNPACK is one such backend that provides prepacked ops, with kernels optimized for mobile platforms (such as ARM CPUs). Prepacking the weights enables efficient memory access and thus faster kernel execution. At the moment, ``optimize_for_mobile`` rewrites the graph to replace ``Conv2D/Linear`` with 1) an op that pre-packs the weights for the XNNPACK conv2d/linear ops and 2) an op that takes the pre-packed weights and activations as input and generates the output activations. Since 1 needs to happen only once, we fold the weight pre-packing so that it is done only once, at model load time. This pass of ``optimize_for_mobile`` does 1 and 2 and then folds, i.e. removes, the weight pre-packing ops.
 - **ReLU/Hardtanh fusion**: XNNPACK ops support fusion of clamping. That is, clamping of the output activation is done as part of the kernel, including for 2D convolution and linear op kernels, so clamping effectively comes for free. Thus any op that can be expressed as a clamping op, such as ``ReLU`` or ``hardtanh``, can be fused with the preceding ``Conv2D`` or ``linear`` op in XNNPACK. This pass rewrites the graph by finding ``ReLU/hardtanh`` ops that follow the XNNPACK ``Conv2D/linear`` ops written by the previous pass, and fuses them together.
 - **Dropout removal** (blacklisting option `MobileOptimizerType::REMOVE_DROPOUT`): This optimization pass removes ``dropout`` and ``dropout_`` nodes from the module when training is false.
+- **Conv packed params hoisting** (blacklisting option `MobileOptimizerType::HOIST_CONV_PACKED_PARAMS`): This optimization pass moves convolution packed params to the root module, so that the convolution structs can be deleted. This decreases model size without impacting numerics.
 
 ``optimize_for_mobile`` will also invoke the freeze_module pass, which preserves only the ``forward`` method. If you have other methods that need to be preserved, add them to the preserved methods list and pass it into ``optimize_for_mobile``.
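For orientation, here is a small, hedged usage sketch of the options described above. It assumes a build with XNNPACK enabled; the model, the exported ``preprocess`` method, and the file name are illustrative, and the positional argument order (blacklist set, then preserved methods) reflects the Python wrapper as of this commit and should be double-checked against the installed version.

```python
# Hedged sketch, not from the commit: disable the new hoisting pass via the
# optimization blacklist and keep an extra exported method alive through the
# freeze_module pass. TinyModel, preprocess, and the file name are made up.
import torch
import torch.nn as nn
from torch._C import MobileOptimizerType
from torch.utils.mobile_optimizer import optimize_for_mobile

class TinyModel(nn.Module):
    def __init__(self):
        super(TinyModel, self).__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

    @torch.jit.export
    def preprocess(self, x):  # hypothetical extra method to preserve
        return x * 2.0

scripted = torch.jit.script(TinyModel())
optimized = optimize_for_mobile(
    scripted,
    {MobileOptimizerType.HOIST_CONV_PACKED_PARAMS},  # optimization blacklist
    ['preprocess'],                                  # preserved methods
)
torch.jit.save(optimized, "tiny_optimized.pt")
```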

test/test_mobile_optimizer.py (+97 -4)

@@ -121,8 +121,6 @@ def forward(self, x):
         optimization_blacklist_no_prepack = {MobileOptimizerType.INSERT_FOLD_PREPACK_OPS}
         bn_fold_scripted_module = optimize_for_mobile(bn_scripted_module, optimization_blacklist_no_prepack)
         self.assertEqual(len(torch.jit.export_opnames(bn_fold_scripted_module)), 1)
-        FileCheck().check_count("prim::CallMethod[name=\"forward\"]", 1, exactly=True) \
-            .run(str(get_forward_graph(bn_fold_scripted_module._c)))
         bn_input = torch.rand(1, 1, 6, 6)
         torch.testing.assert_allclose(bn_scripted_module(bn_input), bn_fold_scripted_module(bn_input), rtol=1e-2, atol=1e-3)

@@ -201,8 +199,6 @@ def forward(self, x):
         model = torch.jit.script(model)
         # this line should not have ASAN failures
         model_optim = optimize_for_mobile(model)
-        self.assertFalse(hasattr(model_optim.conv1, "bias"))
-        self.assertFalse(hasattr(model_optim.child.conv2, "bias"))
 
     def test_generate_mobile_module_lints(self):
         class MyTestModule(torch.nn.Module):
@@ -255,5 +251,102 @@ def get_lint_count_by_type(lint_type, module_lint_List):
         bi_module_lint_list = generate_mobile_module_lints(bi_module)
         self.assertEqual(len(bi_module_lint_list), 0)
 
+    @unittest.skipUnless(torch.backends.xnnpack.enabled,
+                         " XNNPACK must be enabled for these tests."
+                         " Please build with USE_XNNPACK=1.")
+    def test_hoist_conv_packed_params(self):
+
+        if 'qnnpack' not in torch.backends.quantized.supported_engines:
+            return
+
+        class Standalone(nn.Module):
+            def __init__(self):
+                super(Standalone, self).__init__()
+                self.quant = torch.quantization.QuantStub()
+                self.conv1 = nn.Conv2d(1, 1, 1)
+                self.conv2 = nn.Conv2d(1, 1, 1)
+                self.relu = nn.ReLU()
+                self.dequant = torch.quantization.DeQuantStub()
+
+            def forward(self, x):
+                x = self.quant(x)
+                x = self.conv1(x)
+                x = self.conv2(x)
+                x = self.relu(x)
+                x = self.dequant(x)
+                return x
+
+            def fuse_model(self):
+                torch.quantization.fuse_modules(self, [['conv2', 'relu']], inplace=True)
+                pass
+
+        class Child(nn.Module):
+            def __init__(self):
+                super(Child, self).__init__()
+                self.conv1 = nn.Conv2d(1, 1, 1)
+
+            def forward(self, x):
+                x = self.conv1(x)
+                return x
+
+        class Parent(nn.Module):
+            def __init__(self):
+                super(Parent, self).__init__()
+                self.quant = torch.quantization.QuantStub()
+                self.conv1 = nn.Conv2d(1, 1, 1)
+                self.child = Child()
+                # TODO: test nn.Sequential after #42039 is fixed
+                self.dequant = torch.quantization.DeQuantStub()
+
+            def forward(self, x):
+                x = self.quant(x)
+                x = self.conv1(x)
+                x = self.child(x)
+                x = self.dequant(x)
+                return x
+
+            def fuse_model(self):
+                pass
+
+        with override_quantized_engine('qnnpack'):
+            def _quant_script_and_optimize(model):
+                model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
+                model.fuse_model()
+                torch.quantization.prepare(model, inplace=True)
+                model(torch.randn(4, 1, 4, 4))
+                torch.quantization.convert(model, inplace=True)
+                model = torch.jit.script(model)
+                model_optim = optimize_for_mobile(model)
+                return model, model_optim
+
+            # basic case
+
+            m, m_optim = _quant_script_and_optimize(Standalone())
+            FileCheck().check_not("Conv2d = prim::GetAttr[name=\"conv1\"]") \
+                .check_count("_jit_pass_hoist_conv_packed_params", 2, exactly=True) \
+                .run(m_optim.graph)
+            self.assertFalse(hasattr(m_optim, "conv1"))
+            self.assertFalse(hasattr(m_optim, "conv2"))
+
+            data = torch.randn(4, 1, 4, 4)
+            m_res = m(data)
+            m_optim_res = m_optim(data)
+            torch.testing.assert_allclose(m_res, m_optim_res, rtol=1e-2, atol=1e-3)
+
+            # generic case
+
+            m, m_optim = _quant_script_and_optimize(Parent())
+            FileCheck().check_not("Conv2d = prim::GetAttr[name=\"conv1\"]") \
+                .check_count("_jit_pass_hoist_conv_packed_params", 2, exactly=True) \
+                .run(m_optim.graph)
+            self.assertFalse(hasattr(m_optim, "conv1"))
+            self.assertFalse(hasattr(m_optim, "child"))
+
+            data = torch.randn(4, 1, 4, 4)
+            m_res = m(data)
+            m_optim_res = m_optim(data)
+            torch.testing.assert_allclose(m_res, m_optim_res, rtol=1e-2, atol=1e-3)
+
+
 if __name__ == '__main__':
     unittest.main()

tools/build_variables.bzl (+1)

@@ -173,6 +173,7 @@ core_sources_full = [
     "torch/csrc/jit/passes/graph_fuser.cpp",
     "torch/csrc/jit/passes/graph_rewrite_helper.cpp",
     "torch/csrc/jit/passes/guard_elimination.cpp",
+    "torch/csrc/jit/passes/hoist_conv_packed_params.cpp",
     "torch/csrc/jit/passes/inline_autodiff_subgraphs.cpp",
     "torch/csrc/jit/passes/inline_forked_closures.cpp",
     "torch/csrc/jit/passes/inliner.cpp",

torch/csrc/jit/passes/hoist_conv_packed_params.cpp (new file, +134)

@@ -0,0 +1,134 @@
+#include <stack>
+
+#include <torch/csrc/jit/api/module.h>
+#include <torch/csrc/jit/jit_log.h>
+#include <torch/csrc/jit/passes/constant_pooling.h>
+#include <torch/csrc/jit/passes/constant_propagation.h>
+#include <torch/csrc/jit/passes/hoist_conv_packed_params.h>
+#include <torch/csrc/jit/passes/quantization/helper.h>
+
+namespace torch {
+namespace jit {
+
+// Hoists packed params from a conv module to the parent module.
+// The benefit is that after this hoisting, the conv module
+// no longer holds anything and can be deleted, reducing model
+// size.
+//
+// Before (easy case):
+//
+// %1 = prim::GetAttr[name="conv1"][%self]
+// %2 = prim::GetAttr[name="_packed_params][%1]
+//
+// After (easy case):
+//
+// %2 = prim::GetAttr[name="{prefix}.conv1._packed_params"][%self]
+//
+// Before (generic case):
+//
+// %1 = prim::GetAttr[name="name1"][%self]
+// %2 = prim::GetAttr[name="name2"][%1]
+// ...
+// %n = prim::GetAttr[name="_packed_params][%n-1]
+//
+// After (generic case):
+//
+// %n =
+//   prim::GetAttr[name="{prefix}.name1{...}.name(n-1)._packed_params"][%self]
+//
+void hoistConvPackedParams(
+    Module& rootModule,
+    Node* getConvPackedParamsNode,
+    const std::string& prefix,
+    int& nameUniqueCounter) {
+  auto method = rootModule.get_method("forward");
+  auto graph = method.graph();
+  Value* rootModuleAsValue = graph->inputs()[0];
+
+  // get a path from root module to conv module
+  Value* convModuleAsValue = getConvPackedParamsNode->inputs()[0];
+  std::vector<std::string> rootToConvPath =
+      getModuleAccessPath(convModuleAsValue, rootModuleAsValue);
+
+  // get a module object representing the conv
+  Module convModule = findChildModule(rootModule, rootToConvPath);
+
+  // get the packed params value
+  c10::IValue packedParams = convModule.attr("_packed_params");
+
+  // create the new name
+
+  std::string suffix = "";
+  for (const auto& attrName : rootToConvPath) {
+    suffix += attrName + ".";
+  }
+  std::string newNameBase = prefix + "." + suffix + "_packed_params";
+  nameUniqueCounter++;
+  std::string newName = newNameBase + "." + c10::to_string(nameUniqueCounter);
+  while (rootModule.hasattr(newName)) {
+    nameUniqueCounter++;
+    newName = newNameBase + "." + c10::to_string(nameUniqueCounter);
+  }
+
+  // copy the packed params
+  rootModule.register_attribute(newName, packedParams.type(), packedParams);
+
+  // change target module to rootModule
+  getConvPackedParamsNode->replaceInput(0, rootModuleAsValue);
+
+  // change attribute name to new name
+  getConvPackedParamsNode->s_(Symbol::attr("name"), newName);
+}
+
+void HoistConvPackedParams(script::Module& m) {
+  auto method = m.get_method("forward");
+  auto graph = method.graph();
+
+  std::stack<Block*> blocks_to_visit;
+  blocks_to_visit.push(graph->block());
+  std::string attr_name_base = "_jit_pass_hoist_conv_packed_params";
+  // counter to ensure new attribute names are unique
+  int nameUniqueCounter = 0;
+
+  while (!blocks_to_visit.empty()) {
+    Block* b = blocks_to_visit.top();
+    blocks_to_visit.pop();
+
+    for (Node* n : b->nodes()) {
+      // make sure this node is fetching {foo}.{_packed_params}
+      bool isGetPackedParamsNode =
+          n->kind() == prim::GetAttr && n->s(attr::name) == "_packed_params";
+      if (isGetPackedParamsNode) {
+        // make sure the foo in {foo}.{_packed_params} is a quantized conv
+        c10::optional<std::string> moduleName = getModuleName(n->inputs()[0]);
+        bool moduleNameIsQuantizedConv = moduleName.has_value() &&
+            (moduleName.value() ==
+                 "__torch__.torch.nn.quantized.modules.conv.Conv1d" ||
+             moduleName.value() ==
+                 "__torch__.torch.nn.quantized.modules.conv.Conv2d" ||
+             moduleName.value() ==
+                 "__torch__.torch.nn.quantized.modules.conv.Conv3d" ||
+             moduleName.value() ==
+                 "__torch__.torch.nn.intrinsic.quantized.modules.conv_relu.ConvReLU1d" ||
+             moduleName.value() ==
+                 "__torch__.torch.nn.intrinsic.quantized.modules.conv_relu.ConvReLU2d" ||
+             moduleName.value() ==
+                 "__torch__.torch.nn.intrinsic.quantized.modules.conv_relu.ConvReLU3d");
+
+        if (moduleNameIsQuantizedConv) {
+          GRAPH_UPDATE("Hoisting ", *n, " to root module.");
+          hoistConvPackedParams(m, n, attr_name_base, nameUniqueCounter);
+        }
+      }
+
+      for (Block* subblock : n->blocks()) {
+        blocks_to_visit.push(subblock);
+      }
+
+    } // for
+
+  } // while
+}
+
+} // namespace jit
+} // namespace torch

torch/csrc/jit/passes/hoist_conv_packed_params.h (new file, +12)

@@ -0,0 +1,12 @@
+#pragma once
+
+#include <torch/csrc/jit/api/module.h>
+#include <torch/csrc/jit/ir/ir.h>
+
+namespace torch {
+namespace jit {
+
+void HoistConvPackedParams(script::Module& m);
+
+} // namespace jit
+} // namespace torch

torch/csrc/jit/passes/quantization/helper.cpp (+8 -2)

@@ -530,8 +530,10 @@ bool hitGraphInput(Value* value) {
 // Get the module access path for a Value representing a module instance
 // by tracing back the GetAttr nodes and recording all the attribute
 // names along the way.
-// For example, the module access path will be ['sub', 'basic_block', 'conv1']
-// for `self.sub.basic_block.conv1`
+// Assuming 'self.sub.basic_block.conv1',
+// Input1: Value instance of conv1
+// Input2: Value instance of self
+// Output: ['sub', 'basic_block', 'conv1']
 std::vector<std::string> getModuleAccessPath(Value* instance, Value* self) {
   std::vector<std::string> path;
   // Iterator to traverse back the GetAttr calls
@@ -555,6 +557,10 @@ std::vector<std::string> getModuleAccessPath(Value* instance, Value* self) {
   return path;
 }
 
+// Assuming self.foo.bar.conv1,
+// Input1: Module instance of self
+// Input2: ['foo', 'bar', 'conv1']
+// Output: Module instance of conv1
 Module findChildModule(
     const Module& module,
     const std::vector<std::string>& path) {

torch/csrc/jit/passes/xnnpack_rewrite.cpp (+11)

@@ -9,6 +9,8 @@
 #include <torch/csrc/jit/passes/fuse_linear.h>
 #include <torch/csrc/jit/passes/fuse_relu.h>
 #include <torch/csrc/jit/passes/graph_rewrite_helper.h>
+#include <torch/csrc/jit/passes/hoist_conv_packed_params.h>
+#include <torch/csrc/jit/passes/inliner.h>
 #include <torch/csrc/jit/passes/prepack_folding.h>
 #include <torch/csrc/jit/passes/remove_dropout.h>
 #include <torch/csrc/jit/passes/subgraph_rewrite.h>
@@ -294,6 +296,15 @@ script::Module optimizeForMobile(
     FoldPrePackingOps(cloned_module);
   }
 
+  if (!optimization_blocklist.count(
+          MobileOptimizerType::HOIST_CONV_PACKED_PARAMS)) {
+    // freeze again in case it was not done in previous optional passes
+    cloned_module = freeze_module(cloned_module, preserved_methods);
+    HoistConvPackedParams(cloned_module);
+    // and freeze yet again to remove the empty QuantizedConv modules
+    cloned_module = freeze_module(cloned_module, preserved_methods);
+  }
+
   // Run canonical optimizations post freezing
   // since freezing inlines the graph. Otherwise we
   // will have to explicitly call Inlining pass.

torch/csrc/jit/passes/xnnpack_rewrite.h (+1)

@@ -11,6 +11,7 @@ enum class MobileOptimizerType : int8_t {
   INSERT_FOLD_PREPACK_OPS,
   REMOVE_DROPOUT,
   FUSE_ADD_RELU,
+  HOIST_CONV_PACKED_PARAMS,
 };
 
 TORCH_API void insertPrePackedOps(std::shared_ptr<Graph>& graph);

torch/csrc/jit/python/init.cpp (+3)

@@ -725,6 +725,9 @@ void initJITBindings(PyObject* module) {
           MobileOptimizerType::INSERT_FOLD_PREPACK_OPS)
       .value("REMOVE_DROPOUT", MobileOptimizerType::REMOVE_DROPOUT)
       .value("FUSE_ADD_RELU", MobileOptimizerType::FUSE_ADD_RELU)
+      .value(
+          "HOIST_CONV_PACKED_PARAMS",
+          MobileOptimizerType::HOIST_CONV_PACKED_PARAMS)
       .export_values();
 
   // This allows PyTorchStreamReader to read from a Python buffer. It requires
