
Fuse instruction #7399

Merged
merged 106 commits into from
Feb 7, 2022

Conversation

lixinqi (Contributor) commented Jan 28, 2022

The VM does a fair amount of supporting work to execute each instruction, and this overhead is often not negligible compared with a kernel launch. This PR tries to fuse multiple instructions together so that the VM schedules them as one unit; the VM's per-instruction scheduling cost is then amortized across the fused instructions.
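The idea above can be sketched as follows. This is a minimal illustration, not OneFlow's real API: `InstrMsg`, `FusedInstruction`, and `Fuse` are hypothetical names standing in for the PR's actual types.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A stand-in for one per-op instruction message.
struct InstrMsg {
  std::string op_name;
};

// Several instruction messages dispatched as a single scheduling unit,
// so the scheduler pays its bookkeeping cost once instead of once per op.
struct FusedInstruction {
  std::vector<InstrMsg> fused;

  // One scheduler dispatch runs every contained instruction.
  int ComputeAll() const {
    int launched = 0;
    for (const auto& msg : fused) {
      // ... launch msg's kernel here ...
      (void)msg;
      ++launched;
    }
    return launched;
  }
};

// Pack a pending list into a single fused instruction.
FusedInstruction Fuse(std::vector<InstrMsg> pending) {
  FusedInstruction fused_instr;
  fused_instr.fused = std::move(pending);
  return fused_instr;
}
```

With three pending instructions, one dispatch of the fused unit launches all three kernels, while the scheduler's own overhead is paid once.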

lixinqi and others added 30 commits December 16, 2021 13:51
…ched_nccl stream; 2) VirtualMachineEngine::local_pending_msg_list_;
intrusive::ChannelStatus status = mut_pending_instruction_list()->MoveTo(&tmp_list);
intrusive::ChannelStatus status = (mut_pending_instruction_list()->*Move)(&tmp_list);
*cnt = tmp_list.size();
if (*cnt == 0) { return status; }
Contributor

What is this `size_t* cnt` for? It looks like it could just be a local variable.

Contributor Author

Because line 44 needs to return it.
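The pattern being discussed is a function whose return value is already taken by the channel status, so the count has to travel through an out-parameter. A minimal sketch, with illustrative names (`ChannelStatus`, `MovePending` are simplified stand-ins for the real code):

```cpp
#include <cassert>
#include <cstddef>
#include <list>

enum class ChannelStatus { kSuccess, kClosed };

// The return slot carries the status; the moved-element count must
// therefore go out through `cnt`, which the caller also needs.
ChannelStatus MovePending(std::list<int>* src, std::list<int>* dst, size_t* cnt) {
  dst->splice(dst->end(), *src);  // analogue of MoveTo: drain src into dst
  *cnt = dst->size();
  if (*cnt == 0) { return ChannelStatus::kClosed; }
  return ChannelStatus::kSuccess;
}
```

If `cnt` were a local, the caller could not learn how many instructions were moved without a second call or a struct return.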

@@ -42,6 +50,7 @@ class InstructionType {
virtual void Compute(VirtualMachineEngine* vm, InstructionMsg* instr_msg) const {
LOG(FATAL) << "UNIMPLEMENTED";
}
virtual void ComputeInFuseMode(InstructionMsg* instr_msg) const { LOG(FATAL) << "UNIMPLEMENTED"; }
Contributor

Can this be removed, calling Compute directly instead?

Contributor Author

Because Compute's parameter type is Instruction*, ComputeInFuseMode is still needed for now. Once the single-client code is removed, InstructionMsg will be merged with Instruction, and this can then be unified into a single Compute.
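The constraint being described is that the two entry points take different parameter types, so they cannot yet collapse into one virtual method. A simplified sketch (all class and member names here are illustrative stand-ins, not the real signatures):

```cpp
#include <cassert>
#include <string>

struct Instruction { std::string name; };
// To be merged with Instruction once the single-client code is removed.
struct InstructionMsg { std::string name; };

struct InstructionType {
  virtual ~InstructionType() = default;

  // Normal path: the scheduler hands over a full Instruction.
  virtual std::string Compute(const Instruction& instr) const {
    return "compute:" + instr.name;
  }

  // Fuse path: only an InstructionMsg exists at this point,
  // hence the separate entry point with a different parameter type.
  virtual std::string ComputeInFuseMode(const InstructionMsg& msg) const {
    return "fused:" + msg.name;
  }
};
```

Once `InstructionMsg` and `Instruction` become one type, both entry points can share the single `Compute` signature.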

@@ -61,6 +61,7 @@ class StreamType {
int64_t this_machine_id) const = 0;

virtual bool OnSchedulerThread() const = 0;
virtual bool EnableInstructionFuse() const { return false; }
Contributor

This method does not appear to be used anywhere.

@oneflow-ci-bot oneflow-ci-bot self-requested a review February 7, 2022 12:47
@lixinqi lixinqi requested a review from BBuf February 7, 2022 13:07
@oneflow-ci-bot oneflow-ci-bot removed their request for review February 7, 2022 13:09
int cnt = kLimit;
InstructionMsgList pending_instr_msgs;
constexpr static int kPendingHandleWindow = 10;
GetRewritedPendingInstructionsByWindowSize(kPendingHandleWindow, &pending_instr_msgs);
Contributor

Does this mean at most 10 instructions are fused? How were the number of instructions to fuse, and the launch overhead saved by fusing, determined? Suppose 100 tensors go through a concat to become one tensor; after dispatching that concat's LocalCallOpKernel there would be 100 fusable instructions. In that case, could all 100 be fused together?

Contributor Author

This 10 certainly caps how much can be fused, but the value was never meant to control fusion. As the name kPendingHandleWindow suggests, its purpose is to keep the scheduler from focusing exclusively on the pending list: after handling every 10 instructions, it should check whether instructions on other streams have completed.

kPendingHandleWindow could certainly be raised to 100, but there is little point. Larger is not better: the larger the window, the more likely instructions on other streams are starved.
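The windowing behavior described above can be sketched as a bounded drain of the pending list per scheduler pass. Apart from `kPendingHandleWindow` itself, all names here are illustrative:

```cpp
#include <cassert>
#include <deque>

// Handle at most this many pending instructions per pass, then yield
// so completed instructions on other streams can be observed.
constexpr int kPendingHandleWindow = 10;

// One scheduler pass over the pending list; returns how many
// instructions it consumed before hitting the window limit.
int HandleOnePass(std::deque<int>* pending) {
  int handled = 0;
  while (!pending->empty() && handled < kPendingHandleWindow) {
    pending->pop_front();  // dispatch one pending instruction
    ++handled;
  }
  // The caller now polls other streams before starting the next pass.
  return handled;
}
```

With 25 pending instructions, three passes of 10, 10, and 5 drain the list, and between passes the scheduler gets a chance to notice completions elsewhere; a much larger window would delay those checks.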

@oneflow-ci-bot oneflow-ci-bot removed their request for review February 7, 2022 14:09
@oneflow-ci-bot oneflow-ci-bot self-requested a review February 7, 2022 14:11
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot February 7, 2022 16:13
@github-actions

github-actions bot commented Feb 7, 2022

Speed stats:
GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 129.1ms (= 12912.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 137.7ms (= 13773.0ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.07 (= 137.7ms / 129.1ms)

✔️ OneFlow resnet50 time: 75.1ms (= 7508.9ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.2ms (= 8420.8ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 84.2ms / 75.1ms)

OneFlow resnet50 time: 50.3ms (= 10053.0ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 54.5ms (= 10902.9ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.08 (= 54.5ms / 50.3ms)

OneFlow resnet50 time: 43.0ms (= 8605.4ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 45.2ms (= 9038.9ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.05 (= 45.2ms / 43.0ms)

OneFlow resnet50 time: 36.6ms (= 7311.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 42.3ms (= 8453.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.16 (= 42.3ms / 36.6ms)

✔️ OneFlow resnet50 time: 144.1ms (= 14411.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.6ms (= 16055.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 160.6ms / 144.1ms)

OneFlow resnet50 time: 88.3ms (= 8825.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.1ms (= 10112.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 101.1ms / 88.3ms)

OneFlow resnet50 time: 62.0ms (= 12396.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.7ms (= 14748.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 73.7ms / 62.0ms)

OneFlow resnet50 time: 54.9ms (= 10983.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.3ms (= 13067.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 65.3ms / 54.9ms)

OneFlow resnet50 time: 53.0ms (= 10600.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 59.1ms (= 11813.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 59.1ms / 53.0ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review February 7, 2022 17:45
@oneflow-ci-bot oneflow-ci-bot merged commit cb5b274 into master Feb 7, 2022
@oneflow-ci-bot oneflow-ci-bot deleted the fuse_instruction branch February 7, 2022 17:47
5 participants