
Fuse instruction #7399

Merged
merged 106 commits into from
Feb 7, 2022

Conversation

lixinqi (Contributor) commented Jan 28, 2022

The VM does a fair amount of supporting work to execute each instruction, and this overhead is often not negligible compared with a kernel launch. This PR tries to fuse multiple instructions together so that the VM schedules them as one unit; the VM's per-instruction scheduling cost is then amortized across the fused instructions.
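The idea above can be sketched as follows. This is a minimal illustration, not OneFlow's real API: `InstrMsg`, `FusedInstruction`, and `Fuse` are hypothetical names standing in for the PR's actual types.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A stand-in for one per-op instruction message.
struct InstrMsg {
  std::string op_name;
};

// Several instruction messages dispatched as a single scheduling unit,
// so the scheduler pays its bookkeeping cost once instead of once per op.
struct FusedInstruction {
  std::vector<InstrMsg> fused;

  // One scheduler dispatch runs every contained instruction.
  int ComputeAll() const {
    int launched = 0;
    for (const auto& msg : fused) {
      // ... launch msg's kernel here ...
      (void)msg;
      ++launched;
    }
    return launched;
  }
};

// Pack a pending list into a single fused instruction.
FusedInstruction Fuse(std::vector<InstrMsg> pending) {
  FusedInstruction fused_instr;
  fused_instr.fused = std::move(pending);
  return fused_instr;
}
```

With three pending instructions, one dispatch of the fused unit launches all three kernels, while the scheduler's own overhead is paid once.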

lixinqi and others added 30 commits December 16, 2021 13:51
…ched_nccl stream; 2) VirtualMachineEngine::local_pending_msg_list_;
intrusive::ChannelStatus status = mut_pending_instruction_list()->MoveTo(&tmp_list);
intrusive::ChannelStatus status = (mut_pending_instruction_list()->*Move)(&tmp_list);
*cnt = tmp_list.size();
if (*cnt == 0) { return status; }
Contributor

What is this `size_t* cnt` for? It looks like it could just be a local variable.

Contributor Author

Because line 44 needs to return it.
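The pattern being discussed is a function whose return value is already taken by the channel status, so the count has to travel through an out-parameter. A minimal sketch, with illustrative names (`ChannelStatus`, `MovePending` are simplified stand-ins for the real code):

```cpp
#include <cassert>
#include <cstddef>
#include <list>

enum class ChannelStatus { kSuccess, kClosed };

// The return slot carries the status; the moved-element count must
// therefore go out through `cnt`, which the caller also needs.
ChannelStatus MovePending(std::list<int>* src, std::list<int>* dst, size_t* cnt) {
  dst->splice(dst->end(), *src);  // analogue of MoveTo: drain src into dst
  *cnt = dst->size();
  if (*cnt == 0) { return ChannelStatus::kClosed; }
  return ChannelStatus::kSuccess;
}
```

If `cnt` were a local, the caller could not learn how many instructions were moved without a second call or a struct return.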

@@ -42,6 +50,7 @@ class InstructionType {
virtual void Compute(VirtualMachineEngine* vm, InstructionMsg* instr_msg) const {
LOG(FATAL) << "UNIMPLEMENTED";
}
virtual void ComputeInFuseMode(InstructionMsg* instr_msg) const { LOG(FATAL) << "UNIMPLEMENTED"; }
Contributor

Can this be removed, calling Compute directly instead?

Contributor Author

Because Compute's parameter type is Instruction*, ComputeInFuseMode is still needed for now. Once the single-client code is removed, InstructionMsg will be merged with Instruction, and this can then be unified into a single Compute.
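The constraint being described is that the two entry points take different parameter types, so they cannot yet collapse into one virtual method. A simplified sketch (all class and member names here are illustrative stand-ins, not the real signatures):

```cpp
#include <cassert>
#include <string>

struct Instruction { std::string name; };
// To be merged with Instruction once the single-client code is removed.
struct InstructionMsg { std::string name; };

struct InstructionType {
  virtual ~InstructionType() = default;

  // Normal path: the scheduler hands over a full Instruction.
  virtual std::string Compute(const Instruction& instr) const {
    return "compute:" + instr.name;
  }

  // Fuse path: only an InstructionMsg exists at this point,
  // hence the separate entry point with a different parameter type.
  virtual std::string ComputeInFuseMode(const InstructionMsg& msg) const {
    return "fused:" + msg.name;
  }
};
```

Once `InstructionMsg` and `Instruction` become one type, both entry points can share the single `Compute` signature.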

@@ -61,6 +61,7 @@ class StreamType {
int64_t this_machine_id) const = 0;

virtual bool OnSchedulerThread() const = 0;
virtual bool EnableInstructionFuse() const { return false; }
Contributor

This method does not appear to be used anywhere.

@oneflow-ci-bot oneflow-ci-bot self-requested a review February 7, 2022 12:47
@lixinqi lixinqi requested a review from BBuf February 7, 2022 13:07
@oneflow-ci-bot oneflow-ci-bot removed their request for review February 7, 2022 13:09
int cnt = kLimit;
InstructionMsgList pending_instr_msgs;
constexpr static int kPendingHandleWindow = 10;
GetRewritedPendingInstructionsByWindowSize(kPendingHandleWindow, &pending_instr_msgs);
Contributor

Does this mean at most 10 instructions are fused? How were the number of instructions to fuse, and the launch overhead saved by fusing, determined? Suppose 100 tensors go through a concat to become one tensor; after dispatching that concat's LocalCallOpKernel there would be 100 fusable instructions. In that case, could all 100 be fused together?

Contributor Author

This 10 certainly caps how much can be fused, but the value was never meant to control fusion. As the name kPendingHandleWindow suggests, its purpose is to keep the scheduler from focusing exclusively on the pending list: after handling every 10 instructions, it should check whether instructions on other streams have completed.

kPendingHandleWindow could certainly be raised to 100, but there is little point. Larger is not better: the larger the window, the more likely instructions on other streams are starved.
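The windowing behavior described above can be sketched as a bounded drain of the pending list per scheduler pass. Apart from `kPendingHandleWindow` itself, all names here are illustrative:

```cpp
#include <cassert>
#include <deque>

// Handle at most this many pending instructions per pass, then yield
// so completed instructions on other streams can be observed.
constexpr int kPendingHandleWindow = 10;

// One scheduler pass over the pending list; returns how many
// instructions it consumed before hitting the window limit.
int HandleOnePass(std::deque<int>* pending) {
  int handled = 0;
  while (!pending->empty() && handled < kPendingHandleWindow) {
    pending->pop_front();  // dispatch one pending instruction
    ++handled;
  }
  // The caller now polls other streams before starting the next pass.
  return handled;
}
```

With 25 pending instructions, three passes of 10, 10, and 5 drain the list, and between passes the scheduler gets a chance to notice completions elsewhere; a much larger window would delay those checks.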

@oneflow-ci-bot oneflow-ci-bot removed their request for review February 7, 2022 14:09
@oneflow-ci-bot oneflow-ci-bot self-requested a review February 7, 2022 14:11
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot February 7, 2022 16:13
@github-actions

github-actions bot commented Feb 7, 2022

Speed stats:
GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 129.1ms (= 12912.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 137.7ms (= 13773.0ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.07 (= 137.7ms / 129.1ms)

✔️ OneFlow resnet50 time: 75.1ms (= 7508.9ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.2ms (= 8420.8ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 84.2ms / 75.1ms)

OneFlow resnet50 time: 50.3ms (= 10053.0ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 54.5ms (= 10902.9ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.08 (= 54.5ms / 50.3ms)

OneFlow resnet50 time: 43.0ms (= 8605.4ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 45.2ms (= 9038.9ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.05 (= 45.2ms / 43.0ms)

OneFlow resnet50 time: 36.6ms (= 7311.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 42.3ms (= 8453.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.16 (= 42.3ms / 36.6ms)

✔️ OneFlow resnet50 time: 144.1ms (= 14411.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.6ms (= 16055.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 160.6ms / 144.1ms)

OneFlow resnet50 time: 88.3ms (= 8825.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.1ms (= 10112.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 101.1ms / 88.3ms)

OneFlow resnet50 time: 62.0ms (= 12396.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.7ms (= 14748.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 73.7ms / 62.0ms)

OneFlow resnet50 time: 54.9ms (= 10983.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.3ms (= 13067.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 65.3ms / 54.9ms)

OneFlow resnet50 time: 53.0ms (= 10600.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 59.1ms (= 11813.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 59.1ms / 53.0ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review February 7, 2022 17:45
@oneflow-ci-bot oneflow-ci-bot merged commit cb5b274 into master Feb 7, 2022
@oneflow-ci-bot oneflow-ci-bot deleted the fuse_instruction branch February 7, 2022 17:47
5 participants