Session misc. #9502

Merged 13 commits into master on Dec 19, 2022

Conversation

Contributor

@leaves-zwx leaves-zwx commented Dec 1, 2022

  • Add a Reset interface to MultiClientSession (python/oneflow/framework/multi_client_session.py)
  • Remove SessionGlobalObjectsScope (oneflow/core/job/session_global_objects_scope.h)
  • Remove Cluster (oneflow/core/job/cluster.h)
  • Remove session APIs that are no longer used (oneflow/api/python/session/session.cpp)
  • Remove APIs in session_util that are no longer used (oneflow/api/python/framework/session_util.cpp)

@github-actions
Contributor

github-actions bot commented Dec 1, 2022

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.5ms (= 14053.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 163.1ms (= 16305.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 163.1ms / 140.5ms)

OneFlow resnet50 time: 86.3ms (= 8627.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 105.6ms (= 10557.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 105.6ms / 86.3ms)

OneFlow resnet50 time: 57.9ms (= 11579.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.8ms (= 15563.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 77.8ms / 57.9ms)

OneFlow resnet50 time: 45.7ms (= 9130.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.4ms (= 14084.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.54 (= 70.4ms / 45.7ms)

OneFlow resnet50 time: 40.5ms (= 8092.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.5ms (= 13699.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.69 (= 68.5ms / 40.5ms)

@github-actions
Contributor

github-actions bot commented Dec 1, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9502/

@github-actions
Contributor

github-actions bot commented Dec 2, 2022

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.8ms (= 14081.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 163.9ms (= 16391.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 163.9ms / 140.8ms)

OneFlow resnet50 time: 85.6ms (= 8560.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.3ms (= 11128.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 111.3ms / 85.6ms)

OneFlow resnet50 time: 57.9ms (= 11574.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.8ms (= 15569.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 77.8ms / 57.9ms)

OneFlow resnet50 time: 44.1ms (= 8813.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.7ms (= 15549.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.76 (= 77.7ms / 44.1ms)

OneFlow resnet50 time: 39.5ms (= 7907.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.3ms (= 13461.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.70 (= 67.3ms / 39.5ms)

@@ -43,6 +34,8 @@ ONEFLOW_API_PYBIND11_MODULE("", m) {
[](MultiClientSessionContext& session, const std::string& config_proto_str) {
return session.TryInit(config_proto_str).GetOrThrow();
})
.def("try_close",
[](MultiClientSessionContext& session) { return session.TryClose().GetOrThrow(); })
Contributor

Closing the Session here only calls try close and does not destruct the session itself, right?

And then try init re-initializes the resources that were released earlier?

Contributor Author

The session itself cannot be destructed. It is shared with nn_graph as a shared_ptr, so I cannot control its destruction here.
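
As a rough illustration of the exchange above, the sketch below models a reset that closes and re-initializes a context in place instead of destroying it. It assumes Reset amounts to try_close followed by try_init, which the thread suggests but does not spell out. FakeSessionContext and SketchSession are hypothetical stand-ins, not OneFlow classes.

```python
class FakeSessionContext:
    """Hypothetical stand-in for the C++ session context binding."""

    def __init__(self):
        self.is_initialized = False
        self.init_count = 0

    def try_init(self, config_proto_str):
        # Only (re)build resources when they are not currently held.
        if not self.is_initialized:
            self.is_initialized = True
            self.init_count += 1

    def try_close(self):
        # Release resources; the context object itself stays alive.
        self.is_initialized = False


class SketchSession:
    """Illustrative session whose reset() reuses the same context object."""

    def __init__(self, config_proto_str=""):
        self._config = config_proto_str
        self._ctx = FakeSessionContext()
        self._ctx.try_init(self._config)

    def reset(self):
        # Assumption: reset = close, then init again with the same config.
        self._ctx.try_close()
        self._ctx.try_init(self._config)


sess = SketchSession()
ctx_before = sess._ctx
sess.reset()
```

The context object is the same before and after reset(), which mirrors the point that other owners (nn_graph in the PR) may still hold a reference to it.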

return torch.nn.functional.relu(x)

def setUp(self):
    session_ctx.GetDefaultSession().Reset()
Contributor

Could Reset have a bad case? When Reset is called explicitly, a graph may have finished its current execution but still be needed later; would that trigger an error?

So Reset has a precondition: all previously created graphs must be destructed before reset is called?

Contributor Author

That is left to the user to control. Users do not need this interface anyway, because real-world scenarios do not create that many graphs.
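
The precondition discussed above (no graph may outlive a reset) is left to the caller in this PR. Purely as an illustration of what that precondition means, here is a sketch of a session that tracks live graphs with weak references and refuses to reset while any remain. GuardedSession and SketchGraph are hypothetical names; the real Reset performs no such check according to the reply above.

```python
import weakref


class GuardedSession:
    """Illustrative session that refuses to reset while graphs are alive."""

    def __init__(self):
        # WeakSet entries disappear automatically once a graph is collected.
        self._live_graphs = weakref.WeakSet()
        self.reset_count = 0

    def reset(self):
        if len(self._live_graphs) > 0:
            raise RuntimeError("destroy all graphs before calling reset()")
        self.reset_count += 1


class SketchGraph:
    """Hypothetical stand-in for a graph that depends on the session."""

    def __init__(self, session):
        self._session = session
        session._live_graphs.add(self)  # register so reset() can see it


sess = GuardedSession()
graph = SketchGraph(sess)

try:
    sess.reset()
    blocked = False
except RuntimeError:
    blocked = True  # the graph is still alive, so reset is rejected

del graph      # CPython frees the graph right away (no reference cycle here)
sess.reset()   # succeeds now that no graph is alive
```

The sketch relies on CPython's reference counting to drop the graph immediately after `del`; other interpreters may need an explicit garbage collection first.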

@github-actions
Contributor

github-actions bot commented Dec 2, 2022

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.8ms (= 13979.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 162.0ms (= 16197.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 162.0ms / 139.8ms)

OneFlow resnet50 time: 84.8ms (= 8477.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.1ms (= 10208.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 102.1ms / 84.8ms)

OneFlow resnet50 time: 57.5ms (= 11495.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.3ms (= 15659.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 78.3ms / 57.5ms)

OneFlow resnet50 time: 44.6ms (= 8924.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 81.2ms (= 16231.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.82 (= 81.2ms / 44.6ms)

OneFlow resnet50 time: 40.1ms (= 8012.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.6ms (= 13518.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.69 (= 67.6ms / 40.1ms)

@github-actions
Contributor

github-actions bot commented Dec 2, 2022

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.7ms (= 13969.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.0ms (= 16100.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 161.0ms / 139.7ms)

OneFlow resnet50 time: 85.2ms (= 8517.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.2ms (= 10223.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 102.2ms / 85.2ms)

OneFlow resnet50 time: 57.3ms (= 11469.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.4ms (= 17471.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.52 (= 87.4ms / 57.3ms)

OneFlow resnet50 time: 45.1ms (= 9025.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.4ms (= 13885.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.54 (= 69.4ms / 45.1ms)

OneFlow resnet50 time: 40.1ms (= 8012.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.2ms (= 13840.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.73 (= 69.2ms / 40.1ms)

config_proto.set_session_id(session_id);

CHECK(of::RegsterSessionId(session_id));
Contributor

The session id could also be optimized away in the future. The concept probably has no practical use.

Contributor Author

I tried removing this session_id, but it cannot be removed yet: scope still uses session_id. I also searched the whole codebase for session id and it is still used in many places; every path by which they obtain the session id would have to be changed. That is a lot of changes, so it is better left to a separate PR.

Contributor

OK. What I originally had in mind was removing the session id from scope, but that can be done later. scope is heavyweight anyway, and there is also symbol; let's refactor them together later.

static HashMap<int64_t, std::shared_ptr<Session>> id2session_map;
return &id2session_map;
}

std::vector<int64_t>* RegsiteredSessionIds() {
Contributor

How about removing session id outright while we are at it? There is no need to record the session-id-to-session mapping, since there is no scenario where multiple sessions exist at the same time, and the session id in scope can be removed too.

Contributor

Keeping only the global object for the default session is enough.

new_default_sess = MultiClientSession(env, oneflow._oneflow_internal.NewSessionId())
session_id = oneflow._oneflow_internal.NewSessionId()
assert oneflow._oneflow_internal.RegsterSessionId(session_id)
new_default_sess = MultiClientSession(env, session_id)
global _sess_id2sess
Contributor

This global is not needed; you can use the global sess directly and remove all id-related functions.

@chengtbf chengtbf added the graph graph mode label Dec 9, 2022
@strint
Contributor

strint commented Dec 9, 2022

Could we use this feature to try cleaning up sessions in CI?

Then add it directly to the CI unittests.

    os.environ.get("ONEFLOW_TEST_RESET_SESSION_PERIOD", "10")
)
if RESET_SESSION_COUNT >= reset_session_period:
    oneflow.framework.session_context.GetDefaultSession().Reset()
Contributor

Will this affect CI? For example, resetting automatically in CI.

Contributor Author

This makes CI call reset once every several test cases to reset the stream index counter, avoiding the stream_index > 4096 error.
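
The periodic reset described above can be sketched in isolation as follows. Only the environment variable name, the RESET_SESSION_COUNT name, and the `>=` comparison come from the diff fragment. FakeSession, maybe_reset_session, incrementing the counter before the check, and zeroing it after a reset are assumptions made for illustration.

```python
import os


class FakeSession:
    """Hypothetical stand-in for the default session; only counts resets."""

    def __init__(self):
        self.reset_calls = 0

    def Reset(self):
        self.reset_calls += 1


RESET_SESSION_COUNT = 0


def maybe_reset_session(session, environ=os.environ):
    """Reset the session once every N calls, with N read from the environment."""
    global RESET_SESSION_COUNT
    RESET_SESSION_COUNT += 1
    period = int(environ.get("ONEFLOW_TEST_RESET_SESSION_PERIOD", "10"))
    if RESET_SESSION_COUNT >= period:
        session.Reset()
        RESET_SESSION_COUNT = 0  # assumption: the counter restarts after a reset


session = FakeSession()
for _ in range(25):
    # An empty mapping is passed so the default period of 10 is used.
    maybe_reset_session(session, environ={})
```

In a test suite, a helper like this would typically be called from a shared setUp so that every test case advances the counter.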

@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.8ms (= 13982.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.2ms (= 16124.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 161.2ms / 139.8ms)

OneFlow resnet50 time: 84.7ms (= 8466.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 100.7ms (= 10073.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 100.7ms / 84.7ms)

OneFlow resnet50 time: 57.9ms (= 11575.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.6ms (= 15528.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 77.6ms / 57.9ms)

OneFlow resnet50 time: 45.0ms (= 9007.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.0ms (= 15992.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.78 (= 80.0ms / 45.0ms)

OneFlow resnet50 time: 40.8ms (= 8163.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.3ms (= 14858.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.82 (= 74.3ms / 40.8ms)

@github-actions
Contributor

CI failed when running job: cuda-speed-test. PR label automerge has been removed

@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.4ms (= 13939.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.9ms (= 16092.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 160.9ms / 139.4ms)

OneFlow resnet50 time: 85.2ms (= 8516.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.6ms (= 10155.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 101.6ms / 85.2ms)

OneFlow resnet50 time: 57.7ms (= 11534.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.9ms (= 15570.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 77.9ms / 57.7ms)

OneFlow resnet50 time: 44.0ms (= 8804.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.3ms (= 16052.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.82 (= 80.3ms / 44.0ms)

OneFlow resnet50 time: 40.7ms (= 8146.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.4ms (= 14877.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.83 (= 74.4ms / 40.7ms)

@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.6ms (= 13963.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.7ms (= 16065.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 160.7ms / 139.6ms)

OneFlow resnet50 time: 84.6ms (= 8459.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.5ms (= 10145.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 101.5ms / 84.6ms)

OneFlow resnet50 time: 57.1ms (= 11427.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.1ms (= 15411.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 77.1ms / 57.1ms)

OneFlow resnet50 time: 43.9ms (= 8787.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.7ms (= 15749.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.79 (= 78.7ms / 43.9ms)

OneFlow resnet50 time: 39.1ms (= 7821.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.7ms (= 13737.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.76 (= 68.7ms / 39.1ms)

@mergify mergify bot merged commit 658c6f4 into master Dec 19, 2022
@mergify mergify bot deleted the session_reset branch December 19, 2022 14:50
liujuncheng pushed a commit that referenced this pull request Jan 5, 2023
Fixes https://github.com/Oneflow-Inc/OneTeam/issues/1861. Sets is_shutting_down to true before TryCloseDefaultSession (PR #9502 introduced a Sync operation in TryCloseDefaultSession), and stores the pointer to the Python C object in the PyFrame object so that GetCurrentFrame no longer needs to fetch it repeatedly. This fixes a segfault caused by not holding the GIL.

Signed-off-by: daquexian <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>