Fix single-client reader parallel #6288

Merged: 6 commits, Sep 15, 2021

leaves-zwx
Contributor

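According to feedback from @ouyangyu, #6222 causes an accuracy drop for resnet50 under fp16 in the Benchmark when running in single-client mode.

According to feedback from @CPFLAME, lazy multi-GPU data reading is broken: duplicate data is read.

My guess is that certain lazy usages cause the data reader op to take the ddp branch here.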
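To make the reported symptom concrete, here is a minimal sketch of how duplicate reads can arise when several readers each assume they are the only one. It is an illustration under stated assumptions, not a trace of OneFlow's reader logic: `PartsForReader`, `rank`, and `world_size` are hypothetical names, and round-robin assignment is assumed only for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a OneFlow API): returns the data-part indices
// assigned to one reader under a simple round-robin split.
inline std::vector<std::size_t> PartsForReader(std::size_t num_parts,
                                               std::size_t rank,
                                               std::size_t world_size) {
  assert(world_size > 0 && rank < world_size);
  std::vector<std::size_t> parts;
  for (std::size_t p = rank; p < num_parts; p += world_size) {
    parts.push_back(p);
  }
  return parts;
}
```

With four parts and four cooperating readers, each reader gets one distinct part. If every reader instead behaves as rank 0 of a world of size 1, each of them gets all four parts, which matches the "duplicate data" symptom in spirit.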

@@ -60,7 +61,7 @@ class OFRecordDataset final : public Dataset<TensorBuffer> {
 auto nd_sbp_str_vec = ctx->Attr<std::vector<std::string>>("nd_sbp");
 // NOTE(zwx): OFRecordDataset is not consistent since attr nd_sbp is empty,
 // we assume that it works in DDP
-if (nd_sbp_str_vec.empty()) { is_local = true; }
+if (nd_sbp_str_vec.empty() && CHECK_JUST(GlobalMultiClientEnv())) { is_local = true; }
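The effect of the changed condition can be sketched as two small predicates. This is an illustrative reduction of the diff, not the actual OneFlow code: `IsLocalBefore` and `IsLocalAfter` are hypothetical names, and the `is_multi_client` parameter stands in for `CHECK_JUST(GlobalMultiClientEnv())`.

```cpp
#include <string>
#include <vector>

// Old condition: an empty nd_sbp alone marks the reader as local (DDP).
inline bool IsLocalBefore(const std::vector<std::string>& nd_sbp) {
  return nd_sbp.empty();
}

// New condition: an empty nd_sbp only implies local mode in a multi-client
// environment. `is_multi_client` is an assumed stand-in for
// CHECK_JUST(GlobalMultiClientEnv()).
inline bool IsLocalAfter(const std::vector<std::string>& nd_sbp,
                         bool is_multi_client) {
  return nd_sbp.empty() && is_multi_client;
}
```

The two predicates differ only for the single-client case with an empty `nd_sbp`: the old one returns true there, the new one returns false.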
Contributor

OFRecordImageClassificationDataReader has not been migrated to Multi-Client yet, so if this condition now checks for Multi-Client, can the special case on line 60, if (ctx->op_type_name() == "OFRecordReader"), be removed?

Contributor Author

The main problem is that OFRecordImageClassificationDataReader would fail on the get attr nd_sbp call on line 61, because that attr does not exist for it.
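A minimal sketch of why the op-type check still has to guard the attribute lookup. `FakeOpContext` and `IsLocal` are hypothetical stand-ins written for this illustration; the real kernel context class and its error behavior are not reproduced here, and throwing on a missing attribute is an assumption.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a kernel init context; not OneFlow's real class.
struct FakeOpContext {
  std::string op_type_name;
  std::map<std::string, std::vector<std::string>> attrs;

  // Assumed behavior: looking up an attribute the op does not define fails.
  const std::vector<std::string>& Attr(const std::string& name) const {
    auto it = attrs.find(name);
    if (it == attrs.end()) {
      throw std::out_of_range("attr not found: " + name);
    }
    return it->second;
  }
};

// Query nd_sbp only on the op type that is known to define it.
inline bool IsLocal(const FakeOpContext& ctx, bool is_multi_client) {
  if (ctx.op_type_name != "OFRecordReader") { return false; }
  return ctx.Attr("nd_sbp").empty() && is_multi_client;
}
```

Without the early return, the lookup itself would fail for an op that has no `nd_sbp` attribute, regardless of how the multi-client check evaluates.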

CPFLAME commented Sep 15, 2021

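Verification result: lazy multi-GPU data reading has been fixed.

For the behavior observed before the fix, see the bert alignment progress issue.
I ran four-GPU bert training in both lazy and graph mode on two datasets:

  • 1. diffpart: contains four different data parts
  • 2. shuffle: contains the same four data parts, but in a different order

The data distribution is shown in the figure below: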
image

Experiment results: the losses are basically aligned.
image

Contributor

lixinqi commented Sep 15, 2021

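We special-case things internally in many places, which is actually quite dangerous. It is better to put the if at a relatively high level, since that reduces cyclomatic complexity.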
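One possible shape for that suggestion, sketched under assumptions: the mode is resolved once by the caller and handed to the dataset as an explicit parameter, so the dataset has a single branch point. `ReaderMode`, `ResolveReaderMode`, and `SketchDataset` are hypothetical names for illustration and do not exist in OneFlow.

```cpp
#include <string>
#include <vector>

enum class ReaderMode { kLocal, kConsistent };  // hypothetical names

// Decide once, at the higher-level call site that knows the op type,
// its nd_sbp attribute (if any), and the environment.
inline ReaderMode ResolveReaderMode(const std::string& op_type_name,
                                    const std::vector<std::string>& nd_sbp,
                                    bool is_multi_client) {
  if (op_type_name == "OFRecordReader" && nd_sbp.empty() && is_multi_client) {
    return ReaderMode::kLocal;
  }
  return ReaderMode::kConsistent;
}

// The dataset no longer inspects op types or environment flags itself.
class SketchDataset {
 public:
  explicit SketchDataset(ReaderMode mode) : mode_(mode) {}
  bool is_local() const { return mode_ == ReaderMode::kLocal; }

 private:
  ReaderMode mode_;
};
```

The trade-off is that the caller must gather the inputs up front; in return, the special cases live in one function that can be tested in isolation.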

@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.6ms (= 6377.9ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 140.3ms (= 7016.8ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 140.3ms / 127.6ms)

OneFlow resnet50 time: 74.1ms (= 3706.5ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.0ms (= 4201.4ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 84.0ms / 74.1ms)

OneFlow resnet50 time: 47.2ms (= 2362.4ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.4ms (= 2921.7ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.24 (= 58.4ms / 47.2ms)

OneFlow resnet50 time: 39.5ms (= 1974.5ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 48.2ms (= 2412.1ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.22 (= 48.2ms / 39.5ms)

OneFlow resnet50 time: 44.2ms (= 2209.7ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 38.3ms (= 1917.1ms / 50, input_shape=[1, 3, 224, 224])
❌ Relative speed: 0.87 (= 38.3ms / 44.2ms)

OneFlow resnet50 time: 151.0ms (= 7551.3ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 155.3ms (= 7763.2ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.03 (= 155.3ms / 151.0ms)

OneFlow resnet50 time: 97.5ms (= 4874.9ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 99.3ms (= 4963.1ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.02 (= 99.3ms / 97.5ms)

OneFlow resnet50 time: 78.0ms (= 3899.0ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 81.8ms (= 4090.7ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.05 (= 81.8ms / 78.0ms)

OneFlow resnet50 time: 75.4ms (= 3772.3ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.0ms (= 3650.9ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.97 (= 73.0ms / 75.4ms)

OneFlow resnet50 time: 70.8ms (= 3539.4ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 58.8ms (= 2942.4ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.83 (= 58.8ms / 70.8ms)

@oneflow-ci-bot oneflow-ci-bot merged commit 94d8b21 into master Sep 15, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the fix_reader_legacy_parallel branch September 15, 2021 10:51