-
Notifications
You must be signed in to change notification settings - Fork 803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix single-client reader parallel #6288
Conversation
@@ -60,7 +61,7 @@ class OFRecordDataset final : public Dataset<TensorBuffer> { | |||
auto nd_sbp_str_vec = ctx->Attr<std::vector<std::string>>("nd_sbp"); | |||
// NOTE(zwx): OFRecordDataset is not consistent since attr nd_sbp is empty, | |||
// we assume that it works in DDP | |||
if (nd_sbp_str_vec.empty()) { is_local = true; } | |||
if (nd_sbp_str_vec.empty() && CHECK_JUST(GlobalMultiClientEnv())) { is_local = true; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OFRecordImageClassificationDataReader 还未迁移到 Multi-Client 下,所以如果这里判断了 is Multi-Client,60 行:if (ctx->op_type_name() == "OFRecordReader")
的特判就可以删掉了?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
主要是 OFRecordImageClassificationDataReader 调用 61 行的 get attr nd_sbp
就会报错了,因为不存在这个 attr。
验证结果:lazy多卡数据读取已被修复修复之前的实验现象见:bert对齐进展issue
|
我们有很多地方都在内部特判,其实很危险。最好是比较高层的位置做if,这样能降低圈复杂度 |
Speed stats:
|
根据 @ouyangyu 反馈,#6222 会引起 single-client 下,Benchmark 中的 resnet50 在 fp16 下精度下降。
根据 @CPFLAME 反馈,lazy 多卡数据读取有问题,会读取重复的数据。
推测是这里 lazy 的某些用法导致 data reader op 会走到 ddp 的分支里面去。