
refine device-related code #9791

Merged
merged 21 commits into master from refine_device
Feb 1, 2023

Conversation

Contributor

@daquexian daquexian commented Jan 29, 2023

The DTR design plans to use different devices to distinguish tensors with recomputation enabled from those without (the same approach as torch/xla). While implementing this, I found several places where the device-related code can be improved:

  1. device cannot be used as an op attr; in master this is worked around by setting two separate attrs, device_type and device_id, which gives rise to a lot of needless code:

    Device a;
    op.SetAttr(a.device_type(), a.device_id());
    Device b = Device::New(op.attr("device_type"), op.attr("device_id"));
    // Device b = a; would obviously be simpler
    inline Maybe<bool> device_equal(const std::string& device_name, const int device_id,
                                    Symbol<Device> device) {
        return (device_name == device->type() && device_id == device->device_id());
    }
    Device a;
    op.SetAttr(a.device_type(), a.device_id());
    if (device_equal(op.attr("device_type"), op.attr("device_id"), b))
    // if (a == b) would obviously be simpler

    This redundant code also causes extra churn whenever a new parameter is added to the Device class.

  2. Some places use Optional::value_or incorrectly, e.g.

    auto device =
          device_.has_value() ? device_.value_or(Symbol<Device>()) : JUST(input->device());
  3. Some naming issues, e.g. ParsingDeviceTag does not start with a verb (renamed to ParseDeviceTag), and Device::ThreadLocalGetOrNew and Device::New do the same thing while giving "New" conflicting meanings (Device::ThreadLocalGetOrNew has been removed)

  4. operator== and operator!= duplicate each other's logic

@@ -62,37 +58,36 @@ Maybe<void> Device::Init() {
}

/* static */ Maybe<Symbol<Device>> Device::New(const std::string& type, int64_t device_id) {
-  return ThreadLocalGetOrNew(type, device_id);
+  return ThreadLocalGetOrCreate(type, device_id);
Contributor

Device::New had better construct a new device directly instead of calling ThreadLocalGetOrCreate.

Contributor Author

What is the reason? One problem with changing it that way is incompatibility with Symbol<Device>: a symbol is a globally unique, read-only object.

Contributor
@hjchen2 hjchen2 Jan 29, 2023

There is no incompatibility: after constructing the device, just call SymbolOf(device). New would not cache the symbol, while ThreadLocalGetOrCreate would, which also matches the naming of the two interfaces; otherwise only one of them would need to be kept.

Contributor Author

Update: after discussion, we decided to keep only one of the two interfaces.

}

template<>
void AttrValueAccessor<Symbol<Device>>::Attr(const Symbol<Device>& cpp_val, AttrValue* attr_val) {
Contributor

The second parameter of this interface could be changed to a reference, and renaming it to SetAttr or AssignAttr would be more fitting (though that change is unnecessary in this PR; it can wait).

}
return iter->second;
}

-  /* static */ Maybe<Symbol<Device>> Device::ThreadLocalGetOrNew(
-      const std::string& type_or_type_with_device_id) {
+  /* static */ Maybe<Symbol<Device>> Device::ThreadLocalGetOrCreate(const std::string& str) {
Contributor

The variable name str could be replaced with a more meaningful name.

Contributor Author

OK, changed it to device_str, which aligns with PyTorch's naming: https://pytorch.org/cppdocs/api/structc10_1_1_device.html#_CPPv4N3c106Device6DeviceERKNSt6stringE . The ParseDeviceTag method was correspondingly renamed to ParseDeviceString.

@github-actions

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@daquexian daquexian requested review from oneflow-ci-bot and removed request for oneflow-ci-bot January 29, 2023 09:48
@github-actions

Static analysis with clang failed. PR label automerge has been removed

@github-actions

Speed stats:

@github-actions

CI failed when running job: cuda-speed-test. PR label automerge has been removed

@github-actions

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@daquexian daquexian requested review from oneflow-ci-bot and removed request for oneflow-ci-bot January 30, 2023 11:41
@github-actions

Static analysis with clang failed. PR label automerge has been removed

@github-actions

Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 140.1ms (= 14007.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 163.3ms (= 16331.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 163.3ms / 140.1ms)

OneFlow resnet50 time: 85.9ms (= 8587.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 112.0ms (= 11201.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 112.0ms / 85.9ms)

OneFlow resnet50 time: 58.4ms (= 11678.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 81.3ms (= 16259.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.39 (= 81.3ms / 58.4ms)

OneFlow resnet50 time: 44.0ms (= 8807.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.4ms (= 14277.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.62 (= 71.4ms / 44.0ms)

OneFlow resnet50 time: 40.4ms (= 8080.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.9ms (= 15575.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.93 (= 77.9ms / 40.4ms)

@github-actions

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9791/

@github-actions

github-actions bot commented Feb 1, 2023

Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 141.3ms (= 14134.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 168.2ms (= 16817.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 168.2ms / 141.3ms)

OneFlow resnet50 time: 87.0ms (= 8699.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 105.8ms (= 10582.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 105.8ms / 87.0ms)

OneFlow resnet50 time: 59.6ms (= 11921.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.1ms (= 15823.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 79.1ms / 59.6ms)

OneFlow resnet50 time: 45.6ms (= 9128.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.9ms (= 13972.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.53 (= 69.9ms / 45.6ms)

OneFlow resnet50 time: 42.6ms (= 8528.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.4ms (= 14278.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.67 (= 71.4ms / 42.6ms)

@github-actions

github-actions bot commented Feb 1, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9791/

@mergify mergify bot merged commit 4bfef84 into master Feb 1, 2023
@mergify mergify bot deleted the refine_device branch February 1, 2023 13:48
mergify bot added a commit that referenced this pull request Apr 3, 2023
The core logic:
1. Use different devices to distinguish tensors that support recomputation from those that do not
2. remat::Allocator implements the logic of picking the lowest-cost tensor and evicting it (this is where the memory-layout and eviction-strategy optimizations live)
3. OpCallInstructionUtil::Compute implements the logic of recomputing tensors that are needed again but have already been evicted

Everything else is peripheral changes.

Usage:

```python
x1 = flow.ones(3).to('cuda+remat') # move to a device that supports recomputation
x2 = flow.ones(3).to('cuda') # move to a device that does not support recomputation
x3 = x1 + x2 # error: devices differ

# -----

model = ResNet50()
model.to('cuda+remat')
data, label = dataloader()
data, label = data.to('cuda+remat'), label.to('cuda+remat')
loss = model(data) # if GPU memory fills up along the way, some tensors are evicted automatically
loss.backward() # if evicted tensors are needed again, they are recomputed automatically

```

----

Some of the more general changes were already merged in preceding PRs:
* #9698
* #9791
* #9850
* #9851

---------

Signed-off-by: daquexian <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <[email protected]>
Co-authored-by: Peihong Liu <[email protected]>