refine offload test #9974

Merged 2 commits into master on Mar 10, 2023

Conversation

@strint (Contributor) commented on Mar 10, 2023

Related issue: #9971

layer_list.append(nn.Linear(768, 4096))
# Big enough to see the memory change
layer_list.append(nn.Linear(4096, 4096))
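
The pattern such a test checks can be sketched with a PyTorch-style analogy (an illustrative assumption, not the actual OneFlow test code or API): offload the module and compare the allocated device memory before and after. As the discussion below explains, in OneFlow the visible change also depends on whole allocator Blocks becoming free, which is why the layers need to be this large.

import torch
import torch.nn as nn

# Hypothetical PyTorch-style sketch of an offload memory check (assumption,
# not the actual OneFlow test): offload the module and compare allocated
# device memory before and after.
model = nn.Sequential(
    nn.Linear(768, 4096),
    nn.Linear(4096, 4096),  # big enough for the memory change to be visible
).to("cuda")

before = torch.cuda.memory_allocated()
model.to("cpu")                        # offload parameters to host memory
after = torch.cuda.memory_allocated()
assert after < before                  # allocated device memory should drop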

@lixiang007666 (Contributor) commented on Mar 10, 2023

The large and the small buffer do not seem to differ that much? Do you mean that nn.Linear(768, 4096) cannot be offloaded?

@strint (Contributor Author) commented on Mar 10, 2023

If the tensor is too small, the CUDA memory does not change after offload and load.

@lixiang007666 (Contributor) commented on Mar 10, 2023

Oh, I see. What we tested before were all shapes like 1024 x 1024 x 1024.

@strint (Contributor Author) replied:

> If the tensor is too small, the CUDA memory does not change after offload and load.

This is related to the BinAllocator implementation: if a Block is not completely empty, it is not released. A Block holds one or more Pieces, and a Piece is at least 512 bytes.

So if what is currently freed is not enough to leave a fully free Block, the CachingAllocator cannot release that part of the cache.

As for how large a Block can be, we need @chengtbf to help explain.

A Contributor replied:

Roughly 20 MB.
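
A quick back-of-the-envelope check (assuming fp32 parameters and the roughly 20 MB Block size mentioned above; the layer shapes and Block size come from this thread, the arithmetic is only illustrative) shows why only the larger layer reliably frees whole allocator Blocks when it is offloaded:

# Parameter sizes of the two layers vs. an assumed ~20 MB allocator Block.
def param_bytes(in_features, out_features, dtype_bytes=4):
    # weight (out_features x in_features) + bias (out_features), fp32
    return (out_features * in_features + out_features) * dtype_bytes

BLOCK_BYTES = 20 * 1024 * 1024      # ~20 MB Block, per the reply above

small = param_bytes(768, 4096)      # ~12 MB  -> ~0.6 of one Block
large = param_bytes(4096, 4096)     # ~64 MB  -> ~3.2 Blocks

print(small / 2**20, small / BLOCK_BYTES)
print(large / 2**20, large / BLOCK_BYTES)

So offloading nn.Linear(768, 4096) may not leave any Block completely empty, while nn.Linear(4096, 4096) spans several Blocks that do become free, which makes the CUDA memory change observable.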

@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.5ms (= 14151.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 145.8ms (= 14584.6ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.03 (= 145.8ms / 141.5ms)

OneFlow resnet50 time: 84.3ms (= 8426.0ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 88.6ms (= 8862.0ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.05 (= 88.6ms / 84.3ms)

OneFlow resnet50 time: 51.6ms (= 10318.5ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 60.3ms (= 12065.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.17 (= 60.3ms / 51.6ms)

OneFlow resnet50 time: 34.1ms (= 6822.3ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 43.6ms (= 8713.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.28 (= 43.6ms / 34.1ms)

OneFlow resnet50 time: 27.1ms (= 5410.3ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.2ms (= 7840.6ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.45 (= 39.2ms / 27.1ms)

OneFlow swin dataloader time: 0.239s (= 47.895s / 200, num_workers=1)
PyTorch swin dataloader time: 0.148s (= 29.697s / 200, num_workers=1)
Relative speed: 0.620 (= 0.148s / 0.239s)

OneFlow swin dataloader time: 0.066s (= 13.191s / 200, num_workers=4)
PyTorch swin dataloader time: 0.043s (= 8.625s / 200, num_workers=4)
Relative speed: 0.654 (= 0.043s / 0.066s)

OneFlow swin dataloader time: 0.040s (= 8.051s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.463s / 200, num_workers=8)
Relative speed: 0.554 (= 0.022s / 0.040s)

❌ OneFlow resnet50 time: 155.3ms (= 15527.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 169.9ms (= 16990.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.09 (= 169.9ms / 155.3ms)

OneFlow resnet50 time: 94.6ms (= 9457.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 104.5ms (= 10448.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.10 (= 104.5ms / 94.6ms)

OneFlow resnet50 time: 61.9ms (= 12385.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.3ms (= 15862.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 79.3ms / 61.9ms)

OneFlow resnet50 time: 43.5ms (= 8699.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.1ms (= 13815.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.59 (= 69.1ms / 43.5ms)

OneFlow resnet50 time: 36.8ms (= 7364.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.2ms (= 14633.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.99 (= 73.2ms / 36.8ms)

@github-actions

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9974/

mergify bot merged commit 823e27e into master on Mar 10, 2023
mergify bot deleted the refine_tensor_offload branch on March 10, 2023 at 11:40