Improve the performance and suitable for NPU computing #9642
Conversation
@sayakpaul Please refer to this one, thanks!
Thanks! Just a couple of comments.
@@ -540,7 +541,7 @@ def compute_vae_encodings(batch, vae):
     with torch.no_grad():
         model_input = vae.encode(pixel_values).latent_dist.sample()
     model_input = model_input * vae.config.scaling_factor
-    return {"model_input": model_input.cpu()}
+    return {"model_input": accelerator.gather(model_input)}
Why do we need this?
By using the accelerator, the communication time can be reduced.
I think the reason may be that pixel_values is already on vae.device (the accelerator). Therefore, with this change, the accelerator can distribute the work and reduce the time cost.
But isn't an all-gather a more expensive op?
In fact, I tested three different approaches:

1. accelerator.gather(model_input): average FPS 29.13, training duration 530
2. model_input.to(accelerator.device): average FPS 27.41, training duration 544
3. the original model_input.cpu(): average FPS 28.56, training duration 537

Overall, on the same hardware, the FPS increases a little with accelerator.gather. I ran accelerator.gather and model_input.cpu() multiple times, and the average FPS with accelerator.gather was consistently higher than with model_input.cpu().
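For reference, a minimal sketch of how such variants might be timed (illustrative only, not the benchmark actually used for the numbers above; the shapes and iteration count are arbitrary):

```python
import time

import torch
from accelerate import Accelerator

accelerator = Accelerator()
# Dummy latents standing in for the VAE encoder output.
model_input = torch.randn(4, 4, 128, 128, device=accelerator.device)

variants = {
    "gather": lambda t: accelerator.gather(t),
    "to_device": lambda t: t.to(accelerator.device),
    "cpu": lambda t: t.cpu(),
}

for name, fn in variants.items():
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # flush pending device work before timing
    start = time.perf_counter()
    for _ in range(100):
        fn(model_input)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print(f"{name}: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms/iter")
```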
Hmm, thanks! Since the performance improvement seems to be minor, do you think it makes sense to not change this?
Makes sense, I will change it back to .cpu().
But I welcome you to also add a note on your findings about accelerator.gather() so that users are aware. I think that'd still be quite valuable.
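For instance, such a note could be an inline comment next to the return (suggested wording, not the PR's final text):

```python
# NOTE: returning accelerator.gather(model_input) instead of model_input.cpu()
# gave a slightly higher average FPS in one user's tests (~29.1 vs ~28.6),
# but the gain is minor and an all-gather is a heavier collective op,
# so the original .cpu() is kept here.
return {"model_input": model_input.cpu()}
```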
@@ -1091,8 +1095,7 @@ def compute_time_ids(original_size, crops_coords_top_left):
     # Adapted from pipeline.StableDiffusionXLPipeline._get_add_time_ids
     target_size = (args.resolution, args.resolution)
     add_time_ids = list(original_size + crops_coords_top_left + target_size)
-    add_time_ids = torch.tensor([add_time_ids])
-    add_time_ids = add_time_ids.to(accelerator.device, dtype=weight_dtype)
+    add_time_ids = torch.tensor([add_time_ids], device=accelerator.device, dtype=weight_dtype)
Nice, this makes sense!
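Creating the tensor directly on the target device skips an extra CPU allocation and host-to-device copy per call. A minimal illustration (the values are arbitrary SDXL-style size/crop coordinates, and "cuda" stands in for accelerator.device):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Two steps: allocate on the CPU, then copy host-to-device.
add_time_ids = torch.tensor([[1024, 1024, 0, 0, 1024, 1024]])
add_time_ids = add_time_ids.to(device, dtype=torch.float16)

# One step: allocate directly on the target device, skipping the host copy.
add_time_ids = torch.tensor(
    [[1024, 1024, 0, 0, 1024, 1024]], device=device, dtype=torch.float16
)
```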
@sayakpaul I've changed the code based on your suggestions, thanks!
Thanks!
@sayakpaul Is there anything I need to change to merge this PR? I saw something went wrong with the 'Build PR Documentation' check, but I don't think I changed that section.
The code quality check needs to pass.
@sayakpaul I couldn't see which line needs to be changed to pass.
Can you open https://github.com/huggingface/diffusers/actions/runs/11319895166/job/31477993314?pr=9642 and follow the instructions in the logs?
@sayakpaul It didn't show the specific problem, but I figured it was a formatting issue with my comment on line 544. I've changed it. Thanks!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for your contributions!
* Improve the performance and suitable for NPU
* Improve the performance and suitable for NPU computing
* Improve the performance and suitable for NPU
* Improve the performance and suitable for NPU
* Improve the performance and suitable for NPU
* Improve the performance and suitable for NPU

---------

Co-authored-by: 蒋硕 <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>
What does this PR do?
Improves training performance (FPS) and makes the script suitable for NPU computing.
Selects the appropriate free-memory call for CUDA or NPU.
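A minimal sketch of what such a selection could look like (illustrative, not necessarily the PR's exact code; it assumes the torch_npu package, which exposes a torch.cuda-like API under torch.npu on Ascend hardware):

```python
import gc

import torch

def free_memory():
    """Device-agnostic cache cleanup: CUDA if present, otherwise NPU."""
    gc.collect()  # drop Python-side references before clearing the cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif hasattr(torch, "npu") and torch.npu.is_available():
        # Available when torch_npu is installed; mirrors the torch.cuda API.
        torch.npu.empty_cache()
```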
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.