
Trying to support Ruyi-Mini-7B #18

Closed
cellzero opened this issue Jan 3, 2025 · 16 comments

Comments

cellzero commented Jan 3, 2025

Great work! I can't wait to use TeaCache to accelerate diffusion models.

I'm currently trying to integrate TeaCache with Ruyi-Models. However, I think I might have made a mistake, as I'm not getting a good L1 difference visualization like the one in the paper. Here are the results I obtained.

[Image: Ruyi_L1_Visualization]

  • Time Embs is the original timestep embeddings.
  • Time With Conditions is the timestep embeddings with condition embeddings (such as text and image) added.
  • Time Modulated Inputs is the values after transformer.block[0].norm1.
  • Transformer Inputs is the input to the transformer.
  • Transformer Outputs is the output right after all transformer blocks, before the final layer.
  • Transformer Final Outputs is the final output of the transformer.

I've tried several timestep embeddings and model outputs, but it seems that the timestep embeddings don't have a strong correlation with Ruyi's model output. Could you help me identify which timestep embedding and model output should be used to achieve better correlation? Thank you!
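
For reference, features like these can be captured with forward hooks, roughly along these lines (a simplified sketch; only the transformer.block[0].norm1 path comes from the description above, the other names and the assumption that the transformer's first positional input is the hidden_states are placeholders):

```python
def attach_capture_hooks(transformer, storage):
    # Sketch: collect per-step tensors for the curves above into `storage` (a dict of lists).
    # Assumes the transformer's first positional input is the hidden_states (not verified).
    def save(name, tensor):
        storage.setdefault(name, []).append(tensor.detach().float().cpu())

    handles = [
        # values right after transformer.block[0].norm1, as described above
        transformer.block[0].norm1.register_forward_hook(
            lambda mod, inp, out: save("Time Modulated Inputs", out)
        ),
        transformer.register_forward_pre_hook(
            lambda mod, inp: save("Transformer Inputs", inp[0])
        ),
    ]
    return handles  # call h.remove() on each handle when finished
```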

LiewFeng (Collaborator) commented Jan 3, 2025

Thank you for your interest in our work.

  1. It seems that Ruyi-Models uses a transformer similar to HunyuanVideo's. You may try to leverage the coefficients in TeaCache4HunyuanVideo.
  2. You may also refer to the implementation of teacache_forward to select the features used to draw the plot (and to calculate the coefficients), e.g., modulated_inp and residual_output; see the sketch below.

Looking forward to your feedback and PR to support Ruyi-Models.
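
Roughly, the decision logic in teacache_forward looks like the following (a paraphrased sketch, not the exact upstream code; coeffs, threshold, and state are placeholder names):

```python
import numpy as np

def should_calc(state, modulated_inp, coeffs, threshold):
    # Paraphrased sketch: accumulate the rescaled relative L1 distance of the
    # modulated input between steps; recompute the blocks only when the
    # accumulated value exceeds the threshold, otherwise reuse the cached residual.
    if state.get("previous_modulated_input") is None:
        state["accumulated"] = 0.0
        calc = True
    else:
        prev = state["previous_modulated_input"]
        rel_l1 = ((modulated_inp - prev).abs().mean() / prev.abs().mean()).item()
        state["accumulated"] = state.get("accumulated", 0.0) + np.polyval(coeffs, rel_l1)
        calc = state["accumulated"] >= threshold
        if calc:
            state["accumulated"] = 0.0
    state["previous_modulated_input"] = modulated_inp.detach()
    return calc

# In the denoising loop (still a sketch):
#   if should_calc(state, modulated_inp, coeffs, threshold):
#       out = run_transformer_blocks(hidden_states, ...)        # full compute
#       state["previous_residual"] = out - hidden_states        # cache the residual
#       hidden_states = out
#   else:
#       hidden_states = hidden_states + state["previous_residual"]  # reuse cache
```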

cellzero (Author) commented Jan 3, 2025

Yes, I have referred to the implementation of TeaCache4HunyuanVideo, but the model structure is different, so I'm uncertain whether the timestep embeddings and model outputs I used are appropriate.

Regarding the results shown above, I think they may not demonstrate a strong correlation. Or is the correlation presented above acceptable for using TeaCache?

LiewFeng (Collaborator) commented Jan 3, 2025

  1. The output for visualization is the residual output, i.e., (output hidden_states - input hidden_states), as shown in the link in the last reply.
  2. Using the normed output hidden_states in Ruyi-Models is suggested. You may also try the output hidden_states before normalization.
  3. Time Embs shows a decent correlation. You can reduce the estimation error with rescaling.
  4. Make sure you are using the relative L1 distance instead of the plain L1 distance.
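
In code terms, the residuals in points 1 and 2 are roughly (a minimal sketch with hypothetical variable names):

```python
def residuals(hidden_in, hidden_out, hidden_normed):
    # hidden_in:     hidden_states entering the first transformer block
    # hidden_out:    hidden_states right after the last transformer block
    # hidden_normed: hidden_states after the final norm, before the output projection
    residual_output = hidden_out - hidden_in           # point 1
    residual_norm_output = hidden_normed - hidden_in   # point 2 (suggested variant)
    return residual_output, residual_norm_output
```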

@KivenJonathan

>   1. The output for visualization is the residual output, i.e., (output hidden_states - input hidden_states), as shown in the link in the last reply.
>   2. Using the normed output hidden_states in Ruyi-Models is suggested. You may also try the output hidden_states before normalization.
>   3. Time Embs shows a decent correlation. You can reduce the estimation error with rescaling.
>   4. Make sure you are using the relative L1 distance instead of the plain L1 distance.

Hi, thank you for your work on this project!

I have a question regarding the computation of the coefficients. If the coefficients are computed from the normed output hidden_states (using the relative L1 residue between steps) together with the time-modulated inputs (the relative L1 of the values after the first block's norm1), does this imply that the cached residual output should be updated based on the normed output, rather than on the transformer outputs before the final layer?

Additionally, are there any suggested metrics or checks that could help verify that the coefficients are computed correctly and derived from the appropriate modulated input and output of the transformer blocks?

LiewFeng (Collaborator) commented Jan 3, 2025

Hi @KivenJonathan, thank you for your interest in our work. I don't quite get your first point: what's the difference between the 'normed output' and the 'transformer outputs before the final layer'? In my understanding, they are the same.

You can plot with the rescaled data to check it.
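
For example, one way to do that check (a sketch; it assumes you already have the two per-step distance lists and fitted polynomial coefficients):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_rescaled_check(input_dists, output_dists, coeffs):
    # input_dists:  per-step relative L1 distances of the modulated inputs
    # output_dists: per-step relative L1 distances of the residual outputs
    # coeffs:       polynomial coefficients fitted to map the former to the latter
    rescaled = np.polyval(coeffs, np.asarray(input_dists))
    steps = np.arange(len(output_dists))
    plt.plot(steps, output_dists, label="measured residual rel. L1")
    plt.plot(steps, rescaled, label="rescaled estimate from modulated inputs")
    plt.xlabel("denoising step")
    plt.ylabel("relative L1 distance")
    plt.legend()
    plt.show()
```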

cellzero (Author) commented Jan 3, 2025

>   1. The output for visualization is the residual output, i.e., (output hidden_states - input hidden_states), as shown in the link in the last reply.
>   2. Using the normed output hidden_states in Ruyi-Models is suggested. You may also try the output hidden_states before normalization.
>   3. Time Embs shows a decent correlation. You can reduce the estimation error with rescaling.
>   4. Make sure you are using the relative L1 distance instead of the plain L1 distance.

Oh, I see where I made a mistake earlier. I used the wrong input and output values for visualization, and I've corrected that now. However, the visualization still doesn't seem particularly ideal.

[Image: l1_rel_distances]

In the visualization, I used the hidden_states before any blocks as the input, the hidden_states after all the blocks minus the input as the residual output, and the normed hidden_states minus the input as the residual norm output.

I've searched extensively and made several attempts, but sadly I still haven't identified the issue.

Here is the code for calculating the L1 Rel Distance:

```python
import torch

def l1_rel_distance(tensor1, tensor2):
    # mean absolute difference, normalized by the mean magnitude of tensor1
    l1_distance = torch.abs(tensor1 - tensor2).mean()
    norm = torch.abs(tensor1).mean()
    relative_l1_distance = l1_distance / norm
    return relative_l1_distance.to(torch.float32)
```

I plan to continue experimenting with different inputs to see if there are any changes.

LiewFeng (Collaborator) commented Jan 4, 2025

According to the visualization, you may use 'Time with Conditions'. It's hard to get them exactly equal; it's okay as long as they show a similar trend, increasing or decreasing at the same time. Polynomial fitting helps reduce the estimation error.
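
The fit itself can be as simple as the following (a sketch; the degree is an assumption to tune against your collected data):

```python
import numpy as np

def fit_rescaling(modulated_input_dists, residual_output_dists, deg=4):
    # Fit a polynomial that maps the cheap-to-compute distances (e.g. Time with
    # Conditions or Time Modulated Inputs) to the residual-output distances
    # collected over many prompts. deg=4 is an assumption; tune it.
    x = np.asarray(modulated_input_dists, dtype=np.float64)
    y = np.asarray(residual_output_dists, dtype=np.float64)
    return np.polyfit(x, y, deg)

# usage: coeffs = fit_rescaling(x_dists, y_dists); estimate = np.polyval(coeffs, d)
```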

LiewFeng (Collaborator) commented Jan 4, 2025

By the way, tensor1 should be the feature from the previous timestep and tensor2 the feature from the current timestep.

cellzero (Author) commented Jan 4, 2025

Thank you for your quick reply. I've generated some additional visualizations based on different inputs, and the trends appear to be similar.

Interestingly, both the residual output and the residual norm output tend to fluctuate during the first few steps, making it difficult to align them with the timestep embeddings. Perhaps I could try forcing the first few steps to run without caching. Anyway, I think I should try polynomial fitting first.

As for tensor1 and tensor2, I used a for step in range(1, 25) loop, where tensor1 corresponds to the step value and tensor2 to the step + 1 value. I think this is essentially the same thing for visualization purposes.
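
i.e., roughly the following (a simplified sketch; `features` stands for the list of per-step tensors of one curve):

```python
def step_distances(features):
    # features[i] is the tensor captured at denoising step i
    distances = []
    for step in range(1, len(features)):
        prev, curr = features[step - 1], features[step]   # tensor1 = previous step
        rel = (curr - prev).abs().mean() / prev.abs().mean()
        distances.append(rel.item())
    return distances
```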

LiewFeng (Collaborator) commented Jan 4, 2025

Sounds good. Looking forward to your final result.

cellzero (Author) commented Jan 9, 2025

Collecting data for polynomial fitting does take a considerable amount of time. So far I have generated about 100 videos and collected the L1 Rel Distances. After performing polynomial fitting, it appears that Time Modulated Inputs and Transformer Residual Norm Output match best.

I then used that value to integrate TeaCache into Ruyi. Sometimes, the generated videos show no obvious differences, while at other times, the videos are acceptable but do exhibit some notable differences. I think this could be a normal occurrence; is that correct?

Additionally, I would like to confirm one more thing. When applying polynomial fitting, the input is the L1 Rel Distance of Time Modulated Inputs, and the output is the L1 Rel Distances of Transformer Residual Norm Output. I hope I haven’t made any mistakes.

Thank you.

LiewFeng (Collaborator) commented Jan 9, 2025

The difference depends on the extent of the speedup. A speedup of less than 1.6x should work well for many models and prompts.

The output can also be the L1 Rel Distance of the Transformer Residual Output before the norm.

LiewFeng (Collaborator) commented Jan 9, 2025

A difference is acceptable if the visual quality doesn't degrade much.

cellzero (Author) commented Jan 9, 2025

Yes, I think the visual quality is almost the same, although there are some differences. I think this might be caused by the inconsistency between the Time Modulated Inputs and the Transformer Residual Norm Output during the first several steps.

I also tried using the L1 relative distances of the Transformer Residual Output before normalization as the polynomial fitting output. It appears that the Time Modulated Inputs and the Transformer Residual Norm Output match better (more closely aligned and with less noise), so I used the latter to test the generated video results.

I expect to finish this work and close this issue by the end of the week if everything goes fine.

Thank you for your help.

@cellzero (Author)

Finally, I have organized the code and submitted it to the Ruyi-Models GitHub repository. TeaCache can now be used directly in Ruyi, and it's great to be able to generate videos faster.

Therefore, I'm wondering whether I should still submit a Pull Request to the TeaCache repository, as it might be somewhat redundant.

@LiewFeng (Collaborator)

Congratulations! It's okay to keep it in Ruyi-Models. I will update the README.
