-
Notifications
You must be signed in to change notification settings - Fork 11k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vulkan: query register count and use it in a better split_k heuristic #12319
base: master
Are you sure you want to change the base?
Conversation
I ran a few general tests on RTX 3090 and found that these changes generally leads to a negative performance delta in the test-backend-ops perf n=512 tests. This isn't by itself an issue, the tests might not be representative.
When testing models with llama-bench, I can see significant positive differences in pp for k-quants, but it seems that this change is quite negative for the legacy quants.
Any idea what's going on? |
Was this with coopmat2 or coopmat1? And was it comparing this PR against master or #12312? I'll try to reproduce it. |
Against master, and coopmat2. I tested only the last PR to check the combined difference, I didn't have time to narrow it down further. |
Looks like the regression is from #12258, which is a surprise because I had tested Q4_0 on a smaller model. I'll do some more testing, but I may just end up restoring the large tile size to what it was. |
I'm going to update #12258 with these tile sizes:
This restores the performance for pp128,256,512, but there are some slowdowns for odd sizes. But those are addressed by the other PRs. Here's what I'm seeing for the Q8_0 model when applied against this PR with all the changes:
|
Use VK_KHR_pipeline_executable_properties to query the register count, and use that to try to better estimate how many workgroups can fit in the SMs. Particularly with recent tile size changes (ggml-org#12258) the old heuristic is out of date. This heuristic benefits both coopmat1 and coopmat2 paths on NVIDIA. Would be good if somebody can hook up the missing details for other hardware. Calling getPipelineExecutableStatisticsKHR required more fully initializing Vulkan-HPP. The steps needed are documented in the Vulkan-HPP readme.
This is stacked on #12312.
Use VK_KHR_pipeline_executable_properties to query the register count, and use that to try to better estimate how many workgroups can fit in the SMs. Particularly with recent tile size changes (#12258) the old heuristic is out of date.
This heuristic benefits both coopmat1 and coopmat2 paths on NVIDIA. Would be good if somebody can hook up the missing details for other hardware.
Calling getPipelineExecutableStatisticsKHR required more fully initializing Vulkan-HPP. The steps needed are documented in the Vulkan-HPP readme.
Results for Phi-3-mini-4k-instruct-q4.gguf on RTX 4070: