vulkan: query register count and use it in a better split_k heuristic #12319

Open · jeffbolznv wants to merge 5 commits into master
Conversation

jeffbolznv (Collaborator)

This is stacked on #12312.

Use VK_KHR_pipeline_executable_properties to query the register count, and use that to better estimate how many workgroups can fit in the SMs. Particularly with the recent tile size changes (#12258), the old heuristic is out of date.

This heuristic benefits both the coopmat1 and coopmat2 paths on NVIDIA. It would be good if somebody could hook up the missing details for other hardware.
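
For illustration, here is a rough sketch of the general shape of such an occupancy estimate. This is not the PR's actual code; the names, the split cap of 4, and the decision to consider only the register limit are assumptions. The idea is that once the per-thread register count is known, the register file bounds how many workgroups fit on one SM, and split_k can be raised until the dispatch has enough workgroups to keep every SM busy.

#include <cstdint>

// Illustrative only: estimate a split_k factor from register-limited occupancy.
static uint32_t estimate_split_k(uint32_t m, uint32_t n,
                                 uint32_t tile_m, uint32_t tile_n,
                                 uint32_t regs_per_thread, uint32_t threads_per_wg,
                                 uint32_t regs_per_sm, uint32_t num_sms) {
    // Workgroups that fit on one SM, limited by the register file.
    // (Shared memory and other occupancy limits are ignored in this sketch.)
    uint32_t wgs_per_sm = regs_per_sm / (regs_per_thread * threads_per_wg);
    if (wgs_per_sm == 0) {
        wgs_per_sm = 1;
    }

    // Workgroups the dispatch produces without splitting along K.
    uint32_t wgs = ((m + tile_m - 1) / tile_m) * ((n + tile_n - 1) / tile_n);

    // Raise split_k until the grid is large enough to occupy every SM;
    // the cap of 4 is purely for illustration.
    uint32_t split_k = 1;
    while (wgs * split_k < wgs_per_sm * num_sms && split_k < 4) {
        split_k *= 2;
    }
    return split_k;
}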

Calling getPipelineExecutableStatisticsKHR required initializing Vulkan-HPP more fully; the steps needed are documented in the Vulkan-HPP readme.
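
As a reference, a minimal sketch of how the register count can be queried through Vulkan-HPP (not the PR's code): it assumes VK_KHR_pipeline_executable_properties was enabled at device creation, that the dynamic dispatcher has been initialized with the instance and device as described in the Vulkan-HPP readme, and that the driver reports a register-count statistic by name, which the spec does not mandate.

#include <vulkan/vulkan.hpp>
#include <cstdint>
#include <string>

// Minimal sketch: scan the pipeline's executable statistics for a
// register-count entry. Requires the Vulkan-HPP dynamic dispatcher
// (VULKAN_HPP_DISPATCH_LOADER_DYNAMIC) so the KHR entry points resolve.
static uint32_t query_register_count(vk::Device device, vk::Pipeline pipeline) {
    uint32_t registers = 0;

    vk::PipelineInfoKHR pipeline_info(pipeline);
    auto executables = device.getPipelineExecutablePropertiesKHR(pipeline_info);

    for (uint32_t i = 0; i < executables.size(); ++i) {
        vk::PipelineExecutableInfoKHR exec_info(pipeline, i);
        auto stats = device.getPipelineExecutableStatisticsKHR(exec_info);

        for (const auto & stat : stats) {
            // Statistic names and formats are driver-defined; matching
            // "register" here is an assumption based on observed NVIDIA
            // driver output, not something guaranteed by the spec.
            if (std::string(stat.name.data()).find("egister") == std::string::npos) {
                continue;
            }
            if (stat.format == vk::PipelineExecutableStatisticFormatKHR::eInt64) {
                registers = (uint32_t) stat.value.i64;
            } else if (stat.format == vk::PipelineExecutableStatisticFormatKHR::eUint64) {
                registers = (uint32_t) stat.value.u64;
            }
        }
    }
    return registers;
}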

Results for Phi-3-mini-4k-instruct-q4.gguf on RTX 4070:

        test	cm2 #12312	cm2 this PR	speedup		cm1 #12312	cm1 this PR	speedup
        pp5	461.03		458.34		-0.58%		461.86		462.7		0.18%
       pp10	544		588.6		8.20%		247		388.9		57.45%
       pp20	1108.14		1276.74		15.21%		511.9		744.11		45.36%
       pp31	1692.87		1828.31		8.00%		749.72		1064.83		42.03%
       pp32	1789.53		1862.36		4.07%		768.62		1110.72		44.51%
       pp33	1460.1		1594.85		9.23%		801.72		1133.75		41.41%
       pp48	2195.93		2293.06		4.42%		1107.75		1591.38		43.66%
       pp54	2323.3		2546.32		9.60%		1224.24		1813.48		48.13%
       pp63	2781.24		2866.53		3.07%		1375.22		1934.54		40.67%
       pp64	2790.49		3080.54		10.39%		1407.9		1996.65		41.82%
       pp65	2308.42		2569.65		11.32%		1464.71		1834.02		25.21%
       pp80	2747.47		3194.38		16.27%		1740.68		2311.23		32.78%
       pp96	3279.18		3620.33		10.40%		2002.82		2545.4		27.09%
      pp112	3753.71		4262.02		13.54%		2232.6		2870.61		28.58%
      pp113	3724.57		4161.34		11.73%		2282.34		2895.59		26.87%
      pp127	4251.1		4604.27		8.31%		2455.08		3053.41		24.37%
      pp128	4272.33		4563.73		6.82%		2522.77		3084.22		22.26%
      pp129	3471.71		3550.09		2.26%		1873.7		2202.67		17.56%
      pp140	3748.07		3765.13		0.46%		2045.32		2370.6		15.90%
      pp160	4114.33		4302.67		4.58%		2270.51		2550.82		12.35%
      pp180	4299.83		4678.4		8.80%		2499.4		2817.29		12.72%
      pp192	4637.4		4865.83		4.93%		2627.38		2942.17		11.98%
      pp200	4176.58		4405.53		5.48%		2695.02		2978.76		10.53%
      pp210	4361.11		4581.34		5.05%		2796.97		3146.76		12.51%
      pp230	4851.5		4832.13		-0.40%		2977.31		3289.08		10.47%
      pp248	5103.4		5250.52		2.88%		3112.93		3438.53		10.46%
      pp255	5207		5293.25		1.66%		3239.44		3444.64		6.33%
      pp256	5202.13		5433.54		4.45%		3255.91		3466.4		6.46%
      pp257	4406.52		4558.17		3.44%		2814.87		2885.3		2.50%
      pp280	4782.45		4718.36		-1.34%		2994.52		3098.21		3.46%
      pp300	4820.17		4906.11		1.78%		3151.48		3230.58		2.51%
      pp320	5082.74		5162.81		1.58%		3281.19		3339.76		1.79%
      pp350	5040.96		5091.24		1.00%		3507.09		3565.77		1.67%
      pp384	5611.65		5407.71		-3.63%		3667		3647.3		-0.54%
      pp410	5182.3		5115.91		-1.28%		3192.26		3307.35		3.61%
      pp448	5471.17		5446.01		-0.46%		3305.81		3493.4		5.67%
      pp480	5280.59		5380.97		1.90%		3451.57		3573.31		3.53%
      pp490	5412.99		5399.52		-0.25%		3492.56		3610.39		3.37%
      pp511	5555.42		5542.55		-0.23%		3527.09		3704.57		5.03%
      pp512	5571.47		5657.05		1.54%		3568.11		3715.19		4.12%
      pp513	5104.33		5161.24		1.11%		3353.9		3518.17		4.90%
      pp767	5359.22		5374.51		0.29%		3344.61		3526.2		5.43%
      pp768	5411.12		5446.9		0.66%		3358.13		3471.79		3.38%
      pp769	5189.31		5126.24		-1.22%		3167.93		3254.48		2.73%
     pp1023	5537.44		5419.65		-2.13%		3371.87		3434.22		1.85%
     pp1024	5542.94		5438.1		-1.89%		3341.31		3446.15		3.14%
     pp1025	5257.3		5283.95		0.51%		3242.54		3351.02		3.35%
     pp2047	5304.31		5303.66		-0.01%		3019.9		3096.13		2.52%
     pp2048	5284.38		5365.05		1.53%		3026.24		3103.33		2.55%
     pp2049	5103.8		5132.72		0.57%		2950.1		3055.99		3.59%

jeffbolznv requested a review from 0cc4m · March 10, 2025
0cc4m (Collaborator) commented Mar 11, 2025

I ran a few general tests on an RTX 3090 and found that these changes generally lead to a negative performance delta in the test-backend-ops perf n=512 tests. This isn't an issue by itself; the tests might not be representative. (Columns: master, this PR, delta.)

MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     47.51 TFLOPS       36.88 TFLOPS      -10.63 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     69.96 TFLOPS       62.94 TFLOPS       -7.02 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    69.45 TFLOPS       56.43 TFLOPS      -13.02 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    61.33 TFLOPS       51.31 TFLOPS      -10.02 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    55.76 TFLOPS       36.64 TFLOPS      -19.12 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    57.58 TFLOPS       37.69 TFLOPS      -19.89 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    69.95 TFLOPS       53.84 TFLOPS      -16.11 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    46.20 TFLOPS       48.31 TFLOPS        2.11 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    43.96 TFLOPS       30.74 TFLOPS      -13.22 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    45.75 TFLOPS       38.23 TFLOPS       -7.52 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    44.46 TFLOPS       33.78 TFLOPS      -10.68 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    42.16 TFLOPS       38.19 TFLOPS       -3.97 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 49.92 TFLOPS       35.66 TFLOPS      -14.26 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  38.29 TFLOPS       36.30 TFLOPS       -1.99 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   48.82 TFLOPS       36.17 TFLOPS      -12.65 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 41.32 TFLOPS       37.70 TFLOPS       -3.62 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   35.51 TFLOPS       27.65 TFLOPS       -7.86 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   23.20 TFLOPS       26.28 TFLOPS        3.08 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  62.19 TFLOPS       43.96 TFLOPS      -18.23 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   47.59 TFLOPS       33.73 TFLOPS      -13.86 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  51.36 TFLOPS       34.00 TFLOPS      -17.36 TFLOPS

When testing models with llama-bench, I can see significant positive differences in pp for k-quants, but it seems that this change is quite negative for the legacy quants.

model                  size      params  backend  ngl  test   t/s (master)      t/s (PR)
llama 8B Q4_0          5.61 GiB  8.03 B   Vulkan   99   pp512  4260.97 ± 46.72   3738.02 ± 49.37
llama 8B Q8_0          7.95 GiB  8.03 B   Vulkan   99   pp512  4229.91 ± 16.62   3695.16 ± 38.69
llama 8B Q4_K - Small  4.36 GiB  8.03 B   Vulkan   99   pp512  3064.17 ± 82.64   3331.21 ± 24.80
llama 7B Q6_K          5.53 GiB  7.24 B   Vulkan   99   pp512  2864.05 ± 6.29    3180.76 ± 7.12

Any idea what's going on?

jeffbolznv (Collaborator, Author)

> When testing models with llama-bench, I can see significant positive differences in pp for k-quants, but it seems that this change is quite negative for the legacy quants

Was this with coopmat2 or coopmat1? And was it comparing this PR against master or #12312? I'll try to reproduce it.

0cc4m (Collaborator) commented Mar 11, 2025

> When testing models with llama-bench, I can see significant positive differences in pp for k-quants, but it seems that this change is quite negative for the legacy quants
>
> Was this with coopmat2 or coopmat1? And was it comparing this PR against master or #12312? I'll try to reproduce it.

Against master, and coopmat2. I tested only the last PR to check the combined difference; I didn't have time to narrow it down further.

jeffbolznv (Collaborator, Author)

Looks like the regression is from #12258, which is a surprise because I had tested Q4_0 on a smaller model. I'll do some more testing, but I may just end up restoring the large tile size to what it was.

jeffbolznv (Collaborator, Author)

I'm going to update #12258 with these tile sizes:

        // spec constants and tile sizes for quant matmul (non-Qi_K)
        l_warptile_mmq = { 256, 128, 256, 64 };
        m_warptile_mmq = { 256, 128, 128, 64 };
        s_warptile_mmq = { 256, 32,  64, 128 };
        l_mmq_wg_denoms = { 128, 256, 1 };
        m_mmq_wg_denoms = { 128, 128, 1 };
        s_mmq_wg_denoms = { 32,  64,  1 };

This restores the performance for pp128/256/512, but there are some slowdowns for odd sizes; those are addressed by the other PRs. Here's what I'm seeing for the Q8_0 model with the new tile sizes applied on top of this PR with all its changes:

	C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf
        test	master		12319 + fixed tile size		speedup
        pp5	169.8		184.7		8.78%
       pp10	169.8		298.75		75.94%
       pp20	341.27		599.83		75.76%
       pp31	510		881.23		72.79%
       pp32	538.67		911.84		69.28%
       pp33	536.7		927.03		72.73%
       pp48	783.03		1335.44		70.55%
       pp54	868		1419.57		63.54%
       pp63	1028.82		1720.97		67.28%
       pp64	1038.35		1915.35		84.46%
       pp65	1038.4		1387.63		33.63%
       pp80	1249.89		1643.02		31.45%
       pp96	1556.72		1931.27		24.06%
      pp112	1709.25		2194.72		28.40%
      pp113	1722.66		2174.12		26.21%
      pp127	1965.74		2392.2		21.69%
      pp128	2262.21		2462.53		8.86%
      pp129	1582.1		1993.86		26.03%
      pp140	1722.35		2121.21		23.16%
      pp160	1942.86		2472.74		27.27%
      pp180	2172		2655.97		22.28%
      pp192	2255.96		2847.36		26.22%
      pp200	2429.77		2965.03		22.03%
      pp210	2432.29		3064.34		25.99%
      pp230	2652.54		3225.7		21.61%
      pp248	2780.31		3459.56		24.43%
      pp255	2896.08		3544.59		22.39%
      pp256	3571.14		3589.26		0.51%
      pp257	2026.1		2572.38		26.96%
      pp280	2191.4		2821.76		28.77%
      pp300	2341.84		2930.23		25.13%
      pp320	2450.23		3070.16		25.30%
      pp350	2639.66		3218.96		21.95%
      pp384	3014.88		3406.29		12.98%
      pp410	2661.47		3288.71		23.57%
      pp448	2841.52		3528.77		24.19%
      pp480	3018.87		3665.8		21.43%
      pp490	3063.8		3742.95		22.17%
      pp511	3140.85		3882.46		23.61%
      pp512	3742.92		3845.86		2.75%
      pp513	3258.42		3353.9		2.93%
      pp767	3388.73		3724.7		9.91%
      pp768	3638.22		3748.72		3.04%
      pp769	2910.19		3284.2		12.85%
     pp1023	3383.05		3826.83		13.12%
     pp1024	3715.64		3803.52		2.37%
     pp1025	3440.35		3569.74		3.76%
     pp2047	3440.01		3720.55		8.16%
     pp2048	3614.77		3715.93		2.80%
     pp2049	3443.32		3556.74		3.29%
