vulkan: query register count and use it in a better split_k heuristic #12319

Open · jeffbolznv wants to merge 5 commits into master
Conversation

jeffbolznv (Collaborator)

This is stacked on #12312.

Use VK_KHR_pipeline_executable_properties to query the register count, and use that to better estimate how many workgroups can fit in the SMs. Particularly with the recent tile size changes (#12258), the old heuristic is out of date.

This heuristic benefits both the coopmat1 and coopmat2 paths on NVIDIA. It would be good if somebody could hook up the missing details for other hardware.
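
For illustration, here is a rough sketch of the general shape of such an occupancy estimate. This is not the PR's actual code; the names, the split cap of 4, and the decision to consider only the register limit are assumptions. The idea is that once the per-thread register count is known, the register file bounds how many workgroups fit on one SM, and split_k can be raised until the dispatch has enough workgroups to keep every SM busy.

#include <cstdint>

// Illustrative only: estimate a split_k factor from register-limited occupancy.
static uint32_t estimate_split_k(uint32_t m, uint32_t n,
                                 uint32_t tile_m, uint32_t tile_n,
                                 uint32_t regs_per_thread, uint32_t threads_per_wg,
                                 uint32_t regs_per_sm, uint32_t num_sms) {
    // Workgroups that fit on one SM, limited by the register file.
    // (Shared memory and other occupancy limits are ignored in this sketch.)
    uint32_t wgs_per_sm = regs_per_sm / (regs_per_thread * threads_per_wg);
    if (wgs_per_sm == 0) {
        wgs_per_sm = 1;
    }

    // Workgroups the dispatch produces without splitting along K.
    uint32_t wgs = ((m + tile_m - 1) / tile_m) * ((n + tile_n - 1) / tile_n);

    // Raise split_k until the grid is large enough to occupy every SM;
    // the cap of 4 is purely for illustration.
    uint32_t split_k = 1;
    while (wgs * split_k < wgs_per_sm * num_sms && split_k < 4) {
        split_k *= 2;
    }
    return split_k;
}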

Calling getPipelineExecutableStatisticsKHR required initializing Vulkan-HPP more fully; the steps needed are documented in the Vulkan-HPP readme.
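
As a reference, a minimal sketch of how the register count can be queried through Vulkan-HPP (not the PR's code): it assumes VK_KHR_pipeline_executable_properties was enabled at device creation, that the dynamic dispatcher has been initialized with the instance and device as described in the Vulkan-HPP readme, and that the driver reports a register-count statistic by name, which the spec does not mandate.

#include <vulkan/vulkan.hpp>
#include <cstdint>
#include <string>

// Minimal sketch: scan the pipeline's executable statistics for a
// register-count entry. Requires the Vulkan-HPP dynamic dispatcher
// (VULKAN_HPP_DISPATCH_LOADER_DYNAMIC) so the KHR entry points resolve.
static uint32_t query_register_count(vk::Device device, vk::Pipeline pipeline) {
    uint32_t registers = 0;

    vk::PipelineInfoKHR pipeline_info(pipeline);
    auto executables = device.getPipelineExecutablePropertiesKHR(pipeline_info);

    for (uint32_t i = 0; i < executables.size(); ++i) {
        vk::PipelineExecutableInfoKHR exec_info(pipeline, i);
        auto stats = device.getPipelineExecutableStatisticsKHR(exec_info);

        for (const auto & stat : stats) {
            // Statistic names and formats are driver-defined; matching
            // "register" here is an assumption based on observed NVIDIA
            // driver output, not something guaranteed by the spec.
            if (std::string(stat.name.data()).find("egister") == std::string::npos) {
                continue;
            }
            if (stat.format == vk::PipelineExecutableStatisticFormatKHR::eInt64) {
                registers = (uint32_t) stat.value.i64;
            } else if (stat.format == vk::PipelineExecutableStatisticFormatKHR::eUint64) {
                registers = (uint32_t) stat.value.u64;
            }
        }
    }
    return registers;
}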

Results for Phi-3-mini-4k-instruct-q4.gguf on RTX 4070:

        test	cm2 #12312	cm2 this PR	speedup		cm1 #12312	cm1 this PR	speedup
        pp5	461.03		458.34		-0.58%		461.86		462.7		0.18%
       pp10	544		588.6		8.20%		247		388.9		57.45%
       pp20	1108.14		1276.74		15.21%		511.9		744.11		45.36%
       pp31	1692.87		1828.31		8.00%		749.72		1064.83		42.03%
       pp32	1789.53		1862.36		4.07%		768.62		1110.72		44.51%
       pp33	1460.1		1594.85		9.23%		801.72		1133.75		41.41%
       pp48	2195.93		2293.06		4.42%		1107.75		1591.38		43.66%
       pp54	2323.3		2546.32		9.60%		1224.24		1813.48		48.13%
       pp63	2781.24		2866.53		3.07%		1375.22		1934.54		40.67%
       pp64	2790.49		3080.54		10.39%		1407.9		1996.65		41.82%
       pp65	2308.42		2569.65		11.32%		1464.71		1834.02		25.21%
       pp80	2747.47		3194.38		16.27%		1740.68		2311.23		32.78%
       pp96	3279.18		3620.33		10.40%		2002.82		2545.4		27.09%
      pp112	3753.71		4262.02		13.54%		2232.6		2870.61		28.58%
      pp113	3724.57		4161.34		11.73%		2282.34		2895.59		26.87%
      pp127	4251.1		4604.27		8.31%		2455.08		3053.41		24.37%
      pp128	4272.33		4563.73		6.82%		2522.77		3084.22		22.26%
      pp129	3471.71		3550.09		2.26%		1873.7		2202.67		17.56%
      pp140	3748.07		3765.13		0.46%		2045.32		2370.6		15.90%
      pp160	4114.33		4302.67		4.58%		2270.51		2550.82		12.35%
      pp180	4299.83		4678.4		8.80%		2499.4		2817.29		12.72%
      pp192	4637.4		4865.83		4.93%		2627.38		2942.17		11.98%
      pp200	4176.58		4405.53		5.48%		2695.02		2978.76		10.53%
      pp210	4361.11		4581.34		5.05%		2796.97		3146.76		12.51%
      pp230	4851.5		4832.13		-0.40%		2977.31		3289.08		10.47%
      pp248	5103.4		5250.52		2.88%		3112.93		3438.53		10.46%
      pp255	5207		5293.25		1.66%		3239.44		3444.64		6.33%
      pp256	5202.13		5433.54		4.45%		3255.91		3466.4		6.46%
      pp257	4406.52		4558.17		3.44%		2814.87		2885.3		2.50%
      pp280	4782.45		4718.36		-1.34%		2994.52		3098.21		3.46%
      pp300	4820.17		4906.11		1.78%		3151.48		3230.58		2.51%
      pp320	5082.74		5162.81		1.58%		3281.19		3339.76		1.79%
      pp350	5040.96		5091.24		1.00%		3507.09		3565.77		1.67%
      pp384	5611.65		5407.71		-3.63%		3667		3647.3		-0.54%
      pp410	5182.3		5115.91		-1.28%		3192.26		3307.35		3.61%
      pp448	5471.17		5446.01		-0.46%		3305.81		3493.4		5.67%
      pp480	5280.59		5380.97		1.90%		3451.57		3573.31		3.53%
      pp490	5412.99		5399.52		-0.25%		3492.56		3610.39		3.37%
      pp511	5555.42		5542.55		-0.23%		3527.09		3704.57		5.03%
      pp512	5571.47		5657.05		1.54%		3568.11		3715.19		4.12%
      pp513	5104.33		5161.24		1.11%		3353.9		3518.17		4.90%
      pp767	5359.22		5374.51		0.29%		3344.61		3526.2		5.43%
      pp768	5411.12		5446.9		0.66%		3358.13		3471.79		3.38%
      pp769	5189.31		5126.24		-1.22%		3167.93		3254.48		2.73%
     pp1023	5537.44		5419.65		-2.13%		3371.87		3434.22		1.85%
     pp1024	5542.94		5438.1		-1.89%		3341.31		3446.15		3.14%
     pp1025	5257.3		5283.95		0.51%		3242.54		3351.02		3.35%
     pp2047	5304.31		5303.66		-0.01%		3019.9		3096.13		2.52%
     pp2048	5284.38		5365.05		1.53%		3026.24		3103.33		2.55%
     pp2049	5103.8		5132.72		0.57%		2950.1		3055.99		3.59%

jeffbolznv requested a review from 0cc4m · March 10, 2025
0cc4m (Collaborator) commented Mar 11, 2025

I ran a few general tests on an RTX 3090 and found that these changes generally lead to a negative performance delta in the test-backend-ops perf n=512 tests. This isn't an issue by itself; the tests might not be representative. (Columns: master, this PR, delta.)

MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     47.51 TFLOPS       36.88 TFLOPS      -10.63 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     69.96 TFLOPS       62.94 TFLOPS       -7.02 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    69.45 TFLOPS       56.43 TFLOPS      -13.02 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    61.33 TFLOPS       51.31 TFLOPS      -10.02 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    55.76 TFLOPS       36.64 TFLOPS      -19.12 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    57.58 TFLOPS       37.69 TFLOPS      -19.89 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    69.95 TFLOPS       53.84 TFLOPS      -16.11 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    46.20 TFLOPS       48.31 TFLOPS        2.11 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    43.96 TFLOPS       30.74 TFLOPS      -13.22 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    45.75 TFLOPS       38.23 TFLOPS       -7.52 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    44.46 TFLOPS       33.78 TFLOPS      -10.68 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    42.16 TFLOPS       38.19 TFLOPS       -3.97 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 49.92 TFLOPS       35.66 TFLOPS      -14.26 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  38.29 TFLOPS       36.30 TFLOPS       -1.99 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   48.82 TFLOPS       36.17 TFLOPS      -12.65 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 41.32 TFLOPS       37.70 TFLOPS       -3.62 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   35.51 TFLOPS       27.65 TFLOPS       -7.86 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   23.20 TFLOPS       26.28 TFLOPS        3.08 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  62.19 TFLOPS       43.96 TFLOPS      -18.23 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   47.59 TFLOPS       33.73 TFLOPS      -13.86 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  51.36 TFLOPS       34.00 TFLOPS      -17.36 TFLOPS

When testing models with llama-bench, I can see significant positive differences in pp for k-quants, but it seems that this change is quite negative for the legacy quants.

model                  size      params  backend  ngl  test   t/s (master)      t/s (PR)
llama 8B Q4_0          5.61 GiB  8.03 B   Vulkan   99   pp512  4260.97 ± 46.72   3738.02 ± 49.37
llama 8B Q8_0          7.95 GiB  8.03 B   Vulkan   99   pp512  4229.91 ± 16.62   3695.16 ± 38.69
llama 8B Q4_K - Small  4.36 GiB  8.03 B   Vulkan   99   pp512  3064.17 ± 82.64   3331.21 ± 24.80
llama 7B Q6_K          5.53 GiB  7.24 B   Vulkan   99   pp512  2864.05 ± 6.29    3180.76 ± 7.12

Any idea what's going on?

jeffbolznv (Collaborator, Author)

> When testing models with llama-bench, I can see significant positive differences in pp for k-quants, but it seems that this change is quite negative for the legacy quants

Was this with coopmat2 or coopmat1? And was it comparing this PR against master or #12312? I'll try to reproduce it.

0cc4m (Collaborator) commented Mar 11, 2025

> When testing models with llama-bench, I can see significant positive differences in pp for k-quants, but it seems that this change is quite negative for the legacy quants
>
> Was this with coopmat2 or coopmat1? And was it comparing this PR against master or #12312? I'll try to reproduce it.

Against master, and coopmat2. I tested only the last PR to check the combined difference; I didn't have time to narrow it down further.

jeffbolznv (Collaborator, Author)

Looks like the regression is from #12258, which is a surprise because I had tested Q4_0 on a smaller model. I'll do some more testing, but I may just end up restoring the large tile size to what it was.

jeffbolznv (Collaborator, Author)

I'm going to update #12258 with these tile sizes:

        // spec constants and tile sizes for quant matmul (non-Qi_K)
        l_warptile_mmq = { 256, 128, 256, 64 };
        m_warptile_mmq = { 256, 128, 128, 64 };
        s_warptile_mmq = { 256, 32,  64, 128 };
        l_mmq_wg_denoms = { 128, 256, 1 };
        m_mmq_wg_denoms = { 128, 128, 1 };
        s_mmq_wg_denoms = { 32,  64,  1 };

This restores the performance for pp128/256/512, but there are some slowdowns for odd sizes; those are addressed by the other PRs. Here's what I'm seeing for the Q8_0 model with the new tile sizes applied on top of this PR with all its changes:

	C:\models\Meta-Llama-3-8B-Instruct.Q8_0.gguf
        test	master		12319 + fixed tile size		speedup
        pp5	169.8		184.7		8.78%
       pp10	169.8		298.75		75.94%
       pp20	341.27		599.83		75.76%
       pp31	510		881.23		72.79%
       pp32	538.67		911.84		69.28%
       pp33	536.7		927.03		72.73%
       pp48	783.03		1335.44		70.55%
       pp54	868		1419.57		63.54%
       pp63	1028.82		1720.97		67.28%
       pp64	1038.35		1915.35		84.46%
       pp65	1038.4		1387.63		33.63%
       pp80	1249.89		1643.02		31.45%
       pp96	1556.72		1931.27		24.06%
      pp112	1709.25		2194.72		28.40%
      pp113	1722.66		2174.12		26.21%
      pp127	1965.74		2392.2		21.69%
      pp128	2262.21		2462.53		8.86%
      pp129	1582.1		1993.86		26.03%
      pp140	1722.35		2121.21		23.16%
      pp160	1942.86		2472.74		27.27%
      pp180	2172		2655.97		22.28%
      pp192	2255.96		2847.36		26.22%
      pp200	2429.77		2965.03		22.03%
      pp210	2432.29		3064.34		25.99%
      pp230	2652.54		3225.7		21.61%
      pp248	2780.31		3459.56		24.43%
      pp255	2896.08		3544.59		22.39%
      pp256	3571.14		3589.26		0.51%
      pp257	2026.1		2572.38		26.96%
      pp280	2191.4		2821.76		28.77%
      pp300	2341.84		2930.23		25.13%
      pp320	2450.23		3070.16		25.30%
      pp350	2639.66		3218.96		21.95%
      pp384	3014.88		3406.29		12.98%
      pp410	2661.47		3288.71		23.57%
      pp448	2841.52		3528.77		24.19%
      pp480	3018.87		3665.8		21.43%
      pp490	3063.8		3742.95		22.17%
      pp511	3140.85		3882.46		23.61%
      pp512	3742.92		3845.86		2.75%
      pp513	3258.42		3353.9		2.93%
      pp767	3388.73		3724.7		9.91%
      pp768	3638.22		3748.72		3.04%
      pp769	2910.19		3284.2		12.85%
     pp1023	3383.05		3826.83		13.12%
     pp1024	3715.64		3803.52		2.37%
     pp1025	3440.35		3569.74		3.76%
     pp2047	3440.01		3720.55		8.16%
     pp2048	3614.77		3715.93		2.80%
     pp2049	3443.32		3556.74		3.29%
