Draft: [wip] Rocm sparse fix #1868
petrex wants to merge 23 commits into pytorch:main from petrex:rocm_sparse_fix (base: main)
+163 −85
Conversation
Update GPU architecture check to use gcnArchName and improve detection of gfx942 support
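As a hedged illustration of what such a check can look like at the HIP runtime level (the PR performs it in `setup.py`, presumably through PyTorch's device properties; the function and file below are a sketch, not the PR's code), `hipDeviceProp_t::gcnArchName` reports strings such as `"gfx942:sramecc+:xnack-"`:

```cpp
#include <hip/hip_runtime.h>
#include <cstring>
#include <cstdio>

// Sketch only: detect gfx942 (MI300-class) support by prefix-matching the
// architecture name reported by the HIP runtime.
bool device_is_gfx942(int device_id) {
  hipDeviceProp_t props;
  if (hipGetDeviceProperties(&props, device_id) != hipSuccess) {
    return false;
  }
  return std::strncmp(props.gcnArchName, "gfx942", 6) == 0;
}

int main() {
  std::printf("gfx942 support: %d\n", device_is_gfx942(0));
  return 0;
}
```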
Reorganize source file selection logic for CUDA and ROCm builds, improving conditional handling of GPU sources and CUTLASS kernels. Simplify the source file selection process and improve readability of the build configuration.
Modify CUTLASS kernel configuration to explicitly check for non-ROCm platforms when enabling support, ensuring more precise build configuration for different GPU environments.
Move source file collection logic to maintain consistent code organization and improve readability of the build configuration. No functional changes were made to the source file selection process.
Remove the `-t=0` flag from NVCC compilation options, which appears to be unnecessary. This simplifies the compilation configuration without impacting build behavior.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1868
Note: links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure: as of commit 3e5a411 with merge base ce05b3f, one new job failure was reported.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add conditional compilation for ROCm platforms in the sparse Marlin matrix multiply accumulate (MMA) function. This ensures proper inline assembly implementation for both CUDA and ROCm environments, using platform-specific register and instruction handling.
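The shape of that split, as a minimal sketch (the PR's actual sparse Marlin `mma()` carries fragment operands and issues real instructions on both branches; everything here is illustrative, including the `USE_ROCM` guard, which follows the PyTorch convention):

```cpp
// Illustrative skeleton of the platform split described above.
__device__ inline void mma_sp_dispatch() {
#if defined(USE_ROCM)
  // AMD path: MFMA inline assembly with vector-register constraints
  // (a concrete sketch appears after the MFMA commit below).
#else
  // NVIDIA path: PTX sparse tensor-core instruction, e.g. a
  // "mma.sp.sync.aligned..." asm volatile block.
#endif
}
```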
Use __builtin_bit_cast to correctly convert float pairs to packed half-precision values stored as uint32_t on AMD GPU platforms, ensuring proper type handling in the sparse Marlin matrix multiply accumulate (MMA) implementation.
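A minimal sketch of that conversion for the ROCm path, assuming a pack-two-floats helper in the spirit of `to_half4` (the name `pack_half2` and the rounding intrinsic are assumptions):

```cpp
#include <cstdint>
#include <hip/hip_fp16.h>

// Hypothetical helper, per the commit above: convert a float pair to a
// packed half2 and expose the 32-bit pattern as uint32_t.
__device__ inline uint32_t pack_half2(float a, float b) {
  __half2 h = __floats2half2_rn(a, b);     // round-to-nearest-even conversion
  return __builtin_bit_cast(uint32_t, h);  // reinterpret bits, no value change
}
```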
Update CUDA half-precision operations to use the __hsub2 and __hfma2 intrinsics, improving performance and precision in sparse matrix multiply-accumulate (MMA) computations.
Update AMD GPU implementation to use __hsub2 and __hmul2 intrinsics for improved performance and precision in half-precision sparse matrix multiply-accumulate computations.
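Putting the two commits together, a hedged sketch of a dequant-style step using packed-half intrinsics on both platforms (`dequant_step` and its signature are illustrative; the PR's `dequant_4bit`/`dequant_8bit` differ in detail):

```cpp
#if defined(USE_ROCM)
#include <hip/hip_fp16.h>
#else
#include <cuda_fp16.h>
#endif

// Illustrative only: computes (q - bias) * scale + acc on packed half2 lanes.
__device__ inline __half2 dequant_step(__half2 q, __half2 bias,
                                       __half2 scale, __half2 acc) {
  __half2 centered = __hsub2(q, bias);            // remove the zero-point
#if defined(USE_ROCM)
  return __hadd2(__hmul2(centered, scale), acc);  // separate mul + add (AMD)
#else
  return __hfma2(centered, scale, acc);           // fused multiply-add (CUDA)
#endif
}
```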
Update AMD GPU implementation to use __builtin_amdgcn_fmul_f32 instead of __builtin_amdgcn_fmul_legacy for more accurate float multiplication in the scale_floats function.
Include necessary ROCm-specific headers for HIP runtime and half-precision operations, with comments addressing potential compiler and architecture considerations for AMD GPU platforms.
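The include block in question plausibly looks like this (guard macro per the PyTorch convention; treat it as a sketch):

```cpp
#if defined(USE_ROCM)
#include <hip/hip_runtime.h>  // HIP runtime types and device macros
#include <hip/hip_fp16.h>     // __half2 and the __hsub2/__hmul2 family
#endif
```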
Replace __builtin_amdgcn_fmul_f32 with __ocml_fmul_f32 for more accurate and consistent float multiplication in the scale_floats function on AMD GPU platforms.
Replace __builtin_amdgcn_global_load_lds with inline assembly using the ds_load_b128 instruction for more precise and direct global to local data store (LDS) transfer on MI300X AMD GPUs.
Replace __ocml_fmul_f32 with standard C++ multiplication for more readable and straightforward float scaling on AMD MI300X GPUs.
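After that change, `scale_floats` reduces to ordinary IEEE multiplies; a sketch under an assumed signature (the commit only says the body became standard C++ multiplication):

```cpp
// Assumed signature; the compiler lowers each product to a plain v_mul_f32.
__device__ inline void scale_floats(float* out, const float* in,
                                    float s, int n) {
  for (int i = 0; i < n; ++i) {
    out[i] = in[i] * s;  // no builtin or OCML call needed
  }
}
```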
Update cudaFuncSetAttribute call to use reinterpret_cast for correct function pointer handling in the Marlin_24 CUDA kernel, ensuring proper dynamic shared memory configuration.
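A minimal reproduction of that fix (`Marlin_24_stub` stands in for the PR's `Marlin_24` template instantiation):

```cpp
#include <cuda_runtime.h>

__global__ void Marlin_24_stub() {}  // placeholder for the real kernel

void set_dynamic_smem(int max_shared_mem) {
  // ROCm's clang rejects the implicit function-pointer-to-void* conversion
  // that nvcc tolerates, so the cast is spelled out explicitly.
  cudaFuncSetAttribute(
      reinterpret_cast<const void*>(&Marlin_24_stub),
      cudaFuncAttributeMaxDynamicSharedMemorySize,
      max_shared_mem);
}
```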
Refactor cp_async4 functions for ROCm to use explicit ds_load instructions for 4, 8, and 16-byte transfers. Add a fallback mechanism using __builtin_memcpy for unsupported sizes, improving the precision and flexibility of global to local data store (LDS) transfers on MI300X AMD GPUs.
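A hedged sketch of that commit's shape, with the CUDA path included for contrast (the name mirrors the PR's `cp_async4`; the signature is an assumption, and the predication/zero-fill variants are omitted):

```cpp
#include <cstdint>

__device__ inline void cp_async4(void* smem_ptr, const void* glob_ptr) {
#if defined(USE_ROCM)
  // Fallback path from this commit: a plain 16-byte copy that the compiler
  // lowers to a global load plus an LDS write.
  __builtin_memcpy(smem_ptr, glob_ptr, 16);
#else
  // CUDA path: asynchronous 16-byte global->shared copy (sm_80+).
  uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
               :: "r"(smem), "l"(glob_ptr));
#endif
}
```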
Add missing closing braces in cp_async4_pred_zfill, cp_async4_pred, and cp_async4 functions to ensure proper code structure and prevent potential compilation issues in the ROCm sparse Marlin MMA implementation.
Simplify ROCm global to LDS transfer by removing the fallback __builtin_memcpy from the cp_async4_pred_zfill, cp_async4_pred, and cp_async4 functions, reducing code complexity while maintaining the primary ds_load_b128 transfer mechanism.
Simplify ROCm global to LDS transfer further by removing the 16-byte ds_load_b128 instruction from the cp_async4_pred_zfill, cp_async4_pred, and cp_async4 functions, further reducing code complexity while maintaining the core transfer mechanism.
Replace global_load_dwordx4 with multiple ds_read_b32 instructions for better compatibility and support across different ROCm platforms. Modify the ldsm4 and ldsm4_t functions to use more widely supported memory load techniques.
Update ldsm4_m device function to use separate ds_read_b32 instructions instead of a single ds_read_b64, improving compatibility and load behavior on ROCm platforms.
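The ldsm4/ldsm4_t/ldsm4_m commits above all come down to issuing narrow 32-bit LDS reads. Expressed as plain C++ under an assumed signature (the PR writes these as explicit inline-assembly `ds_read_b32` instructions; ordinary 32-bit loads from LDS typically compile to the same instruction and are easier to show correctly here):

```cpp
#include <cstdint>

// Illustrative only: four independent 32-bit reads, one ds_read_b32 per
// element when lds_ptr points into LDS, instead of a single wide load.
__device__ inline void ldsm4(uint32_t frag[4], const uint32_t* lds_ptr) {
#pragma unroll
  for (int i = 0; i < 4; ++i) {
    frag[i] = lds_ptr[i];
  }
}
```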
Modify the MFMA instruction assembly for AMD GPUs to use correct syntax and operand handling. Replace register constraints with vector register constraints and simplify the instruction format to improve compatibility and readability on ROCm platforms.
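For the MFMA commit, a hedged sketch using the compiler builtin that such inline assembly corresponds to (instruction choice, tile shape, and vector types are assumptions for gfx90a/gfx942-class hardware; the PR itself uses asm with "v" vector-register constraints):

```cpp
// Clang extended vector types matching the MFMA operand shapes.
typedef _Float16 half4_t __attribute__((ext_vector_type(4)));
typedef float float4_t __attribute__((ext_vector_type(4)));

#if defined(__gfx90a__) || defined(__gfx942__)
__device__ inline float4_t mfma_16x16x16(half4_t a, half4_t b, float4_t acc) {
  // D = A*B + C on a 16x16x16 f16 tile; the trailing zeros are the
  // cbsz/abid/blgp modifiers, unused here.
  return __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, acc, 0, 0, 0);
}
#endif
```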
Labels: ciflow/rocm, CLA Signed, module: rocm, topic: bug fix
TLDR: fix sparse marlin kernel for rocm

This pull request includes several updates to the setup.py file and modifications to CUDA and ROCm specific code in the torchao/csrc/cuda/sparse_marlin directory. The most important changes focus on improving compatibility with ROCm, updating assertions, and refining the build process for CUDA and ROCm extensions.

Updates to setup.py:
- Removed the -t=0 flag from the nvcc compile arguments.

Modifications to CUDA and ROCm specific code:
- Updated the cp_async4_pred_zfill, cp_async4_pred, and cp_async4 functions to use the LDS.G instruction for global to LDS transfers on MI300X. [1] [2] [3]
- Changed mma.h to support the ROCm architecture and improve performance on MI300X. [1] [2] [3] [4]
- Updated to_half4, dequant_4bit, dequant_8bit, scale, and scale_floats to use appropriate ROCm intrinsics and improve compatibility. [1] [2] [3] [4] [5]