MooreThreads MUTLASS Changelog

0.2.0 (2025-02-26)

- MP31 Features:
  - Squad-level MMA (SQMMA) and Warp-level MMA primitives with rich data types (TF32/FP16/BF16/FP8/S8 etc.).
  - Tensor Memory Engine (TME) and RobustBufferAccess primitives.
- New GEMM mainloop and epilogue targeting the MP31 architecture that achieve high performance with TME and SQMMA.
- New tile scheduler to support CTA swizzle for MP31 kernels.
- New experimental directory housing implementations that are not yet stable and may change significantly in the future.
  - Prototype of Flash Attention Forward targeting the MP31 architecture with TME, RobustBufferAccess and SQMMA.
- New FP8 GEMM with groupwise scaling.
- Upgrade the backend from CUTLASS/CuTe 3.5.0 to CUTLASS/CuTe 3.6.0.

- MuTe, a core library and backend adapted from CUTLASS CuTe.
- Quyuan Features:
  - MMA primitives: TensorFloat32, BFloat16, Float16, INT8.
- FMA/MMA GEMM kernels targeting the Quyuan architecture.
  - Note: this is a beta release. Further updates to MUTLASS will include performance improvements, feature enablement, and possible breaking changes to the API.
- MUTLASS Profiler, Library, and Utilities.
- Two examples that demonstrate the usage of the low-level API and the collective builders to build GEMM kernels.