0.2.0 (2025-02-26)
- MP31 Features:
- Squad-level MMA (SQMMA) and warp-level MMA primitives with rich data types (TF32, FP16, BF16, FP8, S8, etc.).
- Tensor Memory Engine (TME) and RobustBufferAccess primitives.
- New GEMM mainloop and epilogue targeting the MP31 architecture that achieve high performance with TME and SQMMA.
- New tile scheduler to support CTA swizzle for MP31 kernels.
- New experimental directory housing implementations that are not yet stable and may change significantly in the future.
- Prototype of Flash Attention Forward targeting the MP31 architecture with TME, RobustBufferAccess, and SQMMA.
- New FP8 GEMM with groupwise scaling.
- Upgrade the backend from CUTLASS/CuTe 3.5.0 to CUTLASS/CuTe 3.6.0.
0.1.1 (2024-09-30)
- MuTe, a core library and backend adapted from CUTLASS CuTe
- Quyuan Features:
- MMA primitives: TensorFloat32, BFloat16, Float16, INT8
- FMA/MMA GEMM Kernels targeting the Quyuan architecture
- Note: this is a beta release. Further updates to MUTLASS will include performance improvements, feature enablement, and possible breaking changes to the API.
- MUTLASS Profiler, Library, and Utilities
- Two examples that demonstrate how to build GEMM kernels with the low-level API and the collective builders