cuBLASLt Grouped GEMM Access

int m = params.m, n = params.n, k = params.k; float h_alpha = params.alpha; void* workspace = nullptr; size_t workspaceSize = 32 * ...

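To use Grouped GEMM effectively, you generally follow this workflow: create a cuBLASLt handle, describe the operand matrices with layout descriptors, create a matmul descriptor, select an algorithm that fits the available workspace, and call cublasLtMatmul.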

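In standard GEMM, $M, N, K$ are scalars. In Grouped GEMM, they are effectively arrays with one entry per group (or are handled via strides and offsets). However, the most common high-performance pattern in cuBLASLt is to set the matrix layout attributes for batch count and batch stride (CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT and CUBLASLT_MATRIX_LAYOUT_STRIDED_BATCH_OFFSET). Note that strided batching requires every matrix in the batch to share the same shape, so it covers only the uniform case of Grouped GEMM.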
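To make the difference between per-group sizes and a strided batch concrete, here is a minimal CPU reference sketch. It is not cuBLASLt code: `GemmProblem`, `grouped_gemm_reference`, and `make_strided_batch` are hypothetical names chosen for illustration, and the matrices are row-major for readability (cuBLAS itself defaults to column-major).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-group problem description; not a cuBLASLt type.
struct GemmProblem {
    int m, n, k;      // A is m x k, B is k x n, C is m x n
    const float* A;   // row-major, leading dimension k
    const float* B;   // row-major, leading dimension n
    float* C;         // row-major, leading dimension n
};

// Reference grouped GEMM: C_g = alpha * A_g * B_g + beta * C_g for each group g.
// Every group may carry its own (m, n, k).
void grouped_gemm_reference(const std::vector<GemmProblem>& groups,
                            float alpha, float beta) {
    for (const GemmProblem& p : groups) {
        assert(p.m >= 0 && p.n >= 0 && p.k >= 0);
        for (int i = 0; i < p.m; ++i) {
            for (int j = 0; j < p.n; ++j) {
                float acc = 0.0f;
                for (int l = 0; l < p.k; ++l) {
                    acc += p.A[i * p.k + l] * p.B[l * p.n + j];
                }
                float& c = p.C[i * p.n + j];
                c = alpha * acc + beta * c;
            }
        }
    }
}

// Strided batching is the special case in which all groups share (m, n, k)
// and the operands of group g start at base + g * stride.
std::vector<GemmProblem> make_strided_batch(int batch, int m, int n, int k,
                                            const float* A, std::ptrdiff_t strideA,
                                            const float* B, std::ptrdiff_t strideB,
                                            float* C, std::ptrdiff_t strideC) {
    assert(batch >= 0);
    std::vector<GemmProblem> groups;
    groups.reserve(static_cast<std::size_t>(batch));
    for (int g = 0; g < batch; ++g) {
        groups.push_back({m, n, k, A + g * strideA, B + g * strideB, C + g * strideC});
    }
    return groups;
}
```

A reference like this is mainly useful for validating GPU results: the grouped form accepts arbitrary shapes per group, while the strided form only needs one shape plus three strides, which is why it maps onto batch-count and stride attributes.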

cublasLtMatmulDesc_t matmulDesc; cublasLtMatmulDescCreate(&matmulDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

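cuBLASLt supports FP16, BF16, and INT8 operands, as well as TF32 compute for FP32 operands. The data types are fixed per matrix layout and the compute type per matmul descriptor, so every matrix in a single batched call shares one type combination. Floating-point inputs typically accumulate in FP32, while INT8 inputs accumulate in INT32, a pairing well suited to transformer workloads.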
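The reason for pairing narrow inputs with a wider accumulator can be seen on the CPU. The following is a minimal sketch, assuming INT8 inputs and a 32-bit integer accumulator (chosen here because standard C++ has no widely available FP16 type). `dot_i8_acc32` is a hypothetical helper, not a cuBLASLt function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: dot product of two INT8 vectors with a 32-bit accumulator.
// Each product is widened before the multiply, so neither the products nor the
// running sum are limited to the 8-bit range of the inputs.
std::int32_t dot_i8_acc32(const std::vector<std::int8_t>& a,
                          const std::vector<std::int8_t>& b) {
    assert(a.size() == b.size());
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    return acc;
}
```

With 64 elements that are all 100, the exact result is 640000, far outside what an 8-bit or 16-bit accumulator could hold. The same argument motivates FP32 accumulation for FP16 and BF16 inputs.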

Here is a deep dive into cublasLtMatmul with grouped GEMM functionality.

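// 3. Create Matmul Descriptor (FP32 compute, FP32 alpha/beta scale type)
cublasLtMatmulDesc_t matmulDesc;
cublasLtMatmulDescCreate(&matmulDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
// Set transposes if needed (N/N is standard)
cublasOperation_t transa = CUBLAS_OP_N;
cublasOperation_t transb = CUBLAS_OP_N;
cublasLtMatmulDescSetAttribute(matmulDesc, CUBLASLT_MATMUL_DESC_TRANSA, &transa, sizeof(transa));
cublasLtMatmulDescSetAttribute(matmulDesc, CUBLASLT_MATMUL_DESC_TRANSB, &transb, sizeof(transb));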