DeepGEMM is a high-performance CUDA kernel library developed by DeepSeek AI, designed to accelerate the core computational primitives used in modern large language models (LLMs). Its primary purpose is to provide optimized implementations of essential operations, particularly General Matrix Multiplications (GEMMs), for NVIDIA GPUs. The library distinguishes itself through its focus on simplicity, efficiency, and a lightweight design, making it a valuable resource for both performance optimization and learning NVIDIA GPU kernel techniques.
The core functionality of DeepGEMM is GEMMs in FP8, FP4, and BF16 precisions. It also supports advanced features such as fused Mixture of Experts (MoE) with overlapped communication (Mega MoE), MQA scoring for the lightning indexer, and HyperConnection (HC), all of which matter for efficiently executing complex LLM architectures. A key design choice is the Just-In-Time (JIT) module, which compiles all kernels at runtime: no CUDA compilation is needed during installation, which streamlines deployment and keeps the library flexible.
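Because kernels are compiled on demand, the first call for a given shape and configuration pays a one-time compilation cost, and later calls hit the JIT cache. The snippet below is a minimal sketch of that workflow using one of the `fp8_gemm_*` functions described further down; the tuple-of-(tensor, scale) calling convention and the per-128-channel / per-128x128-block scaling-factor shapes are assumptions and may differ between library versions.

```python
import torch
import deep_gemm

# Minimal sketch: an FP8 GEMM whose kernel is JIT-compiled on first use.
# The (tensor, scaling-factor) tuple convention and the scale shapes below
# are assumptions; consult the repository for the exact signature.
m, n, k = 4096, 7168, 2048

a = torch.randn(m, k, device='cuda', dtype=torch.bfloat16).to(torch.float8_e4m3fn)
sfa = torch.ones(m, k // 128, device='cuda', dtype=torch.float32)         # per-token, per-128-channel scales
b = torch.randn(n, k, device='cuda', dtype=torch.bfloat16).to(torch.float8_e4m3fn)
sfb = torch.ones(n // 128, k // 128, device='cuda', dtype=torch.float32)  # per-128x128-block scales
d = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)

# The first call for this shape triggers runtime compilation via the JIT
# module and caches the binary; subsequent calls launch the cached kernel.
deep_gemm.fp8_gemm_nt((a, sfa), (b, sfb), d)
```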
DeepGEMM's architecture draws inspiration from established libraries like CUTLASS and CuTe, but it avoids heavy reliance on their template-based approaches. Instead, it prioritizes a clean and accessible codebase with a limited set of core kernel functions. This design choice makes DeepGEMM easier to understand, modify, and integrate into various projects. Despite its streamlined design, the library achieves performance levels that match or surpass those of expert-tuned libraries across a range of matrix shapes.
The library offers a variety of interfaces to cater to different use cases. For standard, non-grouped matrix multiplications it provides functions such as `fp8_gemm_{nt, nn, tn, tt}`. For MoE models, DeepGEMM supports grouped GEMMs with contiguous and masked layouts: the contiguous layout targets cases where all experts share the same N and K shape and the routed tokens are concatenated along the M axis, while the masked layout targets inference decoding with CUDA graphs, where per-expert token counts are not known when the graph is captured. Furthermore, DeepGEMM includes specialized kernels for the V3.2 MQA (multi-query attention) scoring used by the lightning indexer in DeepSeek models, available in both non-paged (prefilling) and paged (decoding) versions.
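The two grouped layouts can be pictured as follows. This is a conceptual sketch only, assuming a fixed M-alignment for the contiguous layout; tensor names such as `m_indices` and `masked_m` are illustrative rather than the library's exact API.

```python
import torch

num_experts, hidden = 8, 7168
alignment = 128  # assumed M-alignment required by the contiguous layout

# Contiguous layout: each expert's routed tokens are padded to the alignment
# and concatenated along M; a per-row index records which expert owns each row.
tokens_per_expert = [300, 512, 128, 64, 1024, 256, 700, 96]
m_padded = sum((t + alignment - 1) // alignment * alignment for t in tokens_per_expert)
x_contiguous = torch.empty(m_padded, hidden, device='cuda', dtype=torch.bfloat16)
m_indices = torch.empty(m_padded, device='cuda', dtype=torch.int32)  # row -> expert id

# Masked layout: every expert owns a fixed-size slot so tensor shapes never
# change, and `masked_m` records how many rows per expert are actually valid.
# This lets a captured CUDA graph be replayed while routing counts vary.
max_m_per_expert = 256
x_masked = torch.empty(num_experts, max_m_per_expert, hidden, device='cuda', dtype=torch.bfloat16)
masked_m = torch.empty(num_experts, device='cuda', dtype=torch.int32)
```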
A significant feature of DeepGEMM is its Mega MoE implementation, which fuses EP dispatch, the expert linear layers (FP8xFP4), the SwiGLU activation, and EP combine into a single kernel, overlapping NVLink communication with tensor core computation for a substantial performance gain. The library provides utilities for allocating symmetric memory buffers, transforming weights, and managing the input and output data for Mega MoE operations.
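To make clear what is being fused, the sketch below writes out the unfused per-expert pipeline in plain BF16 PyTorch. It is illustrative only: Mega MoE executes the equivalent FP8xFP4 computation, plus the surrounding EP dispatch and combine, inside one kernel, and the weight layout shown here is an assumption.

```python
import torch
import torch.nn.functional as F

def expert_ffn_reference(x_dispatched: torch.Tensor,
                         w_gate_up: torch.Tensor,   # (hidden, 2 * inter), assumed layout
                         w_down: torch.Tensor       # (inter, hidden), assumed layout
                         ) -> torch.Tensor:
    """Unfused reference for the per-expert computation that Mega MoE fuses.

    `x_dispatched` holds tokens already routed to one expert. In Mega MoE the
    EP dispatch that produces it, both linear layers, the SwiGLU activation,
    and the EP combine run in a single kernel, with NVLink communication
    overlapped against tensor core work.
    """
    gate_up = x_dispatched @ w_gate_up      # first linear layer (gate and up projections)
    gate, up = gate_up.chunk(2, dim=-1)
    h = F.silu(gate) * up                   # SwiGLU activation
    return h @ w_down                       # second linear layer back to hidden size
```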
DeepGEMM also includes a suite of utility functions that provide control over various aspects of the kernel execution. These functions allow users to set the maximum number of streaming multiprocessors (SMs) to use, configure tensor core utilization, enable or disable Programmatic Dependent Launch (PDL), and manage memory alignment for contiguous layouts. The library also provides functions for transforming scaling factors and aligning tensors for optimal performance.
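As a small illustration, the snippet below caps SM usage and queries the contiguous-layout alignment. The names `set_num_sms` and `get_m_alignment_for_contiguous_layout` follow DeepGEMM's documented utilities, but exact names and behavior may vary across versions, so treat this as a sketch.

```python
import deep_gemm

# Cap the number of SMs the GEMM kernels may occupy, e.g. to leave SMs free
# for an overlapped communication kernel.
deep_gemm.set_num_sms(112)

# Query the M-dimension alignment expected by the grouped contiguous layout
# and round a token count up to it before building the grouped input.
alignment = deep_gemm.get_m_alignment_for_contiguous_layout()
num_tokens = 3000
padded_m = (num_tokens + alignment - 1) // alignment * alignment
```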
The repository provides detailed instructions for installation and development, including requirements for specific NVIDIA GPU architectures, CUDA Toolkit versions, and supporting libraries like PyTorch and CUTLASS. It also offers a comprehensive set of environment variables that allow users to fine-tune the compilation process, control debugging information, and optimize performance based on their specific hardware and software configurations. DeepGEMM is released under the MIT License, making it freely available for use and modification.
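Because kernels are compiled at runtime, these environment variables are typically set before `deep_gemm` is imported. The variable names below are placeholders, not the library's documented names; check the repository README for the exact set supported by the installed version.

```python
import os

# Hypothetical variable names for illustration only; the real names are
# listed in the DeepGEMM README for the installed version.
os.environ['DG_JIT_CACHE_DIR'] = '/tmp/deep_gemm_cache'  # assumed: where JIT-compiled kernels are cached
os.environ['DG_JIT_DEBUG'] = '1'                         # assumed: verbose JIT/debug output

import deep_gemm  # imported after the environment is configured
```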