DeepGEMM is a high-performance CUDA kernel library developed by DeepSeek AI, designed to accelerate the core computational primitives used in modern large language models (LLMs). Its primary purpose is to provide optimized implementations of essential operations, particularly General Matrix Multiplications (GEMMs), for NVIDIA GPUs. The library distinguishes itself through its focus on simplicity, efficiency, and a lightweight design, making it a valuable resource for both performance optimization and learning NVIDIA GPU kernel techniques.
The core functionality of DeepGEMM is GEMMs in FP8, FP4, and BF16 precisions. It also supports advanced features such as fused Mixture of Experts (MoE) with overlapped communication (Mega MoE), MQA scoring for the lightning indexer, and HyperConnection (HC), all of which matter for efficiently executing complex LLM architectures. A key design choice is the Just-In-Time (JIT) module, which compiles all kernels at runtime: no CUDA compilation is needed during installation, which streamlines deployment and keeps the library flexible.
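Because kernels are compiled on demand, the first call for a given shape and configuration pays a one-time compilation cost, and later calls hit the JIT cache. The snippet below is a minimal sketch of that workflow using one of the `fp8_gemm_*` functions described further down; the tuple-of-(tensor, scale) calling convention and the per-128-channel / per-128x128-block scaling-factor shapes are assumptions and may differ between library versions.

```python
import torch
import deep_gemm

# Minimal sketch: an FP8 GEMM whose kernel is JIT-compiled on first use.
# The (tensor, scaling-factor) tuple convention and the scale shapes below
# are assumptions; consult the repository for the exact signature.
m, n, k = 4096, 7168, 2048

a = torch.randn(m, k, device='cuda', dtype=torch.bfloat16).to(torch.float8_e4m3fn)
sfa = torch.ones(m, k // 128, device='cuda', dtype=torch.float32)         # per-token, per-128-channel scales
b = torch.randn(n, k, device='cuda', dtype=torch.bfloat16).to(torch.float8_e4m3fn)
sfb = torch.ones(n // 128, k // 128, device='cuda', dtype=torch.float32)  # per-128x128-block scales
d = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)

# The first call for this shape triggers runtime compilation via the JIT
# module and caches the binary; subsequent calls launch the cached kernel.
deep_gemm.fp8_gemm_nt((a, sfa), (b, sfb), d)
```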
DeepGEMM's architecture draws inspiration from established libraries like CUTLASS and CuTe, but it avoids heavy reliance on their template-based approaches. Instead, it prioritizes a clean and accessible codebase with a limited set of core kernel functions. This design choice makes DeepGEMM easier to understand, modify, and integrate into various projects. Despite its streamlined design, the library achieves performance levels that match or surpass those of expert-tuned libraries across a range of matrix shapes.
The library offers a variety of interfaces to cater to different use cases. For standard, non-grouped matrix multiplications it provides functions such as `fp8_gemm_{nt, nn, tn, tt}`. For MoE models, DeepGEMM supports grouped GEMMs with contiguous and masked layouts: the contiguous layout targets cases where all experts share the same N and K shape and the routed tokens are concatenated along the M axis, while the masked layout targets inference decoding with CUDA graphs, where per-expert token counts are not known when the graph is captured. Furthermore, DeepGEMM includes specialized kernels for the V3.2 MQA (multi-query attention) scoring used by the lightning indexer in DeepSeek models, available in both non-paged (prefilling) and paged (decoding) versions.
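The two grouped layouts can be pictured as follows. This is a conceptual sketch only, assuming a fixed M-alignment for the contiguous layout; tensor names such as `m_indices` and `masked_m` are illustrative rather than the library's exact API.

```python
import torch

num_experts, hidden = 8, 7168
alignment = 128  # assumed M-alignment required by the contiguous layout

# Contiguous layout: each expert's routed tokens are padded to the alignment
# and concatenated along M; a per-row index records which expert owns each row.
tokens_per_expert = [300, 512, 128, 64, 1024, 256, 700, 96]
m_padded = sum((t + alignment - 1) // alignment * alignment for t in tokens_per_expert)
x_contiguous = torch.empty(m_padded, hidden, device='cuda', dtype=torch.bfloat16)
m_indices = torch.empty(m_padded, device='cuda', dtype=torch.int32)  # row -> expert id

# Masked layout: every expert owns a fixed-size slot so tensor shapes never
# change, and `masked_m` records how many rows per expert are actually valid.
# This lets a captured CUDA graph be replayed while routing counts vary.
max_m_per_expert = 256
x_masked = torch.empty(num_experts, max_m_per_expert, hidden, device='cuda', dtype=torch.bfloat16)
masked_m = torch.empty(num_experts, device='cuda', dtype=torch.int32)
```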
A significant feature of DeepGEMM is its Mega MoE implementation, which fuses EP dispatch, the expert linear layers (FP8xFP4), the SwiGLU activation, and EP combine into a single kernel, overlapping NVLink communication with tensor core computation for a substantial performance gain. The library provides utilities for allocating symmetric memory buffers, transforming weights, and managing the input and output data for Mega MoE operations.
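To make clear what is being fused, the sketch below writes out the unfused per-expert pipeline in plain BF16 PyTorch. It is illustrative only: Mega MoE executes the equivalent FP8xFP4 computation, plus the surrounding EP dispatch and combine, inside one kernel, and the weight layout shown here is an assumption.

```python
import torch
import torch.nn.functional as F

def expert_ffn_reference(x_dispatched: torch.Tensor,
                         w_gate_up: torch.Tensor,   # (hidden, 2 * inter), assumed layout
                         w_down: torch.Tensor       # (inter, hidden), assumed layout
                         ) -> torch.Tensor:
    """Unfused reference for the per-expert computation that Mega MoE fuses.

    `x_dispatched` holds tokens already routed to one expert. In Mega MoE the
    EP dispatch that produces it, both linear layers, the SwiGLU activation,
    and the EP combine run in a single kernel, with NVLink communication
    overlapped against tensor core work.
    """
    gate_up = x_dispatched @ w_gate_up      # first linear layer (gate and up projections)
    gate, up = gate_up.chunk(2, dim=-1)
    h = F.silu(gate) * up                   # SwiGLU activation
    return h @ w_down                       # second linear layer back to hidden size
```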
DeepGEMM also includes a suite of utility functions that provide control over various aspects of the kernel execution. These functions allow users to set the maximum number of streaming multiprocessors (SMs) to use, configure tensor core utilization, enable or disable Programmatic Dependent Launch (PDL), and manage memory alignment for contiguous layouts. The library also provides functions for transforming scaling factors and aligning tensors for optimal performance.
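As a small illustration, the snippet below caps SM usage and queries the contiguous-layout alignment. The names `set_num_sms` and `get_m_alignment_for_contiguous_layout` follow DeepGEMM's documented utilities, but exact names and behavior may vary across versions, so treat this as a sketch.

```python
import deep_gemm

# Cap the number of SMs the GEMM kernels may occupy, e.g. to leave SMs free
# for an overlapped communication kernel.
deep_gemm.set_num_sms(112)

# Query the M-dimension alignment expected by the grouped contiguous layout
# and round a token count up to it before building the grouped input.
alignment = deep_gemm.get_m_alignment_for_contiguous_layout()
num_tokens = 3000
padded_m = (num_tokens + alignment - 1) // alignment * alignment
```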
The repository provides detailed instructions for installation and development, including requirements for specific NVIDIA GPU architectures, CUDA Toolkit versions, and supporting libraries like PyTorch and CUTLASS. It also offers a comprehensive set of environment variables that allow users to fine-tune the compilation process, control debugging information, and optimize performance based on their specific hardware and software configurations. DeepGEMM is released under the MIT License, making it freely available for use and modification.
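Because kernels are compiled at runtime, these environment variables are typically set before `deep_gemm` is imported. The variable names below are placeholders, not the library's documented names; check the repository README for the exact set supported by the installed version.

```python
import os

# Hypothetical variable names for illustration only; the real names are
# listed in the DeepGEMM README for the installed version.
os.environ['DG_JIT_CACHE_DIR'] = '/tmp/deep_gemm_cache'  # assumed: where JIT-compiled kernels are cached
os.environ['DG_JIT_DEBUG'] = '1'                         # assumed: verbose JIT/debug output

import deep_gemm  # imported after the environment is configured
```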