flashmla by deepseek-ai

Description: FlashMLA: Efficient Multi-head Latent Attention Kernels

View deepseek-ai/flashmla on GitHub ↗

Summary Information

Updated 6 minutes ago
Added to GitGenius on January 28th, 2026
Created on February 21st, 2025
Open Issues/Pull Requests: 90
Forks: 990
Stargazers: 12,501
Subscribers: 108
Detailed Description

The DeepSeek-AI FlashMLA repository provides a highly optimized implementation of the Multi-head Latent Attention (MLA) mechanism, the attention variant used in DeepSeek's large language models (LLMs). This implementation focuses on significant performance improvements, particularly in speed and memory efficiency, over standard implementations. The repository offers a PyTorch-compatible implementation designed to integrate easily into existing LLM training and inference pipelines. The primary goal is to accelerate the training and deployment of large-scale models, enabling faster iteration cycles and reduced computational costs.
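To see why latent attention improves memory efficiency during inference, the sketch below compares KV-cache sizes for standard multi-head attention versus a latent-compressed cache. All dimensions are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the repository.

```python
# Hypothetical model dimensions, for illustration only.
n_heads = 32        # attention heads
head_dim = 128      # per-head key/value dimension
latent_dim = 512    # shared compressed latent dimension (MLA)
seq_len = 4096      # cached tokens
bytes_per_elem = 2  # bf16

# Standard multi-head attention caches full K and V per head, per token.
mha_cache_bytes = seq_len * n_heads * head_dim * 2 * bytes_per_elem

# MLA instead caches one shared latent vector per token, from which
# keys and values are reconstructed on the fly during decoding.
mla_cache_bytes = seq_len * latent_dim * bytes_per_elem

print(mha_cache_bytes // mla_cache_bytes)  # → 16, i.e. a 16x smaller cache
```

Shrinking the cache by such a factor directly raises the batch size and context length that fit on a single GPU, which is where the deployment-cost savings come from.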

The core innovation lies in the optimized MLA kernel. This kernel leverages techniques like fused operations, custom CUDA kernels, and careful memory management to minimize the overhead associated with attention computations. Specifically, the implementation addresses the computational bottlenecks inherent in the attention mechanism, such as the matrix multiplications involved in calculating attention scores and applying the softmax function. The repository likely includes detailed explanations of these optimization strategies, along with performance benchmarks demonstrating the speedup achieved compared to baseline implementations. The focus on CUDA kernels suggests a strong emphasis on leveraging the parallel processing capabilities of GPUs.
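The two bottlenecks named above, the score matrix multiplication and the softmax, are easiest to see in a minimal pure-Python reference. This is a didactic sketch of scaled dot-product attention for a single query, not the repository's fused CUDA kernel:

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values):
    """Scaled dot-product attention for one query vector.

    q:      list[float] of length d
    keys:   list of vectors, each of length d
    values: list of vectors, each of length dv
    """
    d = len(q)
    # Attention scores q.k / sqrt(d) -- the matmul bottleneck the
    # fused kernels are designed to accelerate.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in keys]
    weights = softmax(scores)
    # Weighted sum of value vectors.
    dv = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dv)]
```

With two identical keys, `attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[1.0], [3.0]])` weights both values equally and returns `[2.0]`. An optimized kernel fuses these steps so the score matrix never round-trips through slow GPU memory.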

The repository likely includes a well-defined API, allowing users to easily incorporate the FlashMLA implementation into their existing PyTorch code. This ease of integration is crucial for adoption, as it allows researchers and engineers to quickly experiment with the optimized attention mechanism without requiring significant code refactoring. The documentation probably covers installation instructions, usage examples, and performance comparisons. Furthermore, the repository may offer different configurations and options to fine-tune the performance based on specific hardware and model architectures. This flexibility allows users to tailor the implementation to their specific needs.

Beyond the core MLA implementation, the repository might also include supporting utilities and tools. These could include scripts for benchmarking the performance of the optimized kernel, tools for profiling memory usage, and examples of how to integrate FlashMLA into different LLM architectures. The presence of such tools would further enhance the usability and accessibility of the repository, enabling users to effectively evaluate and optimize their models. The repository's structure and documentation are likely designed to facilitate collaboration and contributions from the community, fostering further development and improvement of the FlashMLA implementation.
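A benchmarking harness of the kind described can be as small as the stdlib sketch below; the kernel being timed and its arguments are placeholders, not the repository's actual tooling:

```python
import timeit

def bench(fn, *args, repeat=5, number=10):
    """Best average wall-clock time per call, in seconds.

    Taking the minimum over several trials filters out noise from
    other processes; `number` calls per trial amortize timer overhead.
    """
    timer = timeit.Timer(lambda: fn(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number

# Hypothetical usage: compare a baseline against an optimized kernel.
# baseline  = bench(naive_attention, q, k, v)
# optimized = bench(fused_attention, q, k, v)
# print(f"speedup: {baseline / optimized:.2f}x")
```

Reporting the speedup ratio rather than raw times makes results comparable across machines, which is the usual convention in kernel benchmarks.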

In essence, the DeepSeek-AI FlashMLA repository represents a valuable contribution to the field of LLM research and development. By providing a highly optimized and easily integrable MLA implementation, it empowers researchers and engineers to train and deploy larger and more efficient language models. The focus on performance optimization, combined with a user-friendly API and supporting tools, makes this repository a significant resource for anyone working on large-scale language models.
