flash-attention
by
dao-ailab

Description: Fast and memory-efficient exact attention

View dao-ailab/flash-attention on GitHub ↗

Summary Information

Updated 3 hours ago
Added to GitGenius on February 25th, 2026
Created on May 19th, 2022
Open Issues/Pull Requests: 1,061 (+0)
Number of forks: 2,415
Total Stargazers: 22,359 (+1)
Total Subscribers: 147 (+0)
Detailed Description

The "dao-ailab/flash-attention" repository provides an optimized implementation of the attention mechanism, a core component of modern deep learning models, particularly those based on transformers. The primary goal of this repository is to offer a faster and more memory-efficient alternative to the standard attention implementations found in popular deep learning frameworks like PyTorch. The project offers multiple versions, including FlashAttention, FlashAttention-2, and FlashAttention-3, each building upon the previous to improve performance and expand functionality.

The core functionality revolves around the `flash_attn_func` and `flash_attn_qkvpacked_func` functions, which implement scaled dot-product attention. `flash_attn_func` takes separate query (Q), key (K), and value (V) tensors as input, while the qkvpacked variant takes the same three tensors stacked into a single tensor; both compute the attention output. FlashAttention achieves its performance gains through several key optimizations. It is IO-aware: the kernels are designed to minimize data movement between the GPU's high-bandwidth memory (HBM) and on-chip SRAM, a major bottleneck in standard attention implementations. This is achieved through tiling and recomputation, which allow efficient use of the GPU's memory hierarchy. FlashAttention-2 further improves parallelism and work partitioning, and FlashAttention-3 is specifically optimized for NVIDIA's Hopper GPUs (H100/H800), offering further performance gains.
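The tiling-with-recomputation idea can be illustrated without a GPU. The sketch below is a plain NumPy analogue (not the repository's CUDA kernels): it processes K/V in blocks with an "online" softmax, keeping only running per-row statistics instead of ever materializing the full seqlen × seqlen score matrix, and it matches a naive reference implementation.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference: softmax(q @ k^T / sqrt(d)) @ v, materializing all scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=4):
    """Process K/V in tiles with an online softmax -- a NumPy sketch of
    the tiling idea behind FlashAttention, not the actual kernel."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v)
    row_max = np.full(n, -np.inf)   # running max per query row
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T * scale                      # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=-1))
        correction = np.exp(row_max - new_max)    # rescale earlier partials
        p = np.exp(s - new_max[:, None])
        out = out * correction[:, None] + p @ vb
        row_sum = row_sum * correction + p.sum(axis=-1)
        row_max = new_max
    return out / row_sum[:, None]
```

Because each tile's contribution is rescaled as the running maximum grows, the result is exact, not an approximation — the same property the CUDA implementation relies on.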

The repository's main features include support for various data types (FP16, BF16, and FP8 in FlashAttention-3), multi-query and grouped-query attention (MQA/GQA), causal masking (essential for autoregressive models), sliding window attention (local attention), ALiBi (attention with linear bias), a deterministic backward pass, and a paged KV cache (for efficient memory management). The implementation supports both NVIDIA CUDA and AMD ROCm, with two backends available for ROCm (Composable Kernel and Triton); the Triton backend supports a wider range of AMD GPUs and data types. The repository also provides a `flash_attn_with_kvcache` function, designed for efficient inference, particularly for iterative decoding scenarios where the KV cache can be updated in place.
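What causal masking and sliding-window (local) attention compute can be shown in a few lines of NumPy. This is a hedged reference sketch of the masking semantics, not the library's API: `window` here is a hypothetical left-window size, whereas the library expresses locality through its own parameters.

```python
import numpy as np

def masked_attention(q, k, v, causal=False, window=None):
    """NumPy reference for masked attention: disallowed positions are set
    to -inf before the softmax, so they receive zero weight.
    Single head, shapes (seqlen, headdim); `window` is illustrative only."""
    n = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    i, j = np.indices((n, n))
    allowed = np.ones((n, n), dtype=bool)
    if causal:
        allowed &= j <= i            # each query sees only itself and earlier keys
    if window is not None:
        allowed &= i - j < window    # and only the most recent `window` keys
    scores = np.where(allowed, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `causal=True`, the first query can attend only to the first key, so the first output row equals `v[0]`; with a window of 1 on top of causal masking, every position attends only to itself.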

The purpose of this repository is to accelerate the training and inference of transformer-based models. By providing a faster and more memory-efficient attention mechanism, FlashAttention enables researchers and practitioners to train larger models, process longer sequences, and reduce the computational cost of their projects. The repository is actively maintained and updated, with new features and optimizations being added regularly. The project is open-source and available under a permissive license, encouraging widespread adoption and modification. The repository also provides clear instructions for installation and usage, including example code snippets and links to relevant research papers.
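The iterative-decoding pattern that `flash_attn_with_kvcache` serves — appending each new step's key/value into a preallocated cache in place, then attending over all cached positions — can be sketched in NumPy. The helper below is hypothetical (single head, no batching, naive attention), intended only to show the cache-update pattern, not the library's signature.

```python
import numpy as np

def decode_step(q_new, k_new, v_new, k_cache, v_cache, seqlen):
    """One autoregressive decoding step against a preallocated KV cache,
    updated in place. Hypothetical illustration: vectors of shape (headdim,),
    caches of shape (max_seqlen, headdim)."""
    k_cache[seqlen] = k_new              # write this step's K/V in place
    v_cache[seqlen] = v_new
    keys = k_cache[:seqlen + 1]          # attend over all cached positions
    vals = v_cache[:seqlen + 1]
    s = keys @ q_new / np.sqrt(q_new.shape[0])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ vals
```

Because the cache is written in place rather than reallocated each step, decoding avoids repeated tensor copies — the memory-management benefit the inference path is built around.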

The repository's documentation highlights significant performance improvements compared to standard attention implementations, with speedups varying depending on the GPU and sequence length. Benchmarks are provided for various NVIDIA GPUs, demonstrating the benefits of using FlashAttention. The project's changelog details the evolution of the implementation, including the introduction of new features and optimizations. The repository also provides integration with the Hugging Face `kernels` library, making it easy to use FlashAttention within the Hugging Face ecosystem. Overall, the "dao-ailab/flash-attention" repository is a valuable resource for anyone working with transformer models, offering a practical and efficient solution for accelerating attention computations.
