flash-attention
by
dao-ailab

Description: Fast and memory-efficient exact attention

View dao-ailab/flash-attention on GitHub ↗

Summary Information

Updated 3 hours ago
Added to GitGenius on February 25th, 2026
Created on May 19th, 2022
Open Issues/Pull Requests: 1,061 (+0)
Number of forks: 2,415
Total Stargazers: 22,359 (+1)
Total Subscribers: 147 (+0)
Detailed Description

The "dao-ailab/flash-attention" repository provides an optimized implementation of the attention mechanism, a core component of modern deep learning models, particularly those based on transformers. The primary goal of this repository is to offer a faster and more memory-efficient alternative to the standard attention implementations found in popular deep learning frameworks like PyTorch. The project offers multiple versions, including FlashAttention, FlashAttention-2, and FlashAttention-3, each building upon the previous to improve performance and expand functionality.

The core functionality revolves around the `flash_attn_func` and `flash_attn_qkvpacked_func` functions, which implement scaled dot-product attention. `flash_attn_func` takes separate query (Q), key (K), and value (V) tensors as input, while the qkvpacked variant takes the same three tensors stacked into a single tensor; both compute the attention output. FlashAttention achieves its performance gains through several key optimizations. It is IO-aware: the kernels are designed to minimize data movement between the GPU's high-bandwidth memory (HBM) and on-chip SRAM, a major bottleneck in standard attention implementations. This is achieved through tiling and recomputation, which allow efficient use of the GPU's memory hierarchy. FlashAttention-2 further improves parallelism and work partitioning, and FlashAttention-3 is specifically optimized for NVIDIA's Hopper GPUs (H100/H800), offering further performance gains.
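The tiling-with-recomputation idea can be illustrated without a GPU. The sketch below is a plain NumPy analogue (not the repository's CUDA kernels): it processes K/V in blocks with an "online" softmax, keeping only running per-row statistics instead of ever materializing the full seqlen × seqlen score matrix, and it matches a naive reference implementation.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference: softmax(q @ k^T / sqrt(d)) @ v, materializing all scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=4):
    """Process K/V in tiles with an online softmax -- a NumPy sketch of
    the tiling idea behind FlashAttention, not the actual kernel."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v)
    row_max = np.full(n, -np.inf)   # running max per query row
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T * scale                      # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=-1))
        correction = np.exp(row_max - new_max)    # rescale earlier partials
        p = np.exp(s - new_max[:, None])
        out = out * correction[:, None] + p @ vb
        row_sum = row_sum * correction + p.sum(axis=-1)
        row_max = new_max
    return out / row_sum[:, None]
```

Because each tile's contribution is rescaled as the running maximum grows, the result is exact, not an approximation — the same property the CUDA implementation relies on.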

The repository's main features include support for various data types (FP16, BF16, and FP8 in FlashAttention-3), multi-query and grouped-query attention (MQA/GQA), causal masking (essential for autoregressive models), sliding window attention (local attention), ALiBi (attention with linear bias), a deterministic backward pass, and a paged KV cache (for efficient memory management). The implementation supports both NVIDIA CUDA and AMD ROCm, with two backends available for ROCm (Composable Kernel and Triton); the Triton backend supports a wider range of AMD GPUs and data types. The repository also provides a `flash_attn_with_kvcache` function, designed for efficient inference, particularly for iterative decoding scenarios where the KV cache can be updated in place.
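What causal masking and sliding-window (local) attention compute can be shown in a few lines of NumPy. This is a hedged reference sketch of the masking semantics, not the library's API: `window` here is a hypothetical left-window size, whereas the library expresses locality through its own parameters.

```python
import numpy as np

def masked_attention(q, k, v, causal=False, window=None):
    """NumPy reference for masked attention: disallowed positions are set
    to -inf before the softmax, so they receive zero weight.
    Single head, shapes (seqlen, headdim); `window` is illustrative only."""
    n = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    i, j = np.indices((n, n))
    allowed = np.ones((n, n), dtype=bool)
    if causal:
        allowed &= j <= i            # each query sees only itself and earlier keys
    if window is not None:
        allowed &= i - j < window    # and only the most recent `window` keys
    scores = np.where(allowed, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `causal=True`, the first query can attend only to the first key, so the first output row equals `v[0]`; with a window of 1 on top of causal masking, every position attends only to itself.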

The purpose of this repository is to accelerate the training and inference of transformer-based models. By providing a faster and more memory-efficient attention mechanism, FlashAttention enables researchers and practitioners to train larger models, process longer sequences, and reduce the computational cost of their projects. The repository is actively maintained and updated, with new features and optimizations being added regularly. The project is open-source and available under a permissive license, encouraging widespread adoption and modification. The repository also provides clear instructions for installation and usage, including example code snippets and links to relevant research papers.
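The iterative-decoding pattern that `flash_attn_with_kvcache` serves — appending each new step's key/value into a preallocated cache in place, then attending over all cached positions — can be sketched in NumPy. The helper below is hypothetical (single head, no batching, naive attention), intended only to show the cache-update pattern, not the library's signature.

```python
import numpy as np

def decode_step(q_new, k_new, v_new, k_cache, v_cache, seqlen):
    """One autoregressive decoding step against a preallocated KV cache,
    updated in place. Hypothetical illustration: vectors of shape (headdim,),
    caches of shape (max_seqlen, headdim)."""
    k_cache[seqlen] = k_new              # write this step's K/V in place
    v_cache[seqlen] = v_new
    keys = k_cache[:seqlen + 1]          # attend over all cached positions
    vals = v_cache[:seqlen + 1]
    s = keys @ q_new / np.sqrt(q_new.shape[0])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ vals
```

Because the cache is written in place rather than reallocated each step, decoding avoids repeated tensor copies — the memory-management benefit the inference path is built around.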

The repository's documentation highlights significant performance improvements compared to standard attention implementations, with speedups varying depending on the GPU and sequence length. Benchmarks are provided for various NVIDIA GPUs, demonstrating the benefits of using FlashAttention. The project's changelog details the evolution of the implementation, including the introduction of new features and optimizations. The repository also provides integration with the Hugging Face `kernels` library, making it easy to use FlashAttention within the Hugging Face ecosystem. Overall, the "dao-ailab/flash-attention" repository is a valuable resource for anyone working with transformer models, offering a practical and efficient solution for accelerating attention computations.
