unsloth
by
unslothai

Description: Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek, Qwen, Llama, Gemma, TTS 2x faster with 70% less VRAM.

View unslothai/unsloth on GitHub ↗

Summary Information

Updated 2 hours ago
Added to GitGenius on May 7th, 2025
Created on November 29th, 2023
Open Issues/Pull Requests: 964 (+0)
Number of forks: 4,374
Total Stargazers: 52,705 (+3)
Total Subscribers: 303 (+0)
Detailed Description

Unsloth is an open-source project that dramatically accelerates fine-tuning and reinforcement learning for large language models (LLMs) such as Llama, Mistral, Gemma, Qwen, and other transformer-based models. It achieves this through a combination of techniques focused on kernel fusion, memory optimization, and efficient data movement, training models roughly 2x faster while using about 70% less VRAM, without approximating the underlying math. The core philosophy is to "unsloth" existing models, making them train faster on existing hardware, rather than focusing solely on model architecture changes.

At the heart of Unsloth lie its custom GPU kernels, written in OpenAI's Triton language, which fuse multiple operations within the transformer block into single, optimized routines. Traditional training loops launch many small kernels, incurring significant overhead; Unsloth's kernel fusion minimizes these launches, reducing round trips between the CPU and GPU and maximizing GPU utilization. It targets key operations such as RoPE embeddings, RMS normalization, the MLP (multi-layer perceptron), and the cross-entropy loss, and pairs the fused forward passes with hand-derived backward passes that avoid much of the overhead of generic autograd. As in most high-performance GPU code, the kernels use a tiling strategy that breaks large matrices into smaller blocks, enabling better use of fast on-chip memory and greater parallelism.
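The tiling idea can be shown in a toy, pure-Python form (this is an illustration of the general technique, not Unsloth's actual Triton code): the matrix product is computed block by block so that each small tile of the inputs is reused across an inner loop, which on a GPU keeps the working set in fast on-chip memory.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply two square matrices (lists of lists) tile by tile."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    # Walk over (tile x tile) blocks of the output, then accumulate
    # partial products so each block of A and B is reused many times
    # before moving on -- the essence of cache/shared-memory blocking.
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        for k in range(k0, min(k0 + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

The output is identical to a naive triple loop; only the traversal order changes, which is also why fusion and tiling can speed training up without altering results.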

Memory efficiency is another crucial aspect of Unsloth's gains. The project centers on parameter-efficient fine-tuning: base weights can be loaded in 4-bit precision (QLoRA-style, via bitsandbytes) and frozen, while only small low-rank LoRA adapter matrices are trained. Unsloth also ships an optimized gradient-checkpointing scheme that offloads activations to system RAM, trading a small amount of recomputation and transfer time for a large reduction in GPU memory. Together these techniques shrink the memory footprint enough to fine-tune multi-billion-parameter models, at longer sequence lengths, on a single consumer GPU.
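A large share of the advertised 70% VRAM reduction comes from training only low-rank adapters instead of full weight matrices. A back-of-envelope sketch (the hidden size and rank below are illustrative, not official Unsloth figures) shows why:

```python
# LoRA replaces the update to a frozen (d_out x d_in) weight with two
# low-rank factors: B (d_out x rank) and A (rank x d_in).
def lora_trainable_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

d = 4096          # hidden size typical of a ~7B model (illustrative)
full = d * d      # trainable params if one projection were fully tuned
lora = lora_trainable_params(d, d, rank=16)
print(full, lora, full / lora)  # 16777216 131072 128.0
```

At rank 16 the adapter holds 128x fewer trainable parameters than the full matrix, and the frozen base weights can additionally sit in 4-bit precision.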

Unsloth isn't a replacement for the existing Hugging Face training stack (transformers, PEFT, TRL). Instead, it's designed to be *integrated* with it: its `FastLanguageModel` class is a near drop-in for the usual `from_pretrained` workflow and returns a model that plugs directly into trainers such as TRL's `SFTTrainer`. This makes adoption relatively straightforward, as it doesn't require significant changes to the overall training pipeline. Fine-tuned models can then be merged and exported to formats such as GGUF (for llama.cpp and Ollama) or saved as standard checkpoints for serving with engines like vLLM.
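In practice, a fine-tuning run is typically set up through Unsloth's Python API. A hedged sketch of the usual pattern follows (it requires a CUDA GPU and the `unsloth` package; the model name and argument values are illustrative, so check the project README for current options):

```python
from unsloth import FastLanguageModel

# Load a (pre-quantized) base model with 4-bit weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # illustrative checkpoint
    max_seq_length=2048,
    load_in_4bit=True,          # QLoRA-style frozen 4-bit base weights
)

# Attach LoRA adapters; only these low-rank matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)
# The returned model is then passed to a standard Hugging Face trainer
# (e.g. TRL's SFTTrainer) like any other transformers model.
```

Because the result behaves like an ordinary `transformers` model, the rest of the pipeline (datasets, trainer, logging) stays unchanged.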

The repository includes benchmarking results demonstrating significant improvements over standard Hugging Face implementations, including baselines using Flash Attention 2. These benchmarks show roughly 2x faster training with around 70% less VRAM across various models and hardware configurations, from consumer GPUs such as the Tesla T4 and RTX series up to NVIDIA A100 and H100. The project also provides documentation, example notebooks, and scripts for installing and running Unsloth. It's actively maintained and developed by a growing community, with regular updates and new model support being released. Ultimately, Unsloth aims to democratize fast LLM fine-tuning by making it easier and more affordable to adapt powerful models.
