llm-compressor
by
vllm-project

Description: Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM

View vllm-project/llm-compressor on GitHub ↗

Summary Information

Updated 1 hour ago
Added to GitGenius on February 4th, 2025
Created on June 20th, 2024
Open Issues/Pull Requests: 119 (+0)
Number of forks: 404
Total Stargazers: 2,774 (+1)
Total Subscribers: 28 (+0)
Detailed Description

The GitHub repository titled 'llm-compressor' is part of the vLLM project, focused on developing efficient techniques for compressing large language models (LLMs). Large language models have revolutionized various domains by providing state-of-the-art natural language processing capabilities. However, their extensive size often leads to significant computational and storage requirements, making them less accessible in resource-constrained environments such as mobile devices or edge computing platforms.

The primary goal of the 'llm-compressor' repository is to address these challenges by offering algorithms and tools designed to reduce model size while preserving performance. The project leverages various compression techniques, including quantization, pruning, and knowledge distillation. Quantization reduces the numerical precision of the model's weights and activations, which can significantly decrease memory usage without a substantial loss in accuracy. Pruning removes unnecessary parameters from the model, making it leaner and faster by eliminating weights that contribute minimally to the output.
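To make these two ideas concrete, here is a minimal NumPy sketch of symmetric int8 quantization and magnitude pruning. This is an illustration of the general techniques only, not llm-compressor's actual implementation, which uses calibration-aware algorithms such as GPTQ and SparseGPT:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 with a single symmetric scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)

q, s = quantize_int8(w)        # int8 storage: 4x smaller than float32
w_hat = dequantize(q, s)       # rounding error is bounded by scale / 2
w_sparse = magnitude_prune(w, 0.5)  # half the weights become exact zeros
```

Real one-shot quantizers refine this picture by choosing scales per channel or per group and by calibrating on sample data, but the storage saving comes from the same precision reduction shown here.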

Knowledge distillation is another technique employed by the repository, where a smaller 'student' model learns to replicate the behavior of a larger 'teacher' model. This process allows for capturing the essential knowledge from a large model in a more compact form. By combining these techniques, the vLLM project aims to make LLMs more accessible and deployable across diverse platforms with limited resources.
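The core of distillation is a loss that pushes the student's output distribution toward the teacher's temperature-softened distribution. The sketch below shows that loss in NumPy as a general illustration; it is not llm-compressor's training loop, and the temperature value is an arbitrary example:

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable softmax with temperature scaling."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return float(np.mean(kl) * temperature ** 2)

teacher = np.array([[2.0, 1.0, 0.1]])
matched = distillation_loss(teacher.copy(), teacher)         # zero loss
mismatched = distillation_loss(np.array([[0.1, 1.0, 2.0]]), teacher)
```

A higher temperature flattens both distributions, exposing the teacher's relative preferences among wrong answers; that "dark knowledge" is what lets a small student approach the teacher's behavior.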

The repository includes detailed documentation on implementing these compression strategies and provides pre-trained compressed models. It serves both researchers and practitioners interested in optimizing LLM performance for applications that require efficient model deployment. Additionally, the project is actively maintained, with updates and improvements to ensure compatibility with emerging architectures and use cases.

Overall, 'llm-compressor' represents a significant step forward in making advanced AI models more versatile and practical for everyday applications. By focusing on compression techniques, the vLLM project helps democratize access to LLM technology, enabling broader adoption across industries that are increasingly reliant on artificial intelligence solutions.
