vllm
by
vllm-project

Description: A high-throughput and memory-efficient inference and serving engine for LLMs


Summary Information

Updated 48 minutes ago
Added to GitGenius on June 5th, 2024
Created on February 9th, 2023
Open Issues/Pull Requests: 3,453 (+2)
Number of forks: 13,664
Total Stargazers: 71,093 (+4)
Total Subscribers: 493 (+0)
Detailed Description

vLLM is a fast and easy-to-use library for LLM inference and serving, built around PagedAttention, which delivers dramatically higher throughput and lower memory consumption than traditional serving approaches. Originally developed at UC Berkeley (and used to serve LMSYS's Vicuna demo), vLLM's core innovation is PagedAttention, which partitions each sequence's attention key-value cache into fixed-size blocks that need not be contiguous in GPU memory, much like virtual-memory paging in an operating system. This avoids the fragmentation and up-front over-reservation of conventional KV-cache allocation, significantly reducing the memory footprint required for large language models. As a result, vLLM is well suited to serving models on modest hardware, such as a single GPU, where traditional methods often struggle due to memory limitations.
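The paging idea can be illustrated with a small toy sketch. This is not vLLM's implementation (vLLM does this in CUDA with real KV tensors); it only shows the bookkeeping: a shared pool of fixed-size blocks, and a per-sequence block table that allocates a new block only when the previous one fills up.

```python
# Toy sketch of PagedAttention-style KV-cache paging (NOT vLLM's actual code).
# Memory is allocated block-by-block on demand instead of being reserved up
# front for the maximum possible sequence length.

BLOCK_SIZE = 4  # tokens per KV block (vLLM uses larger blocks; 4 keeps the demo small)

class BlockPool:
    """A free list of physical KV-cache blocks shared by all sequences."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""
    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.alloc())
        self.num_tokens += 1

pool = BlockPool(num_blocks=8)
seq = Sequence(pool)
for _ in range(10):  # generate 10 tokens
    seq.append_token()
print(seq.num_tokens, len(seq.block_table), len(pool.free))
# 10 tokens occupy ceil(10/4) = 3 blocks; 5 of the 8 blocks remain free.
```

Because blocks are small and shared, short sequences waste at most one partially filled block, and finished sequences return their blocks to the pool for reuse.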

Key features of vLLM include:

- **Fast Inference:** PagedAttention enables significantly faster inference, often outperforming other popular serving frameworks such as stock Hugging Face Transformers.
- **Low Memory Footprint:** By managing attention keys and values in paged blocks, vLLM minimizes wasted memory, letting you run larger models on less powerful hardware.
- **Easy to Use:** The library provides a simple and intuitive API, making it straightforward to integrate into existing projects.
- **Support for Multiple Models:** vLLM supports a growing list of popular LLMs, including Llama 2, Mistral, Gemma, and others, with ongoing efforts to expand model support.
- **Streaming Support:** While optimized for throughput, vLLM also offers streaming, so responses can be received incrementally as they are generated.
- **Server Mode:** vLLM includes an OpenAI-compatible API server, enabling you to deploy your LLM as a service that accepts requests and serves predictions.
- **Offline Inference:** A Python API is also available for direct, in-process interaction with the model, without running a server.
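As a concrete illustration of server mode, a minimal quickstart might look like the following CLI fragment (the model name is only an example, and flags change between releases; consult the vLLM documentation for current options):

```shell
# Install vLLM (requires a CUDA-capable GPU and a matching PyTorch build).
pip install vllm

# Launch the OpenAI-compatible API server with a model of your choice
# (mistralai/Mistral-7B-Instruct-v0.2 here is only an example).
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2

# Query it with the standard OpenAI completions schema.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2",
         "prompt": "San Francisco is a",
         "max_tokens": 32}'
```

Because the server speaks the OpenAI API, existing OpenAI client libraries can be pointed at it by changing only the base URL.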

Under the hood, vLLM uses custom CUDA kernels optimized for PagedAttention. It is built on PyTorch and supports techniques such as tensor parallelism and pipeline parallelism to distribute large models across multiple GPUs. The library is actively developed and maintained, with frequent releases; the project's GitHub repository contains comprehensive documentation, examples, and an active community. It is designed as a practical tool for researchers, developers, and anyone interested in experimenting with and deploying large language models efficiently, and its success is largely attributable to its focus on performance and accessibility, which make advanced LLM inference attainable for a wider audience. Because vLLM is a rapidly evolving project, users should consult the documentation for the most up-to-date information and instructions.
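The tensor-parallelism idea mentioned above can be sketched in miniature. This toy example (pure Python, not vLLM's implementation) splits a linear layer's output rows across two hypothetical "workers"; each worker computes its slice of the output independently, and the slices are concatenated:

```python
# Toy illustration of tensor parallelism (NOT vLLM's implementation):
# a linear layer's weight matrix is partitioned across workers along the
# output dimension; each worker computes a slice, then slices are joined.

def matvec(rows, x):
    """Multiply a list of weight rows by input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

# Full weight matrix W (4 outputs x 2 inputs) and an input vector x.
W = [[1, 2],
     [3, 4],
     [5, 6],
     [7, 8]]
x = [10, 1]

# Worker 0 owns output rows 0-1; worker 1 owns rows 2-3.
worker0 = matvec(W[:2], x)   # partial output from worker 0
worker1 = matvec(W[2:], x)   # partial output from worker 1
y = worker0 + worker1        # concatenate the partial outputs
print(y)  # [12, 34, 56, 78]
```

In a real deployment each worker lives on its own GPU and the concatenation is a collective communication step, but the partitioning logic is the same: no single device ever holds the full weight matrix.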
