Description: A high-throughput and memory-efficient inference and serving engine for LLMs
The GitHub repository https://github.com/neuralmagic/vllm, maintained by Neural Magic, is a fork of vLLM, a high-throughput and memory-efficient inference and serving engine for large language models (LLMs). The core of vLLM is PagedAttention, a memory-management technique that stores the attention key-value (KV) cache in fixed-size blocks rather than one contiguous buffer per sequence. This sharply reduces memory fragmentation and waste, which in turn lets many more concurrent requests fit on a single GPU.
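The block-based idea can be sketched in a few lines of plain Python. This is an illustrative toy, not vLLM's actual internals: the `BlockTable` class and the allocator are hypothetical, though the block size of 16 tokens matches vLLM's documented default.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default is also 16)

class BlockTable:
    """Maps a sequence's logical token positions to physical cache blocks."""

    def __init__(self, allocator):
        self.allocator = allocator  # shared pool of free physical block ids
        self.blocks = []            # one physical block id per logical block

    def append_token(self, position):
        # Allocate a new physical block only when a block boundary is crossed,
        # so a sequence never reserves more cache than it actually uses.
        if position % BLOCK_SIZE == 0:
            self.blocks.append(self.allocator.pop())

    def physical_slot(self, position):
        # Translate a logical token position to (physical block, offset).
        return self.blocks[position // BLOCK_SIZE], position % BLOCK_SIZE


free_blocks = list(range(100))  # toy pool of physical blocks
table = BlockTable(free_blocks)
for pos in range(40):           # a 40-token sequence
    table.append_token(pos)

print(len(table.blocks))        # → 3 (40 tokens need ceil(40/16) = 3 blocks)
print(table.physical_slot(37))  # → (97, 5): third allocated block, offset 5
```

Because sequences share one pool of fixed-size blocks, memory waste is bounded by at most one partially filled block per sequence, instead of a whole pre-reserved maximum-length buffer.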
Neural Magic's fork (distributed as nm-vllm) tracks the upstream vLLM project and layers on the company's model-compression work, including inference support for sparse and quantized weights. The goal is to let large models be served with a smaller memory footprint and at lower cost, while remaining compatible with the upstream engine's serving interfaces.
The codebase supports a wide range of decoder-only model architectures and loads checkpoints directly from the Hugging Face Hub, so users can serve existing models without a conversion step. Inference is exposed through two main interfaces: an offline Python API for batch generation, and an OpenAI-compatible HTTP server for online serving, which lets existing OpenAI client code talk to a self-hosted model by changing only the base URL. Comprehensive documentation and examples walk through both paths.
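Because the server mimics the OpenAI chat-completions API, a request body is just the standard JSON shape. The snippet below builds one in plain Python without any network call; the model name and prompt are placeholders, and the helper function is hypothetical, not part of vLLM.

```python
import json

def chat_completion_request(model, user_message, max_tokens=64, temperature=0.7):
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

body = chat_completion_request("my-model", "What is PagedAttention?")
print(json.dumps(body, indent=2))
```

In practice this dictionary would be POSTed to the running server's `/v1/chat/completions` endpoint, exactly as it would be sent to the hosted OpenAI API.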
One of the standout features of vLLM is its serving-throughput optimization. Beyond PagedAttention, the engine uses continuous batching: rather than waiting for an entire batch of requests to finish, the scheduler adds newly arrived requests to the running batch at every generation step and evicts finished ones, keeping the GPU saturated. The project also supports quantized inference (for example, GPTQ and AWQ checkpoints) and ships optimized CUDA kernels for the attention and sampling stages.
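The scheduling idea can be illustrated with a small simulation. This is a deliberately simplified toy, and the function and its names are hypothetical rather than vLLM's real scheduler; it only shows why topping up the batch every step beats waiting for the whole batch to drain.

```python
from collections import deque

def simulate(arrivals, gen_len, max_batch=2):
    """Toy continuous-batching loop.

    arrivals: {step: [request ids arriving at that step]}
    gen_len:  {request id: number of tokens it will generate}
    Returns the decode step at which each request finishes.
    """
    waiting, running, done = deque(), {}, {}
    step = 0
    while arrivals or waiting or running:
        # Newly arrived requests join the waiting queue.
        for r in arrivals.pop(step, []):
            waiting.append(r)
        # Continuous batching: refill the running batch at *every* step,
        # reusing slots freed by requests that finished earlier.
        while waiting and len(running) < max_batch:
            r = waiting.popleft()
            running[r] = gen_len[r]
        # One decode step: each running request emits one token.
        for r in list(running):
            running[r] -= 1
            if running[r] == 0:
                done[r] = step
                del running[r]
        step += 1
    return done

print(simulate({0: ["a", "b"], 1: ["c"]}, {"a": 3, "b": 1, "c": 2}))
# → {'b': 0, 'a': 2, 'c': 2}: "b" finishes at step 0, so "c" slots into
#   the freed batch position at step 1 instead of waiting for "a".
```

With static batching, "c" could not start until both "a" and "b" completed; continuous batching fills the freed slot immediately, which is where the throughput gains on real serving workloads come from.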
In summary, the neuralmagic/vllm repository packages a state-of-the-art LLM inference and serving engine together with Neural Magic's model-compression work. Its combination of high throughput, efficient KV-cache memory use, and an OpenAI-compatible interface makes it a practical resource for anyone deploying large language models in production.