Description: A framework for efficient model inference with omni-modality models
The vllm-omni repository, a fork of the original vLLM project, focuses on improving the performance and efficiency of large language model (LLM) inference across diverse hardware platforms. While vLLM itself is known for its optimized CUDA kernels and efficient memory management, vllm-omni extends this foundation with support for a wider range of hardware: CPUs, GPUs from multiple vendors (potentially including AMD and Intel), and possibly even specialized accelerators. The primary goal is to make LLM inference accessible and performant across a broader spectrum of computing environments.
One key area of focus for vllm-omni is likely the development of hardware-agnostic kernels and optimizations: abstracting away the specifics of any particular GPU architecture and exposing a unified interface for LLM operations, so the same code can run on different hardware without significant modification or recompilation. Such portability is typically achieved through cross-platform libraries (e.g., OpenCL, SYCL, or vendor-specific SDKs) and custom kernels that can be compiled for multiple target architectures. The repository likely includes implementations of core LLM operations, such as attention mechanisms, matrix multiplications, and activation functions, optimized for different hardware backends.
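The repository's actual kernel interface is not shown here, but the dispatch idea described above can be sketched in a few lines. All names in this example are hypothetical and only illustrate selecting a backend-specific implementation at runtime, with a CPU reference implementation as the fallback:

```python
# Hypothetical sketch of a hardware-agnostic kernel dispatch layer.
# None of these names come from vllm-omni; they illustrate the pattern only.
import math
from typing import Callable, Dict, Tuple

# Registry mapping (operation, backend) -> implementation.
_KERNELS: Dict[Tuple[str, str], Callable] = {}

def register_kernel(op: str, backend: str):
    """Decorator that registers an implementation of `op` for `backend`."""
    def wrap(fn: Callable) -> Callable:
        _KERNELS[(op, backend)] = fn
        return fn
    return wrap

def dispatch(op: str, backend: str, *args):
    """Run `op` on `backend`, falling back to the CPU reference kernel."""
    fn = _KERNELS.get((op, backend)) or _KERNELS[(op, "cpu")]
    return fn(*args)

@register_kernel("softmax", "cpu")
def softmax_cpu(xs):
    # Numerically stable reference implementation.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# A CUDA or SYCL variant would be registered the same way; callers
# never need to know which backend serviced the call.
result = dispatch("softmax", "cuda", [1.0, 2.0, 3.0])
```

Because no "cuda" softmax is registered in this sketch, the call transparently falls back to the CPU kernel, which is exactly the behavior that lets one code path span heterogeneous hardware.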
Another crucial aspect of vllm-omni is optimized memory management and data transfer. LLMs are memory-intensive, so efficient allocation and movement of data between the CPU, GPU, and other accelerators are critical for performance. The repository probably includes strategies for minimizing memory footprint, such as quantization (reducing the precision of model weights), weight sharing, and efficient caching. It likely also tackles data transfer between hardware components, reducing communication overhead through techniques such as asynchronous transfers, overlapping computation with communication, and hardware-specific data-movement features.
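To make the quantization point concrete, here is a generic sketch of symmetric int8 weight quantization, one of the footprint-reduction techniques mentioned above. This is a textbook example, not vllm-omni's actual quantization code:

```python
# Illustrative symmetric int8 quantization with a per-tensor scale.
# Generic example; not taken from the vllm-omni codebase.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 plus one float scale factor."""
    scale = float(np.abs(w).max()) / 127.0   # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32, at the cost of a small
# reconstruction error bounded by half a quantization step.
assert q.nbytes == w.nbytes // 4
max_err = np.abs(w - w_hat).max()
```

The same idea extends to per-channel scales and lower bit widths; the trade-off is always memory (and bandwidth) saved versus the rounding error introduced.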
The repository's architecture likely follows a modular design, allowing new hardware backends and optimization techniques to be integrated easily. This modularity is essential for supporting a diverse hardware landscape and for enabling developers to contribute new features. The project probably also includes a comprehensive testing framework that validates the correctness and performance of each optimization across hardware configurations and guards against regressions.
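A modular backend design of the kind described above usually centers on a small abstract interface that every backend implements, plus a shared conformance test run against each one. The class and method names below are hypothetical, a minimal sketch of the pattern rather than vllm-omni's real API:

```python
# Hedged sketch of a pluggable backend interface; all names hypothetical.
from abc import ABC, abstractmethod

class Platform(ABC):
    """Minimal contract every hardware backend must satisfy."""
    name: str

    @abstractmethod
    def allocate(self, nbytes: int) -> bytearray: ...

    @abstractmethod
    def matmul(self, a, b): ...

class CpuPlatform(Platform):
    name = "cpu"

    def allocate(self, nbytes: int) -> bytearray:
        return bytearray(nbytes)

    def matmul(self, a, b):
        # Naive reference implementation, useful as ground truth in tests.
        rows, inner, cols = len(a), len(b), len(b[0])
        return [[sum(a[i][k] * b[k][j] for k in range(inner))
                 for j in range(cols)] for i in range(rows)]

def check_platform(p: Platform):
    """Conformance test that every registered backend must pass."""
    assert p.matmul([[1, 2]], [[3], [4]]) == [[11]]
    assert len(p.allocate(16)) == 16

check_platform(CpuPlatform())
```

Running the same `check_platform` suite over every backend is how cross-hardware regressions are typically caught: a new accelerator backend only needs to subclass `Platform` and pass the shared tests.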
In essence, vllm-omni aims to democratize LLM inference by making it accessible and performant on a wide range of hardware. By providing hardware-agnostic kernels, optimized memory management, and a modular architecture, the project lets researchers and developers deploy LLMs in environments ranging from cloud servers to edge devices, regardless of the underlying infrastructure. Its success hinges on effectively abstracting away hardware complexities behind a unified, efficient inference interface, ultimately accelerating the adoption of LLMs across diverse applications.