tensorrt-llm by nvidia

Description: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

View nvidia/tensorrt-llm on GitHub ↗

Summary Information

Updated 2 hours ago
Added to GitGenius on October 3rd, 2024
Created on August 16th, 2023
Open Issues/Pull Requests: 1,085 (+3)
Number of forks: 2,124
Total Stargazers: 12,934 (+0)
Total Subscribers: 120 (+0)
Detailed Description

The NVIDIA TensorRT-LLM repository is designed to facilitate the deployment and optimization of large language models using NVIDIA's TensorRT inference engine. It provides tools, frameworks, and best practices for efficiently running LLMs on GPU hardware, leveraging TensorRT's high throughput and low latency. The primary focus of the project is to enhance performance by optimizing model conversion and inference execution across NVIDIA GPUs.

TensorRT-LLM aims to streamline the path from a trained language model to an optimized, inference-ready model. This involves converting checkpoints from popular deep learning frameworks such as PyTorch into TensorRT-LLM's checkpoint format and then building an optimized TensorRT engine. The repository includes scripts and utilities for automating this conversion, enabling developers to deploy LLMs in production environments without extensive manual intervention.
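As a hedged sketch of that two-step workflow: exact script paths, flags, and directory names below are illustrative and vary by model and TensorRT-LLM release; the per-model `convert_checkpoint.py` scripts live under the repository's `examples/` directories, and `trtllm-build` is the engine-building CLI that ships with the package.

```shell
# Step 1 (assumed paths): convert a Hugging Face checkpoint into
# TensorRT-LLM's checkpoint format using the model-specific script.
python examples/llama/convert_checkpoint.py \
    --model_dir ./llama-7b-hf \
    --output_dir ./trtllm_ckpt \
    --dtype float16

# Step 2: build an optimized TensorRT engine from the converted checkpoint.
trtllm-build \
    --checkpoint_dir ./trtllm_ckpt \
    --output_dir ./trtllm_engine
```

The resulting engine directory is what the Python and C++ runtimes load at inference time.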

The repository supports a range of techniques essential for deploying large-scale language models, including quantization, pruning, and layer fusion. These techniques reduce model size and improve inference speed with little loss of accuracy. Quantization lowers the precision of weights and activations, which reduces memory usage and speeds up computation. Pruning removes redundant weights or neurons from the network, further reducing the computational load.
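To make the quantization idea concrete, here is a minimal, illustrative sketch of symmetric per-tensor int8 quantization in plain Python. This is a toy analogue only; TensorRT-LLM's actual quantization paths (e.g., INT8, FP8, weight-only schemes) use calibrated scales and fused GPU kernels.

```python
def quantize_int8(weights):
    """Map float weights to int8 values using one per-tensor scale."""
    # The scale maps the largest-magnitude weight to the int8 limit 127.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.003, 0.89]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each restored value is within half a scale step of the original,
# while the stored values fit in 8 bits instead of 32.
```

The memory saving here is 4x per weight (int8 vs. float32); the accuracy cost is the rounding error bounded by half the scale.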

Layer fusion is another critical optimization that combines multiple layers into a single operation, reducing latency by minimizing the overhead associated with executing each layer independently. By applying these optimizations, TensorRT-LLM enables efficient execution of complex models on hardware resources, maximizing GPU utilization and ensuring scalable performance.
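The algebraic intuition behind fusion can be shown with a toy example: two consecutive linear layers can be collapsed into one by precomputing the product of their weight matrices, so inference performs one matrix-vector product instead of two. This is only an analogue; TensorRT's real fusions combine operations such as matmul, bias, and activation into single GPU kernels rather than multiplying weights offline.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matvec(M, x):
    """Apply a matrix to a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

W1 = [[1.0, 2.0], [0.0, 1.0]]   # first layer
W2 = [[2.0, 0.0], [1.0, 1.0]]   # second layer

W_fused = matmul(W2, W1)        # fuse once, ahead of time

x = [3.0, 4.0]
two_step = matvec(W2, matvec(W1, x))  # two launches' worth of work
one_step = matvec(W_fused, x)         # one launch: same result
```

Both paths produce identical outputs, but the fused path pays the per-operation overhead only once, which is the latency win the paragraph above describes.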

Additionally, TensorRT-LLM emphasizes compatibility and ease of use across diverse NVIDIA platforms. It supports multiple generations of GPUs, including recent architectures such as Ampere and Hopper, which are designed for high-performance AI workloads. This ensures that users can leverage the full potential of their hardware infrastructure while maintaining flexibility in deployment strategies.

The repository is community-driven and open-source, encouraging contributions from developers and researchers worldwide. By fostering collaboration, NVIDIA aims to accelerate advancements in LLM optimization techniques and expand the ecosystem of tools available for deploying state-of-the-art models. The project also provides comprehensive documentation and tutorials to assist users in understanding how to effectively utilize TensorRT-LLM for their specific use cases.

In summary, NVIDIA's TensorRT-LLM repository is a vital resource for developers seeking to optimize and deploy large language models on NVIDIA GPUs. Through model conversion tools, optimization techniques, and robust support across various hardware platforms, it enables efficient execution of LLMs in production environments. As the demand for advanced AI applications grows, resources like TensorRT-LLM will play an essential role in bridging the gap between research innovations and real-world implementations.

