Description: A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations
The `kvcache-ai/ktransformers` repository introduces a high-performance KV cache library designed to significantly optimize Large Language Model (LLM) inference. At its core, ktransformers addresses the critical bottleneck of Key-Value (KV) cache management, which often consumes a substantial portion of GPU memory and limits the throughput and latency of LLM serving. Traditional KV cache implementations can lead to inefficient memory utilization, especially when handling diverse request patterns and long sequences: the cache grows linearly with sequence length and with the number of concurrent requests, while the attention computation over it grows quadratically.
ktransformers tackles this challenge by implementing custom CUDA kernels for highly efficient KV cache management. This approach is inspired by the PagedAttention mechanism, which improved memory efficiency by organizing the KV cache into fixed-size blocks, similar to virtual memory paging. ktransformers extends this concept with further optimizations, notably token-level preemption: when memory pressure is high, the system evicts less useful or older tokens from the KV cache, making space for new, more critical tokens. This dynamic and adaptive memory management is crucial for maintaining high performance and accommodating longer contexts without running out of memory.
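For intuition, the paging and preemption ideas above can be sketched in plain Python. This is a minimal illustrative model, not the ktransformers implementation: the class name, block size, and eviction policy (drop the oldest block of the longest sequence) are all assumptions made for the example.

```python
from collections import deque

BLOCK_SIZE = 16  # tokens per cache block, as in paged attention (illustrative)

class PagedKVCache:
    """Toy paged KV cache with token-level preemption (hypothetical API)."""

    def __init__(self, num_blocks: int):
        # Pool of free physical block ids, analogous to free page frames.
        self.free_blocks = deque(range(num_blocks))
        # Per-sequence block table: logical block index -> physical block id.
        self.block_tables: dict[int, list[int]] = {}
        # Tokens currently cached per sequence, oldest first.
        self.tokens: dict[int, list[int]] = {}

    def append_token(self, seq_id: int, token: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        toks = self.tokens.setdefault(seq_id, [])
        if len(toks) % BLOCK_SIZE == 0:  # current block full: need a new one
            if not self.free_blocks:
                self._preempt()          # token-level preemption under pressure
            table.append(self.free_blocks.popleft())
        toks.append(token)

    def _preempt(self) -> None:
        # Evict the oldest block of the longest sequence to free memory.
        victim = max(self.tokens, key=lambda s: len(self.tokens[s]))
        freed = self.block_tables[victim].pop(0)
        del self.tokens[victim][:BLOCK_SIZE]
        self.free_blocks.append(freed)
```

With two physical blocks, a sequence can hold 32 tokens; appending a 33rd triggers preemption, which frees the oldest 16 tokens rather than failing with an out-of-memory error.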
The library's key features revolve around maximizing memory efficiency and inference throughput. It employs dynamic memory allocation for the KV cache, ensuring that memory is utilized precisely as needed, rather than pre-allocating large, potentially wasteful buffers. This dynamic approach, combined with the custom CUDA kernels, leads to substantial reductions in memory footprint compared to standard implementations. Consequently, users can serve more concurrent requests on the same hardware, achieve higher overall throughput, and experience lower inference latency, which are vital metrics for production LLM deployments.
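To see why precise allocation matters, a back-of-the-envelope sizing helps. The formula below is the standard per-token KV footprint (keys plus values at every layer); the model shape is an illustrative Llama-2-7B-like configuration, not a value taken from ktransformers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """KV cache size in bytes: 2x covers the key and value tensors per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

# Illustrative 7B-class shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)      # 524,288 B ≈ 0.5 MiB
full_ctx = kv_cache_bytes(32, 32, 128, seq_len=4096)    # ≈ 2 GiB per sequence
```

At roughly 0.5 MiB per token, a single 4096-token sequence already occupies about 2 GiB, so pre-allocating worst-case buffers for every concurrent request wastes memory quickly; block-granular allocation reclaims that headroom for additional requests.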
Technically, ktransformers is built on a foundation of highly optimized low-level CUDA kernels. These kernels are meticulously engineered to handle the complex operations of KV cache storage, retrieval, and eviction with minimal overhead. The library supports various popular LLM architectures, making it a versatile solution for a wide range of models. Its design aims for seamless integration, allowing developers to leverage its performance benefits without extensive modifications to their existing LLM serving pipelines. By abstracting away the complexities of GPU memory management and offering a robust, performant KV cache solution, ktransformers empowers developers to build more scalable and cost-effective LLM inference systems.
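The retrieval side of such kernels amounts to a gather: each logical token position is resolved through the block table into physical block storage. The NumPy sketch below shows the addressing scheme on the CPU; real kernels perform this gather on the GPU, and all names and shapes here are illustrative assumptions.

```python
import numpy as np

BLOCK_SIZE = 4            # tokens per block (illustrative)
NUM_BLOCKS, HEAD_DIM = 8, 2

# Physical key storage: [num_blocks, block_size, head_dim].
key_blocks = np.arange(NUM_BLOCKS * BLOCK_SIZE * HEAD_DIM, dtype=np.float32)
key_blocks = key_blocks.reshape(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM)

def gather_keys(block_table: list[int], num_tokens: int) -> np.ndarray:
    """Resolve logical positions to (physical block, offset) and gather."""
    out = np.empty((num_tokens, HEAD_DIM), dtype=key_blocks.dtype)
    for pos in range(num_tokens):
        block = block_table[pos // BLOCK_SIZE]   # which physical block
        out[pos] = key_blocks[block, pos % BLOCK_SIZE]  # offset within it
    return out

# A sequence whose 6 tokens live in non-contiguous physical blocks 5 and 2.
keys = gather_keys(block_table=[5, 2], num_tokens=6)
```

Because attention only ever sees the gathered view, sequences can occupy arbitrary, non-contiguous physical blocks, which is what makes block-level eviction and reuse cheap.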
In summary, `kvcache-ai/ktransformers` represents a significant advancement in LLM inference optimization. By providing a highly efficient, custom CUDA-based KV cache library with features like token-level preemption and dynamic memory allocation, it directly addresses the memory and performance limitations inherent in serving large language models. The result is a powerful tool that enables higher throughput, lower latency, and better memory utilization, ultimately making LLM deployment more efficient and accessible for a broader range of applications.