CosyVoice by FunAudioLLM

Description: Multilingual large voice generation model with full-stack inference, training, and deployment capabilities.


Summary Information

Updated 12 minutes ago

Added to GitGenius on December 29th, 2025

Created on July 3rd, 2024

Open Issues & Pull Requests: 767 (+0)

Number of forks: 2,541

Total Stargazers: 22,072 (+4)

Total Subscribers: 133 (+0)

Issue Activity (beta)

Open issues: 742

New in 7 days: 0

Closed in 7 days: 6

Avg open age: 458 days

Stale 30+ days: 712

Stale 90+ days: 701

Recent activity

Opened in 7 days: 0

Closed in 7 days: 5

Comments in 7 days: 8

Events in 7 days: 17

Top labels

stale (847)

Most active issues this week

#1481 Generated audio is gibberish - 4 events / 2 comments
#1740 CosyVoice 2.0 exhibits audio stuttering during zero-shot streaming inference - 4 events / 2 comments
#1397 Cosyvoice2 SFT validation-set loss only rises and never falls - 2 events / 1 comment
#1400 Passing zero_shot_spk_id to inference_instruct2 natural-language inference makes the generated speech garbled - 2 events / 1 comment
#1446 With transformers==4.53.1, the generated speech is garbled - 2 events / 1 comment


Repository Insights (GitGenius)

Median issue/PR response: 11.4 hours

Mean response time: 2.5 days

90th percentile: 5.9 days

Tracked items: 1,481
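
The gap between the median (11.4 hours) and the mean (2.5 days) suggests a right-skewed distribution, where a few very slow responses pull the mean up. The sketch below illustrates that effect on made-up numbers. The `response_hours` list and the nearest-rank `percentile` helper are assumptions for illustration only; they are not GitGenius data, and GitGenius may compute its percentiles differently.

```python
import math
import statistics

# Hypothetical response times in hours (illustrative only, not GitGenius data).
response_hours = [1, 2, 3, 5, 8, 11, 12, 14, 20, 36, 72, 140, 400]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


median_h = statistics.median(response_hours)
mean_h = statistics.mean(response_hours)
p90_h = percentile(response_hours, 90)

# A handful of slow responses raises the mean well above the median.
print(f"median={median_h} h, mean={mean_h:.1f} h, p90={p90_h} h")
```

In a distribution like this, the median describes the typical response, while the mean and the 90th percentile are dominated by the slow tail.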

Most active contributors

aluminumbox - 1,192 events, 932 issues
JohnHerry - 230 events, 107 issues
ScottishFold007 - 59 events, 41 issues
hjj-lmx - 43 events, 30 issues
wang-TJ-20 - 41 events, 28 issues

Related by overlapping contributors

Detailed Description

CosyVoice is a multilingual text-to-speech system built on large language models, with full-stack support for inference, training, and deployment. The repository is written in Python and hosts a series of models, of which Fun-CosyVoice 3.0 is the current version. The median response latency for issues and pull requests is 11.4 hours across 1,481 tracked items, which points to responsive maintainers, although the issue activity figures also show a large backlog of stale open issues.

The system supports synthesis across 9 major languages including Chinese, English, Japanese, Korean, German, Spanish, French, Italian, and Russian, along with 18 or more Chinese dialects and accents such as Cantonese, Minnan, Sichuan, and Northeastern Chinese. A distinctive feature is its support for zero-shot multilingual and cross-lingual voice cloning, allowing users to generate speech in different languages using reference voice samples without requiring language-specific training data. The model achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness according to evaluation benchmarks presented in the repository.

Fun-CosyVoice 3.0 introduces several advanced capabilities beyond basic text-to-speech. It supports pronunciation inpainting for Chinese Pinyin and English CMU phonemes, providing fine-grained control over pronunciation for production use cases. The system includes text normalization that handles numbers, special symbols, and various text formats without requiring a traditional frontend module. Bi-streaming support enables both text-input streaming and audio-output streaming with latency as low as 150 milliseconds while maintaining high-quality output. The model also responds to various instruction types including language selection, dialect specification, emotional tone, speech speed, and volume adjustments.
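
The bi-streaming idea described above (text arriving incrementally while audio is emitted incrementally) can be sketched with plain Python generators. This is a toy illustration under stated assumptions, not CosyVoice's implementation or API: `fake_synthesize`, `bistream`, `text_source`, the `min_chars` threshold, and the audio sizing constant are all hypothetical, and the "audio" is just silent PCM bytes.

```python
from typing import Iterable, Iterator

SAMPLES_PER_CHAR = 800  # hypothetical sizing: placeholder samples emitted per input character


def fake_synthesize(text: str) -> bytes:
    """Placeholder for a TTS model: returns 16-bit PCM silence sized by text length."""
    return b"\x00\x00" * (SAMPLES_PER_CHAR * len(text))


def bistream(text_chunks: Iterable[str], min_chars: int = 8) -> Iterator[bytes]:
    """Consume text incrementally and yield audio once min_chars are buffered."""
    buffer = ""
    for chunk in text_chunks:
        buffer += chunk
        if len(buffer) >= min_chars:
            yield fake_synthesize(buffer)
            buffer = ""
    if buffer:  # flush whatever text remains when the input stream ends
        yield fake_synthesize(buffer)


def text_source() -> Iterator[str]:
    """Stand-in for an upstream producer, such as text arriving from an LLM."""
    for piece in ["Hello, ", "this is ", "a streamed ", "sentence."]:
        yield piece


# Audio chunks become available before the full input text has been seen.
audio_chunks = []
for audio in bistream(text_source()):
    audio_chunks.append(audio)

total_samples = sum(len(chunk) for chunk in audio_chunks) // 2  # 2 bytes per sample
```

A smaller `min_chars` lets the first audio chunk go out sooner but gives the synthesizer less context per chunk. Any real streaming system has to balance that trade-off to reach low latency without degrading quality.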

The repository demonstrates active development with a clear roadmap spanning from 2024 through 2025. Recent additions include vLLM support for CosyVoice2 and 3, TensorRT-LLM integration for 4x acceleration in deployment scenarios, and Triton TensorRT-LLM runtime support contributed by NVIDIA. The project includes training infrastructure with support for flow matching and reinforcement learning approaches, along with FastAPI server and client implementations for service deployment.

The primary contributor, aluminumbox, has logged 1,192 events in the repository; JohnHerry and ScottishFold007 follow with 230 and 59 events respectively. The most prevalent issue label is stale, with 847 occurrences; this label is typically applied to issues that have gone inactive, so it reflects the size of the dormant backlog rather than the volume of discussion. Through overlapping contributors, the repository is linked to major open-source projects, including Microsoft's VSCode and TypeScript repositories and the Rust language repository.

CosyVoice is positioned as part of the broader FunAudioLLM ecosystem, which includes complementary projects like FunASR for speech recognition, SenseVoice for emotion detection, and FunClip for AI video processing. The models are available through multiple distribution channels including ModelScope and Hugging Face, with evaluation datasets and papers published for transparency and reproducibility.
