whisperlivekit
by
quentinfuxa

Description: Simultaneous speech-to-text model


Summary Information

Updated 59 minutes ago
Added to GitGenius on September 5th, 2025
Created on December 19th, 2024
Open Issues/Pull Requests: 26 (+0)
Number of forks: 965
Total Stargazers: 9,729 (+0)
Total Subscribers: 60 (+0)
Detailed Description

WhisperLiveKit combines OpenAI's Whisper speech-to-text model with the real-time capabilities of LiveKit, a WebRTC-based platform for building live audio/video applications. It provides live, low-latency transcription of audio streams in a multi-participant setting, making it well suited to live captioning, real-time meeting notes, and accessibility features for online events. The repository provides the components needed to integrate Whisper's transcription directly into a LiveKit room.

At its core, the project leverages Whisper.cpp, a C++ port of OpenAI's Whisper model optimized for running on CPUs and GPUs. This matters for performance, since real-time inference with a model the size of Whisper requires significant computational resources. WhisperLiveKit does not rely on the OpenAI API directly, which avoids per-request costs and gives more control over the transcription process; instead, it downloads a quantized Whisper model and runs it locally, enabling offline operation and lower latency. The project supports several model sizes (tiny, base, small, medium, large), offering a trade-off between accuracy and speed.
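To make the accuracy/speed trade-off concrete, the sketch below pairs each checkpoint name with its published parameter count and picks the largest model that fits a compute budget. The parameter counts are the sizes OpenAI publishes for the Whisper checkpoints; the helper itself is illustrative and not part of WhisperLiveKit.

```python
# Approximate parameter counts (in millions) per Whisper checkpoint,
# ordered smallest to largest.
WHISPER_SIZES = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}

def pick_model(max_params_m: int) -> str:
    """Return the largest (most accurate) checkpoint within the budget."""
    candidates = [name for name, params in WHISPER_SIZES.items()
                  if params <= max_params_m]
    if not candidates:
        raise ValueError("no Whisper checkpoint fits the parameter budget")
    # Dicts preserve insertion order, so the last candidate is the largest.
    return candidates[-1]

print(pick_model(300))  # a mid-range budget selects "small"
```

Quantization shifts this trade-off further: an int8-quantized model of a given size runs faster and uses less memory than its float32 counterpart, at a small accuracy cost.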

The architecture is cleverly designed. Audio streams from LiveKit participants are captured and sent to a dedicated "transcriber" service. This service, built with FastAPI, receives the audio chunks, preprocesses them for Whisper, performs the transcription, and publishes the resulting text back to the LiveKit room via data channels. This separation of concerns, with LiveKit handling real-time audio/video transport and the FastAPI service handling transcription, keeps the system modular and scalable. The use of data channels ensures that transcriptions reach all participants in the room with minimal delay.
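The "preprocesses them for Whisper" step typically means converting raw 16-bit PCM audio into normalized floating-point samples, which is what Whisper-family models consume. A minimal stdlib sketch of that conversion, with an illustrative function name not taken from the repository:

```python
import struct

def pcm16_to_float(chunk: bytes) -> list[float]:
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0]."""
    count = len(chunk) // 2
    samples = struct.unpack(f"<{count}h", chunk[: count * 2])
    return [s / 32768.0 for s in samples]

# A two-sample chunk: full-scale negative (-32768) followed by silence (0).
chunk = struct.pack("<2h", -32768, 0)
print(pcm16_to_float(chunk))  # [-1.0, 0.0]
```

In a real deployment this conversion (plus resampling to 16 kHz mono) would run on each chunk arriving from LiveKit before it is queued for the model.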

Key components include a LiveKit client application (an example is provided in React) that joins a room and sends/receives audio, a FastAPI server that hosts the Whisper transcriber, and Docker Compose files for easy deployment. The repository also includes configuration options for customizing the Whisper model, audio parameters (sample rate, channels), and LiveKit room settings. A significant feature is support for speaker diarization, which attempts to identify *who* is speaking at any given time, making the transcriptions considerably more useful in multi-participant settings.
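The tunables listed above can be grouped into a single settings object. The field names and defaults below are assumptions for the sketch, not the repository's actual configuration schema:

```python
from dataclasses import dataclass

@dataclass
class TranscriberConfig:
    # Field names and defaults are illustrative, not WhisperLiveKit's own.
    model_size: str = "small"   # tiny | base | small | medium | large
    sample_rate: int = 16_000   # Whisper expects 16 kHz audio
    channels: int = 1           # mono input
    room_name: str = "default"  # LiveKit room to join
    diarization: bool = False   # enable speaker identification

cfg = TranscriberConfig(model_size="base", diarization=True)
print(cfg.model_size, cfg.sample_rate, cfg.diarization)
```

Centralizing these values in one object makes it straightforward to populate them from environment variables or a Docker Compose file at deploy time.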

The project is still under active development, but it demonstrates a compelling use case for combining cutting-edge speech-to-text technology with real-time communication platforms. It is a valuable resource for developers looking to add live transcription to their LiveKit-based applications, offering a flexible and cost-effective alternative to relying solely on cloud-based transcription services. The documentation, while still evolving, provides a good starting point for understanding the architecture and deploying the system. Future development will likely focus on improving diarization accuracy, optimizing performance for different hardware configurations, and expanding the range of supported Whisper models.

