whisper by openai

Description: Robust Speech Recognition via Large-Scale Weak Supervision


Summary Information

Updated 1 hour ago

Added to GitGenius on January 2nd, 2024

Created on September 16th, 2022

Open Issues & Pull Requests: 135 (+0)

Number of forks: 12,749

Total Stargazers: 104,642 (+0)

Total Subscribers: 753 (+0)

Issue Activity (beta)

Open issues: 0

New in 7 days: 0

Closed in 7 days: 0

Avg open age: N/A days

Stale 30+ days: 0

Stale 90+ days: 0

Recent activity

Opened in 7 days: 0

Closed in 7 days: 0

Comments in 7 days: 0

Events in 7 days: 0


Detailed Description

Whisper is OpenAI's general-purpose speech recognition model trained on a large dataset of diverse audio to perform multiple speech processing tasks. The model can handle multilingual speech recognition, speech translation, language identification, and voice activity detection within a single unified architecture. Rather than requiring separate specialized models for each task, Whisper uses a Transformer sequence-to-sequence architecture where different speech processing tasks are jointly represented as sequences of tokens to be predicted by the decoder. Special tokens serve as task specifiers or classification targets, allowing the multitask training format to consolidate what would traditionally be many stages of a speech-processing pipeline into one model.
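The task-specifier idea above can be sketched as a small prompt builder. This is an illustration only: the function name `build_task_prompt` is hypothetical, the token spellings follow the format described in the Whisper paper, and the real tokenizer maps these to integer IDs rather than passing strings to the decoder.

```python
from typing import List

# Tasks the multitask format distinguishes with a special token.
VALID_TASKS = ("transcribe", "translate")


def build_task_prompt(language: str, task: str = "transcribe",
                      timestamps: bool = False) -> List[str]:
    """Return the special tokens that tell the decoder what to do.

    Hypothetical helper for illustration; not part of the whisper package.
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")

    tokens = ["<|startoftranscript|>"]  # marks the start of a prediction
    tokens.append(f"<|{language}|>")    # language tag, e.g. <|en|>
    tokens.append(f"<|{task}|>")        # task specifier
    if not timestamps:
        # Ask the decoder for plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


print(build_task_prompt("en"))
print(build_task_prompt("fr", task="translate", timestamps=True))
```

Because the task is just another token in the sequence, switching from transcription to translation changes one element of the prompt rather than the model.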

The repository provides six model sizes, four of which also come in English-only versions. Five of them scale from tiny, with 39 million parameters, to large, with 1550 million parameters, and each trades speed against accuracy: the tiny model runs approximately 10 times faster than the large model on an A100 GPU, at reduced accuracy. The sixth, a newer turbo model with 809 million parameters, runs at a speed comparable to the base model while keeping accuracy closer to the large model, but it is not trained for translation tasks. The English-only variants, particularly tiny.en and base.en, tend to perform better for English-specific applications, though this advantage diminishes for larger model sizes.
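The naming rules in this paragraph can be captured in a few lines. The helper below, `pick_model_name`, is a hypothetical sketch. It assumes, per the upstream README, that the four sizes with English-only versions are tiny, base, small, and medium, and that the English-only name is the size followed by `.en`.

```python
# Sizes that also ship an English-only ".en" variant (assumption from the
# upstream README; the text above only names tiny.en and base.en).
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}
ALL_SIZES = ENGLISH_ONLY_SIZES | {"large", "turbo"}


def pick_model_name(size: str, english_only: bool = False,
                    task: str = "transcribe") -> str:
    """Return a model name string for the given size and use case.

    Hypothetical helper for illustration; not part of the whisper package.
    """
    if size not in ALL_SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {sorted(ALL_SIZES)}")
    if task == "translate" and size == "turbo":
        # The turbo model is not trained for translation tasks.
        raise ValueError("turbo is not trained for translation; pick another size")
    if english_only:
        if size not in ENGLISH_ONLY_SIZES:
            raise ValueError(f"{size!r} has no English-only version")
        return f"{size}.en"
    return size


print(pick_model_name("base", english_only=True))
print(pick_model_name("turbo"))
```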

Installation requires Python 3.8 through 3.11 and a recent PyTorch version; the codebase was originally developed on Python 3.9.9 and PyTorch 1.10.1. The package depends on OpenAI's tiktoken for fast tokenization and requires the ffmpeg command-line tool for audio processing. Users can install the released package with pip or install directly from the repository. Rust may be needed as a build dependency if precompiled wheels are unavailable for the user's platform.
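Before installing, the two environment requirements named above can be checked with the standard library alone. This is a minimal sketch: the function names are hypothetical, and it only tests that the Python version falls in the stated 3.8 to 3.11 range and that an `ffmpeg` executable is on the PATH. It does not check PyTorch or Rust.

```python
import shutil
import sys

# Version range stated in the installation notes.
SUPPORTED_MIN = (3, 8)
SUPPORTED_MAX = (3, 11)


def python_version_supported(version=None) -> bool:
    """Return True if (major, minor) lies within the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


def preflight_report() -> list:
    """Collect human-readable problems; an empty list means both checks passed."""
    problems = []
    if not python_version_supported():
        found = ".".join(str(part) for part in sys.version_info[:2])
        problems.append(f"Python {found} is outside the supported 3.8-3.11 range")
    if not ffmpeg_available():
        problems.append("ffmpeg was not found on the PATH")
    return problems


for problem in preflight_report():
    print("warning:", problem)
```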

The model's performance varies significantly by language, with word error rates and character error rates documented across the Common Voice 15 and Fleurs datasets. The repository includes command-line tools for transcribing audio files and a Python API for programmatic access. The transcribe method slides a 30-second window over the audio and makes autoregressive sequence-to-sequence predictions on each window. Lower-level functions like detect_language and decode provide direct access to individual model components for more granular control.
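The two levels of the Python API might be used roughly as follows. This sketch assumes the `openai-whisper` package is installed. The wrapper names `transcribe_file` and `detect_and_decode` are hypothetical, as is any file path passed to them. The `whisper` import is deferred so the module loads without the package, and a preloaded model can be passed in.

```python
def transcribe_file(path, model=None, model_name="turbo"):
    """High-level path: return the transcript text for one audio file."""
    if model is None:
        # Deferred import: only needed when no model object is supplied.
        import whisper
        model = whisper.load_model(model_name)
    # transcribe() handles windowing over the full file internally.
    result = model.transcribe(path)
    return result["text"].strip()


def detect_and_decode(path, model):
    """Lower-level path: language detection and decoding of one 30-second window."""
    import whisper

    # Load the audio and pad or trim it to exactly 30 seconds.
    audio = whisper.load_audio(path)
    audio = whisper.pad_or_trim(audio)

    # Compute the log-Mel spectrogram on the same device as the model.
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

    # detect_language returns per-language probabilities for this window.
    _, probs = model.detect_language(mel)
    language = max(probs, key=probs.get)

    # Decode the window with default options.
    result = whisper.decode(model, mel, whisper.DecodingOptions())
    return language, result.text
```

Calling `transcribe_file("audio.mp3")` would load the model and return the full transcript, whereas `detect_and_decode` only looks at the first 30 seconds, which is enough for language identification.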

The codebase is released under the MIT License, and the repository includes comprehensive documentation through a blog post, academic paper, model card, and Colab notebook example. The project maintains a Show and Tell discussions category where users share extensions, integrations, web demos, and ports to different platforms, indicating an active community building on top of the core Whisper implementation.
