Description: Metrics for evaluating music and audio generative models – with a focus on long-form, full-band, and stereo generations.
Detailed Description
The `stable-audio-metrics` repository, developed by Stability AI, provides a collection of metrics specifically designed for evaluating the performance of music and audio generative models. Its primary purpose is to offer a robust and standardized way to assess the quality of audio generated by these models, particularly focusing on the more complex and realistic scenarios of long-form, full-band, and stereo audio generation. This focus distinguishes it from metrics that might be optimized for simpler audio formats or shorter durations.
The repository offers three core metrics, each leveraging established techniques in audio analysis and comparison. First, it implements the Fréchet distance, computed over OpenL3 embeddings; this metric assesses the similarity between generated and reference audio by comparing their feature distributions in a learned embedding space. Second, it incorporates the Kullback–Leibler divergence, computed using the PaSST library; this metric quantifies the difference between probability distributions of audio features, giving a measure of statistical dissimilarity between generated and reference audio. Finally, it includes the CLAP score, based on the CLAP-LAION model. CLAP (Contrastive Language-Audio Pretraining) scores evaluate how well the generated audio aligns with its textual description or prompt, offering a measure of semantic fidelity.
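As a rough illustration of the first metric, the Fréchet distance between two embedding sets reduces to a closed-form distance between Gaussians fitted to each set. The sketch below uses random arrays in place of OpenL3 embeddings and is not the repository's actual implementation:

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n_samples, dim)
    embedding sets. In stable-audio-metrics the embeddings would come
    from OpenL3; here they are arbitrary arrays for illustration."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
same = rng.normal(size=(500, 8))
shifted = rng.normal(loc=1.0, size=(500, 8))
print(frechet_distance(same, same))     # ~0 for identical sets
print(frechet_distance(same, shifted))  # grows with the mean shift between sets
```

A lower score means the generated set's embedding distribution sits closer to the reference set's, which is why the metric is reported as a distance rather than a similarity.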
A key feature of `stable-audio-metrics` is its adaptability to variable-length audio inputs, making it suitable for evaluating the long-form audio generations that are increasingly common in modern generative models. The repository is designed to be user-friendly, providing clear installation instructions and example scripts to facilitate its use. Installation involves cloning the repository, creating a Python virtual environment, and installing the necessary dependencies. The repository explicitly supports GPU usage for faster processing, acknowledging that CPU-based computations can be slow. Troubleshooting tips are provided, including potential compatibility issues with older CUDA versions due to dependencies.
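A typical setup along the lines described above might look as follows; the clone URL and requirements file name are inferred from the project description, not copied from its README:

```shell
# Sketch of the install steps: clone, create a virtual environment,
# install dependencies. File names here are assumptions.
git clone https://github.com/Stability-AI/stable-audio-metrics.git
cd stable-audio-metrics
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt  # a CUDA-capable GPU is recommended for speed
```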
The repository's documentation is well-structured, with detailed information available in the source code files for each metric. Example scripts are provided to demonstrate how to use the metrics with popular datasets like MusicCaps, AudioCaps, and Song Describer. These examples showcase how to evaluate generated audio against reference datasets, allowing users to compare the performance of their models. The documentation also includes "no-audio" examples, which enable evaluation without downloading the datasets by utilizing pre-computed statistics and embeddings. This feature streamlines the evaluation process and allows for quicker experimentation.
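The "no-audio" idea can be pictured as caching the reference side's statistics once and reloading them for later evaluations; the file name and array layout below are hypothetical, not the repository's actual format:

```python
import numpy as np

# One-time step: fit and store reference statistics
# (the .npz file name is illustrative only).
ref_emb = np.random.default_rng(1).normal(size=(1000, 8))
np.savez("musiccaps_openl3_stats.npz",
         mu=ref_emb.mean(axis=0),
         sigma=np.cov(ref_emb, rowvar=False))

# Later: evaluate generated audio against the cached statistics,
# without re-downloading or re-embedding the reference dataset.
stats = np.load("musiccaps_openl3_stats.npz")
mu_ref, sigma_ref = stats["mu"], stats["sigma"]
print(mu_ref.shape, sigma_ref.shape)  # (8,) (8, 8)
```

Since the Fréchet distance only needs each side's mean and covariance, storing those two arrays is enough to stand in for the full reference dataset.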
The repository also provides guidance on data structure, specifying how generated audio files should be organized and named so they remain compatible with the evaluation scripts. It gives clear examples of the expected file layout for the MusicCaps and AudioCaps datasets and encourages users to adapt that layout to their own data. This standardization simplifies integrating the metrics into existing workflows. Furthermore, the repository includes specific instructions for comparing models against Stable Audio, ensuring a fair comparison by handling resampling and mono/stereo conversion. In essence, `stable-audio-metrics` is a comprehensive, practical toolkit for researchers and developers working on music and audio generation, enabling them to objectively assess and improve the quality of their models.
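The resampling and channel handling mentioned above can be sketched as follows; the 44.1 kHz stereo target is an illustrative assumption rather than a value taken from the repository:

```python
import numpy as np
from scipy.signal import resample_poly

def match_format(audio: np.ndarray, sr_in: int, sr_out: int = 44100,
                 stereo: bool = True) -> np.ndarray:
    """Resample and fix the channel count so two models are compared
    like-for-like. Target rate/layout are assumptions for illustration."""
    if audio.ndim == 1:                       # mono -> (samples, 1)
        audio = audio[:, None]
    if sr_in != sr_out:
        audio = resample_poly(audio, sr_out, sr_in, axis=0)
    if stereo and audio.shape[1] == 1:        # duplicate mono into both channels
        audio = np.repeat(audio, 2, axis=1)
    elif not stereo and audio.shape[1] == 2:  # average stereo down to mono
        audio = audio.mean(axis=1, keepdims=True)
    return audio

# One second of 16 kHz mono noise becomes one second of 44.1 kHz stereo.
mono_16k = np.random.default_rng(2).normal(size=16000).astype(np.float32)
out = match_format(mono_16k, sr_in=16000)
print(out.shape)  # (44100, 2)
```

Normalizing both systems to a common rate and channel layout before embedding ensures the metrics measure generation quality rather than format differences.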