Commit

Update documentation

RetroCirce committed Nov 6, 2024
1 parent d2c4656 commit 4bca94a
Showing 90 changed files with 1,581 additions and 7,990 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: b34b5dcc849dfc5c0b598ed8e0ccac59
+config: f2da3651d1e17cc67f0575e41e054c50
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added _images/generation/conds.png
Binary file added _images/generation/definition.PNG
Binary file added _images/generation/diff1.png
Binary file added _images/generation/diff2.png
Binary file added _images/generation/diff3.png
Binary file added _images/generation/dit.png
Binary file added _images/generation/encodec.PNG
Binary file added _images/generation/evaluation-fid.PNG
Binary file added _images/generation/evaluation-is.PNG
Binary file added _images/generation/musicgen_arch.PNG
Binary file added _images/generation/musicgen_l1.PNG
Binary file added _images/generation/musicgen_p1.PNG
Binary file added _images/generation/musicgen_p2.PNG
Binary file added _images/generation/timeline.PNG
102 changes: 0 additions & 102 deletions _sources/annotation/code.ipynb

This file was deleted.

12 changes: 0 additions & 12 deletions _sources/annotation/data.md

This file was deleted.

217 changes: 0 additions & 217 deletions _sources/annotation/datasets.ipynb

This file was deleted.

56 changes: 0 additions & 56 deletions _sources/annotation/datasets.md

This file was deleted.

15 changes: 0 additions & 15 deletions _sources/annotation/evaluation.md

This file was deleted.

29 changes: 0 additions & 29 deletions _sources/annotation/intro.md

This file was deleted.

49 changes: 0 additions & 49 deletions _sources/annotation/models.md

This file was deleted.

55 changes: 0 additions & 55 deletions _sources/annotation/tasks.md

This file was deleted.

218 changes: 0 additions & 218 deletions _sources/annotation/test.ipynb

This file was deleted.

12 changes: 6 additions & 6 deletions _sources/generation/evaluation.md
@@ -4,11 +4,11 @@ We present **Evaluation** before discussing model architecture because, in a gen

## Listening Test

-The subjective listening test is the most effective method to evaluate the performance of music generation models. Drawing from techniques used in speech generation, two commonly applied methods in the subjective listening tests for audio generation are the Mean Opinion Score (MOS) {cite}`musicgenerationtemplate` and MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) {cite}`musicgenerationtemplate`. These methods provide valuable insights into listener perceptions and the overall quality of generated music.
+The subjective listening test is the most effective method to evaluate the performance of music generation models. Drawing from techniques used in speech generation, two commonly applied methods in the subjective listening tests for audio generation are the Mean Opinion Score (MOS) {cite}`DBLP:conf/icassp/GriffinL83` and MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) {cite}`mushra`. These methods provide valuable insights into listener perceptions and the overall quality of generated music.

### MOS Test (Mean Opinion Score)

-The purpose of the MOS (Mean Opinion Score) test is to evaluate the overall quality of a **single audio stimulus**. This method has been widely used in text-to-speech generation tasks {cite}`musicgenerationtemplate`, as well as in telecommunications and audio codec systems. The setup for a MOS test is cost-effective and straightforward, where testers rate each audio stimulus on a scale from 1 (poor) to 5 (excellent) based on their perception of audio quality or other specific criteria.
+The purpose of the MOS (Mean Opinion Score) test is to evaluate the overall quality of a **single audio stimulus**. This method has been widely used in text-to-speech generation tasks, as well as in telecommunications and audio codec systems. The setup for a MOS test is cost-effective and straightforward, where testers rate each audio stimulus on a scale from 1 (poor) to 5 (excellent) based on their perception of audio quality or other specific criteria.

One of the strengths of the MOS test is its suitability for situations where the overall subjective quality of a single audio piece needs to be assessed, rather than comparing different models or systems. However, its weakness lies in its feedback, which is less sensitive to small quality differences between audio stimuli and does not provide insight into the reasons behind a rating.
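The aggregation described above can be sketched in a few lines. This is a minimal illustration, not part of the documented toolchain: the helper name `mean_opinion_score` and the eight listener ratings are hypothetical, and reporting a normal-approximation confidence interval alongside the mean is a common but optional convention.

```python
import statistics

def mean_opinion_score(ratings):
    """Aggregate 1-5 listener ratings for one audio stimulus into a MOS,
    with a normal-approximation 95% confidence interval (hypothetical helper)."""
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("MOS ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Standard error of the mean; 1.96 is the ~95% normal quantile.
    sem = statistics.stdev(ratings) / len(ratings) ** 0.5
    return mos, (mos - 1.96 * sem, mos + 1.96 * sem)

# Hypothetical ratings from eight listeners for one generated clip.
score, ci = mean_opinion_score([4, 5, 3, 4, 4, 5, 3, 4])
```

A wide interval here signals that more raters are needed before comparing two systems' MOS values.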

@@ -22,11 +22,11 @@ One of the strengths of MUSHRA is its ability to provide a more detailed and sen

## Audio Diversity and Quality

-In addition to subjective listening tests, researchers have developed several objective metrics to evaluate generation performance from a statistical learning perspective. These metrics, originally derived from image generation tasks, include the Inception Score (IS) {cite}`musicgenerationtemplate` and the Fréchet Inception Distance (FID) {cite}`musicgenerationtemplate`.
+In addition to subjective listening tests, researchers have developed several objective metrics to evaluate generation performance from a statistical learning perspective. These metrics, originally derived from image generation tasks, include the Inception Score (IS) {cite}`DBLP:conf/nips/SalimansGZCRCC16` and the Fréchet Inception Distance (FID) {cite}`DBLP:conf/nips/HeuselRUNH17`.

### Inception Score

-The Inception Score (IS) is designed to evaluate the diversity and distinctiveness of outputs generated by generative models. To calculate the Inception Score, a representation model, such as VGGish {cite}`musicgenerationtemplate`, PANN {cite}`musicgenerationtemplate`, or CLAP {cite}`musicgenerationtemplate`, is required to create effective embeddings. The calculation process can be summarized in the following steps:
+The Inception Score (IS) is designed to evaluate the diversity and distinctiveness of outputs generated by generative models. To calculate the Inception Score, a representation model, such as VGGish {cite}`DBLP:conf/nips/HeuselRUNH17`, PANN {cite}`DBLP:journals/taslp/KongCIWWP20`, or CLAP {cite}`wu2023large`, is required to create effective embeddings. The calculation process can be summarized in the following steps:

1. Use a pretrained representation model to obtain deep neural embeddings for each generated output;
2. Calculate the average embedding of all generated outputs;
@@ -40,7 +40,7 @@ The second term reflects the evenness of the embeddings. When the IS is high, th
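For reference, the standard Inception Score formulation is IS = exp(E_x KL(p(y|x) || p(y))): confident per-sample predictions and an even marginal both push the score up, matching the distinctiveness/evenness reading above. The sketch below assumes class probabilities from some pretrained classifier head (the doc's steps describe an embedding-based variant, so this is an illustrative simplification, and the toy inputs are synthetic):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Standard IS from an (n_samples, n_classes) array of p(y|x) rows.

    probs would come from a classifier over generated outputs; any
    pretrained label predictor works (an assumption for illustration).
    """
    p_y = probs.mean(axis=0)  # marginal class distribution p(y)
    # Per-sample KL(p(y|x) || p(y)); eps guards log(0).
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))
    return float(np.exp(kl.sum(axis=1).mean()))

# Confident, evenly spread predictions maximize IS; uniform ones minimize it.
confident = np.eye(4)            # 4 samples, one per class -> IS near 4
uniform = np.full((4, 4), 0.25)  # indistinct predictions  -> IS of 1
```

The ceiling of the score is the number of classes, reached only when every sample is classified with full confidence and the classes are used evenly.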

### Fréchet Inception Distance (FID/FAD)

-The Fréchet Inception Distance (FID) {cite}`musicgenerationtemplate`, adapted for the audio domain as the Fréchet Audio Distance (FAD) {cite}`musicgenerationtemplate`, provides a comparable result based on the Inception Score, which was adopted into the audio domain as Fréchet Audio Distance (FAD) {cite}`musicgenerationtemplate`. The calculation process can be summarized in the following steps:
+The Fréchet Inception Distance (FID) {cite}`DBLP:conf/nips/HeuselRUNH17`, adapted for the audio domain as the Fréchet Audio Distance (FAD) {cite}`DBLP:conf/interspeech/KilgourZRS19`, provides a comparable result based on the Inception Score. The calculation process can be summarized in the following steps:

1. Use a pretrained representation model to obtain deep neural embeddings for both the generated outputs and **the data points in the reference set**;
2. Calculate the average embedding and covariance matrix of the generated outputs, and the average embedding and covariance matrix of the reference data;
@@ -52,7 +52,7 @@ The key difference between IS and FID/FAD is that while IS evaluates the distrib
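The means and covariances from the steps above feed the closed-form Fréchet distance between two Gaussians, ||μ_g − μ_r||² + Tr(Σ_g + Σ_r − 2(Σ_g Σ_r)^½). A numpy-only sketch follows; the symmetric matrix-square-root rewrite is a standard implementation trick, not prescribed by the text, and the embeddings are assumed to come from a pretrained model such as VGGish:

```python
import numpy as np

def _psd_sqrt(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_audio_distance(gen_emb, ref_emb):
    """Fréchet distance between Gaussians fit to two (n, dim) embedding sets."""
    mu_g, mu_r = gen_emb.mean(axis=0), ref_emb.mean(axis=0)
    cov_g = np.cov(gen_emb, rowvar=False)
    cov_r = np.cov(ref_emb, rowvar=False)
    # Tr((S_g S_r)^1/2) via the symmetric form (S_g^1/2 S_r S_g^1/2)^1/2.
    s = _psd_sqrt(cov_g)
    tr_covmean = np.trace(_psd_sqrt(s @ cov_r @ s))
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g) + np.trace(cov_r) - 2 * tr_covmean)
```

Identical sets score (numerically) zero, and a pure mean shift contributes exactly its squared norm, which makes the metric easy to sanity-check.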

## Text Relevance

-In the text-to-music generation task, it is essential to assess the correspondence between the generated output and the reference textual input to evaluate the performance of multi-modal learning and generation. The CLAP Score {cite}`musicgenerationtemplate` is commonly used for this purpose, leveraging a contrastive language-audio pretraining module:
+In the text-to-music generation task, it is essential to assess the correspondence between the generated output and the reference textual input to evaluate the performance of multi-modal learning and generation. The CLAP Score {cite}`wu2023large` is commonly used for this purpose, leveraging a contrastive language-audio pretraining module:

1. Use the pretrained CLAP model to obtain embeddings for both the generated audio and the reference text;
2. Calculate the dot product or cosine similarity for each text-audio pair and average their scores to derive the final CLAP score.
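Step 2 reduces to a row-wise cosine similarity over matched pairs. A sketch assuming precomputed CLAP embeddings (obtaining them from the pretrained towers is outside this snippet, and the function name is illustrative):

```python
import numpy as np

def clap_score(text_emb, audio_emb):
    """Average cosine similarity over matched (n_pairs, dim) text/audio
    embeddings, assumed to come from a CLAP model's two towers."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    # Row-wise dot product of unit vectors = cosine similarity per pair.
    return float(np.mean(np.sum(t * a, axis=1)))
```

Because the embeddings are L2-normalized first, the dot-product and cosine-similarity variants mentioned in step 2 coincide here.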
