Merge pull request #2 from RetroCirce/gh-pages
add text-to-music generation section
RetroCirce authored Oct 26, 2024
2 parents a4f290b + 23fed56 commit 55145be
Showing 379 changed files with 53,462 additions and 3,671 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 79bb7ee00347aa10773981b437044cde
config: 4f3b7b65b49c13cc9151b09ed2b87d73
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added _images/definition.PNG
Binary file added _images/evaluation-fid.PNG
Binary file added _images/evaluation-is.PNG
Binary file added _images/generation/definition.PNG
Binary file added _images/generation/encodec.PNG
Binary file added _images/generation/evaluation-fid.PNG
Binary file added _images/generation/evaluation-is.PNG
Binary file added _images/generation/musicgen_arch.PNG
Binary file added _images/generation/musicgen_l1.PNG
Binary file added _images/generation/musicgen_p1.PNG
Binary file added _images/generation/musicgen_p2.PNG
Binary file added _images/generation/timeline.PNG
Binary file added _images/timeline.PNG
1 change: 1 addition & 0 deletions _sources/generation/beyondtext.md
@@ -0,0 +1 @@
# Beyond Text-Based Interactions
1 change: 1 addition & 0 deletions _sources/generation/diffusionmodel.md
@@ -0,0 +1 @@
# StableAudio - Diffusion-based Model
73 changes: 73 additions & 0 deletions _sources/generation/evaluation.md
@@ -0,0 +1,73 @@
# Evaluation

We present **Evaluation** before discussing model architecture because, in a generation task, it is crucial to understand the objective metrics that assess diversity, overall quality, relevance, and other aspects of music generation performance. Additionally, recognizing the limitations of these objective metrics and complementing them with subjective evaluation methods is invaluable, as the ultimate assessment of music generation relies on **listening** rather than numbers.

## Listening Test

The subjective listening test is the most effective method to evaluate the performance of music generation models. Drawing from techniques used in speech generation, two commonly applied methods in the subjective listening tests for audio generation are the Mean Opinion Score (MOS) {cite}`musicgenerationtemplate` and MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) {cite}`musicgenerationtemplate`. These methods provide valuable insights into listener perceptions and the overall quality of generated music.

### MOS Test (Mean Opinion Score)

The purpose of the MOS (Mean Opinion Score) test is to evaluate the overall quality of a **single audio stimulus**. This method has been widely used in text-to-speech generation tasks {cite}`musicgenerationtemplate`, as well as in telecommunications and audio codec systems. The setup for a MOS test is cost-effective and straightforward: testers rate each audio stimulus on a scale from 1 (poor) to 5 (excellent) based on their perception of audio quality or other specific criteria.

One of the strengths of the MOS test is its suitability for situations where the overall subjective quality of a single audio piece needs to be assessed, rather than comparing different models or systems. However, its weakness lies in the coarseness of its feedback: it is less sensitive to small quality differences between audio stimuli, and it does not provide insight into the reasons behind a rating.
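
As a rough illustration of how MOS ratings are typically aggregated, the sketch below averages the 1-to-5 ratings collected for one system and attaches an approximate 95% confidence interval. The function name `mean_opinion_score`, the normal-approximation interval (`z = 1.96`), and the example ratings are assumptions made for this sketch, not part of any standardized toolkit.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `ratings` holds one score in 1..5 per (listener, stimulus) judgment.
    The normal-approximation interval is an assumption of this sketch.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) is the standard error.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


ratings = [4, 5, 3, 4, 4, 5, 3, 4]  # made-up ratings for a single system
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean makes it easier to see when two systems' MOS values are too close to separate, which is the insensitivity noted above.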

### MUSHRA Test (Multiple Stimuli with Hidden Reference and Anchor)

Unlike the MOS test, the MUSHRA test is considered a more advanced method for the detailed evaluation of multiple audio stimuli and systems.

The MUSHRA setup requires testers to listen to several versions of the same audio signal, including a high-quality reference (hidden reference) and a lower-quality version (anchor). Testers rate each stimulus on a continuous scale from 0 (poor) to 100 (excellent) based on perceived audio quality. The MUSHRA test is often designed to evaluate different model ablations, particularly when the differences are subtle.

One of the strengths of MUSHRA is its ability to provide a more detailed and sensitive evaluation of quality differences between audio stimuli and different models. The inclusion of a reference and an anchor helps ensure that participants can offer more accurate responses. However, a notable weakness of MUSHRA is its design complexity; it is more elaborate and time-consuming compared to the MOS test. Additionally, since participants typically evaluate multiple stimuli for each audio sample in a MUSHRA test, this can lead to fatigue during long sessions.

## Audio Diversity and Quality

In addition to subjective listening tests, researchers have developed several objective metrics to evaluate generation performance from a statistical learning perspective. These metrics, originally derived from image generation tasks, include the Inception Score (IS) {cite}`musicgenerationtemplate` and the Fréchet Inception Distance (FID) {cite}`musicgenerationtemplate`.

### Inception Score

The Inception Score (IS) is designed to evaluate the diversity and distinctiveness of outputs generated by generative models. To calculate the Inception Score, a representation model, such as VGGish {cite}`musicgenerationtemplate`, PANN {cite}`musicgenerationtemplate`, or CLAP {cite}`musicgenerationtemplate`, is required to create effective embeddings. The calculation process can be summarized in the following steps:

1. Use a pretrained representation model to obtain deep neural embeddings for each generated output;
2. Calculate the average embedding of all generated outputs;
3. Compute the Inception Score using the Kullback-Leibler (KL) Divergence between the embeddings:

![evaluation_is](../img/generation/evaluation-is.PNG)

The first term of the KL Divergence represents the entropy of the embedding distribution, serving as an effective indicator of classification results. A high IS indicates that each embedding is distinct, as the representation model can confidently assign a unique label to each generated output.

The second term reflects the evenness of the embeddings. A high IS here reflects the desideratum that the generated outputs cover all possible types of audio, which corresponds to a flat distribution.
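
The steps above can be sketched in a few lines of Python. This sketch assumes the standard definition from the image-generation literature, $\mathrm{IS} = \exp\big(\mathbb{E}_x\,[\mathrm{KL}(p(y \mid x)\,\|\,p(y))]\big)$, and assumes each "embedding" is a class-probability vector $p(y \mid x)$ (for example, the softmax output of the representation model's classifier head). The function name and the toy inputs are hypothetical.

```python
import math


def inception_score(probs, eps=1e-12):
    """probs: list of per-sample class-probability vectors p(y|x), each summing to 1."""
    n = len(probs)
    k = len(probs[0])
    # Step 2: the marginal p(y) is the average prediction over all generated samples.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    # Step 3: average KL(p(y|x) || p(y)) over samples, then exponentiate.
    mean_kl = 0.0
    for p in probs:
        mean_kl += sum(
            pj * (math.log(pj + eps) - math.log(mj + eps))
            for pj, mj in zip(p, marginal)
        )
    mean_kl /= n
    return math.exp(mean_kl)


# Confident and evenly spread predictions: IS approaches the number of classes.
confident_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Uninformative predictions: every sample matches the marginal, so IS is 1.
uncertain = [[1 / 3, 1 / 3, 1 / 3]] * 3

print(inception_score(confident_diverse), inception_score(uncertain))
```

The two toy cases bracket the score: it is bounded below by 1 and above by the number of classes, so it rewards outputs that are individually distinct and collectively varied.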

### Fréchet Inception Distance (FID/FAD)

The Fréchet Inception Distance (FID) {cite}`musicgenerationtemplate`, adapted for the audio domain as the Fréchet Audio Distance (FAD) {cite}`musicgenerationtemplate`, builds on the idea behind the Inception Score but compares the generated outputs against a reference set. The calculation process can be summarized in the following steps:

1. Use a pretrained representation model to obtain deep neural embeddings for both the generated outputs and **the data points in the reference set**;
2. Calculate the average embedding of all generated outputs, the average embedding of the reference data, the covariance matrix of all generated outputs, and the covariance matrix of the reference data;
3. The FID/FAD is then computed using these values:

![evaluation_fid](../img/generation/evaluation-fid.PNG)

The key difference between IS and FID/FAD is that while IS evaluates the distribution of generated outputs, FID/FAD compares this distribution against that of real data, providing a more comprehensive measure of generation quality.
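
A minimal NumPy sketch of this computation follows, assuming the usual Fréchet distance between two Gaussians, $\lVert \mu_g - \mu_r \rVert^2 + \mathrm{tr}\big(\Sigma_g + \Sigma_r - 2(\Sigma_g \Sigma_r)^{1/2}\big)$. The function name `frechet_distance` and the random "embeddings" are placeholders; in practice the arrays would come from a pretrained representation model, and the embedding dimension is assumed to be at least 2.

```python
import numpy as np


def frechet_distance(gen, ref):
    """gen, ref: arrays of shape (num_samples, dim) holding embeddings (dim >= 2)."""
    mu_g, mu_r = gen.mean(axis=0), ref.mean(axis=0)
    cov_g = np.cov(gen, rowvar=False)
    cov_r = np.cov(ref, rowvar=False)
    # tr(sqrt(cov_g cov_r)) equals the sum of the square roots of the eigenvalues
    # of cov_g @ cov_r, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_g @ cov_r)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g) + np.trace(cov_r) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
ref = rng.normal(size=(500, 4))   # stand-in for reference-set embeddings
shifted = ref + 3.0               # same spread, mean moved by 3 in each of 4 dims

print(frechet_distance(ref, ref))      # close to 0: identical distributions
print(frechet_distance(shifted, ref))  # close to 4 * 3**2 = 36: mean term only
```

Unlike IS, a lower value is better here, since the quantity is a distance to the reference distribution.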

## Text Relevance

In the text-to-music generation task, it is essential to assess the correspondence between the generated output and the reference textual input to evaluate the performance of multi-modal learning and generation. The CLAP Score {cite}`musicgenerationtemplate` is commonly used for this purpose, leveraging a contrastive language-audio pretraining module:

1. Use the pretrained CLAP model to obtain embeddings for both the generated audio and the reference text;
2. Calculate the dot product or cosine similarity for each text-audio pair and average their scores to derive the final CLAP score.
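
The two steps above can be sketched as follows, assuming the audio and text embeddings have already been extracted with a pretrained CLAP model. The sketch deliberately does not call any real CLAP API; `clap_score` and the toy vectors are hypothetical names and data used only to show the pairing and averaging.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clap_score(audio_embeddings, text_embeddings):
    """Average cosine similarity over aligned (generated audio, prompt text) pairs."""
    if len(audio_embeddings) != len(text_embeddings):
        raise ValueError("need exactly one text embedding per audio embedding")
    sims = [cosine_similarity(a, t) for a, t in zip(audio_embeddings, text_embeddings)]
    return sum(sims) / len(sims)


# Toy embeddings: the first pair is perfectly aligned, the second is orthogonal.
audio = [[1.0, 0.0], [0.0, 2.0]]
text = [[2.0, 0.0], [1.0, 0.0]]
score = clap_score(audio, text)
print(score)
```

Cosine similarity is used here rather than the raw dot product so that the score does not depend on embedding magnitude; if the embeddings are already L2-normalized, the two coincide.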

Additionally, the cosine similarity between the generated audio embedding and the reference audio embedding can also be useful for assessing the audio quality and diversity of the generative model. This metric can be integrated into the calculations for IS and FID/FAD scores.

## Limitation

The limitations of IS, FID/FAD, and the CLAP score can be summarized in the following key areas:

1. Embedding Effectiveness: All scores are entirely dependent on the effectiveness of the representation models used. Therefore, selecting an appropriate and effective representation model for calculating embeddings is crucial. Additionally, understanding the limitations of these models can help identify potential corner cases.

2. Distribution-Level Matching: IS and FID/FAD are based on the divergence between the distribution of generated outputs and either the average output (IS) or the reference data (FID/FAD). A better score (a higher IS or a lower FID/FAD) typically indicates superior quality and diversity, but it can be misleading if the model is able to "cheat" the evaluation. Conversely, a worse score does not necessarily indicate poor quality or diversity; the distribution of generated outputs may simply be shifted relative to the reference distribution while still reflecting good quality.

Given these limitations, it is highly recommended to combine both subjective and objective metrics when evaluating music generation models.



34 changes: 33 additions & 1 deletion _sources/generation/intro.md
@@ -1 +1,33 @@
# Music Generation
# Introduction


## History

The history of music generation dates back to the 1950s {cite}`musicgenerationtemplate`, originating as algorithmic composition. By the 1990s, researchers began applying neural networks to symbolic music generation {cite}`musicgenerationtemplate`. Simultaneously, real-time interactive art creation started incorporating music accompaniment, blending generative music with dynamic artistic expression {cite}`musicgenerationtemplate`.

![music_generation_timeline](../img/generation/timeline.PNG)

Since 2015, the exploration of deep-learning models in symbolic and audio-domain music generation has grown rapidly, as shown in the timeline above.
Researchers at Google applied recurrent neural networks (RNNs) to melody generation, encoding melodic notes as distinct states of pitch and duration to enable predictive modeling {cite}`musicgenerationtemplate`.
MidiNet {cite}`musicgenerationtemplate` and Performance RNN further improved the expressive capabilities of generative models, enhancing articulation and expressiveness in generated music.
Style transfer for specific composers was achieved in DeepBach {cite}`musicgenerationtemplate`, developed at Sony CSL, which generates Bach-style chorales.
Breakthroughs in deep generative models soon led to three notable symbolic music generation models, namely MuseGAN {cite}`musicgenerationtemplate`, Music Transformer {cite}`musicgenerationtemplate`, and MusicVAE {cite}`musicgenerationtemplate`, emerging almost simultaneously between 2018 and 2020.
These architectures paved the way for subsequent models focused on higher quality, efficiency, and greater control, such as REMI {cite}`musicgenerationtemplate`, SketchNet {cite}`musicgenerationtemplate`, PianotreeVAE {cite}`musicgenerationtemplate`, Multitrack Music Transformer {cite}`musicgenerationtemplate` and others.

Recently, the development of diffusion models {cite}`musicgenerationtemplate` and masked generative models {cite}`musicgenerationtemplate` has introduced new paradigms for symbolic music generation. Models such as VampNet {cite}`musicgenerationtemplate` and Polyffusion {cite}`musicgenerationtemplate` have expanded the possibilities and inspired further innovation in this field. Additionally, the Anticipatory Music Transformer {cite}`musicgenerationtemplate` leverages language model architectures to achieve impressive performance across a broad spectrum of symbolic music generation tasks.

Compared to the symbolic music domain, music generation in the audio domain, which focuses on directly generating musical signals, initially faced challenges in generation quality due to data limitations, model architecture constraints, and computational bottlenecks.
Early audio generation research primarily focused on speech, exemplified by models like WaveNet {cite}`musicgenerationtemplate` and SampleRNN {cite}`musicgenerationtemplate`. NSynth {cite}`musicgenerationtemplate`, developed by Google Magenta, marked the first project to synthesize musical signals, which later evolved into DDSP {cite}`musicgenerationtemplate`. OpenAI introduced Jukebox {cite}`musicgenerationtemplate` to generate music directly from the model without relying on synthesis tools from symbolic music notes. SaShiMi {cite}`musicgenerationtemplate` applied the structured state-space model (S4) to music generation.

Recently, latent diffusion models have been adapted for audio generation, with models like AudioLDM {cite}`musicgenerationtemplate`, MusicLDM {cite}`musicgenerationtemplate`, Riffusion {cite}`musicgenerationtemplate`, and StableAudio {cite}`musicgenerationtemplate` leading the way. Language model architectures are also advancing this field, with developments in models such as AudioGen {cite}`musicgenerationtemplate`, MusicLM {cite}`musicgenerationtemplate`, and MusicGen {cite}`musicgenerationtemplate`. Text-to-music generation has become a trending topic, particularly in generative and multi-modal learning tasks, with contributions from startups like Suno {cite}`musicgenerationtemplate` and Udio {cite}`musicgenerationtemplate` also driving this area forward.

In this tutorial, we focus on the audio-domain music generation task, specifically on text-to-music generation. This approach aligns closely with traditional signal-based music understanding and music retrieval tasks, and it integrates naturally with language processing, bridging music with natural language inputs.

## Problem Definition

![music_generation_definition](../img/generation/definition.PNG)

The concept of text-to-music generation is illustrated in the figure above, where the model is trained to learn a probability function that maps a given textual input to a music output. The figure includes examples of possible text descriptions: a simple description might consist of keywords like genre, emotion, instrument, or intended purpose. More complex inputs may be full sentences that convey detailed musical information, such as instrument assignments (pink), key and time signature (blue and green), and "clichés" (yellow). The model aims to accurately encode these textual cues and reflect them in the generated music output.


In the following sections, we first introduce the evaluation of music generation. We then go through two representative types of text-to-music models: the autoregressive LM-based architecture (MusicGen {cite}`musicgenerationtemplate`) and the non-autoregressive diffusion-based architecture (StableAudio {cite}`musicgenerationtemplate`). Finally, we will explore some guiding principles and current limitations of text-to-music models, aiming to enhance the interaction between machine-generated music and human creativity.
36 changes: 36 additions & 0 deletions _sources/generation/lmmodel.md
@@ -0,0 +1,36 @@
# MusicGEN

In this section, we take [MusicGEN](https://musicgen.com/) {cite}`musicgenerationtemplate` as an example to introduce auto-regressive modeling for music generation with the transformer architecture {cite}`musicgenerationtemplate`.

## Neural Audio Codec

MusicGEN is an auto-regressive text-to-music generative model. Its input pipeline uses Encodec {cite}`musicgenerationtemplate` to convert time-domain music signals into discrete tokens, known as neural audio codec tokens.

<center><img alt='generation_encodec' src='../_images/generation/encodec.PNG' width='50%' ></center>

As illustrated in the figure above, the Encodec architecture consists of 1D convolutional and 1D deconvolutional blocks in its encoder and decoder networks. The bottleneck block features a multi-step residual vector quantization (RVQ) mechanism, which converts the continuous latent music embeddings from the encoder into discrete audio tokens. The objective of the decoder is to reconstruct the input time-domain signals from these audio tokens. This reconstruction is trained using a combination of different objectives, including L1 loss in the time-domain signals, L1 loss in the mel-spectrogram signals, L2 loss in the mel-spectrogram signals, and adversarial training with multi-resolution STFT discriminators.
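
To make the residual vector quantization step more concrete, here is a toy sketch in plain Python. It is not the Encodec implementation: the two tiny hand-written codebooks, the 2-dimensional latent, and the function names are all assumptions chosen for readability. Each stage quantizes whatever residual the previous stage left behind, and decoding sums the selected entries.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the latent by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # stage 1: coarse entries
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # stage 2: fine corrections
]
latent = [1.3, 0.9]
tokens = rvq_encode(latent, codebooks)  # one token per stage
recon = rvq_decode(tokens, codebooks)
print(tokens, recon)
```

In this toy case the second stage moves the reconstruction from `[1.0, 1.0]` to `[1.25, 1.0]`, closer to the latent; adding stages refines the approximation at the cost of more tokens per frame.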

The pretrained Encodec model preprocesses the music time-domain signals into audio tokens, which serve as one part of the input for the MusicGEN model.

## MusicGEN

MusicGEN utilizes a transformer decoder architecture to predict the next audio token based on the preceding audio tokens, as illustrated by the following probability function:

<center><img alt='generation_musicgen_p1' src='../_images/generation/musicgen_p1.PNG' width='50%' ></center>
The cross-entropy loss is the training objective:
<center><img alt='generation_musicgen_l1' src='../_images/generation/musicgen_l1.PNG' width='50%' ></center>
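
In text form, a standard way to write this auto-regressive factorization and its cross-entropy (negative log-likelihood) objective is given below. The notation is an assumption of this note rather than a transcription of the figures: $x_1, \dots, x_T$ denotes a single stream of audio tokens, $x_{<t}$ the tokens before step $t$, and $\theta$ the transformer parameters. The handling of multiple RVQ codebooks is omitted for simplicity.

$$
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
$$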

<center><img alt='generation_musicgen_arch' src='../_images/generation/musicgen_arch.PNG' width='50%' ></center>

When incorporating text into the music generation task, MusicGEN employs two methods to condition music generation on the text, as illustrated in the figure above:

1. Time-domain Concatenation: utilizing the text tokens generated by the T5 model as prefix tokens preceding the audio tokens, serving as a conditioning mechanism.

2. Cross Attention: forwarding the Keys and Values (K,V) of text tokens, and the Queries (Q) of audio tokens into the cross-attention module of the MusicGEN.
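
The cross-attention route can be sketched with NumPy as below. This is a single-head, unmasked illustration under assumed shapes, not MusicGEN's actual module: the random projection matrices `w_q`, `w_k`, `w_v` and the random hidden states are placeholders. Queries come from the audio-token states, while keys and values come from the text-token states.

```python
import numpy as np


def cross_attention(audio_states, text_states, w_q, w_k, w_v):
    """Single-head cross-attention: audio positions attend over text positions."""
    q = audio_states @ w_q   # (T_audio, d) queries from audio tokens
    k = text_states @ w_k    # (T_text, d) keys from text tokens
    v = text_states @ w_v    # (T_text, d) values from text tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the text axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8
audio_states = rng.normal(size=(5, d))  # 5 audio-token positions
text_states = rng.normal(size=(3, d))   # 3 text-token positions
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = cross_attention(audio_states, text_states, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1, so every audio position receives a convex combination of the text values. Prefix concatenation, by contrast, needs no extra module: the text tokens are simply placed before the audio tokens in the same self-attention sequence.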

The resulting text-conditioned probability function is:
<center><img alt='generation_musicgen_p2' src='../_images/generation/musicgen_p2.PNG' width='50%' ></center>
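
Written out in text, a common form of this conditional factorization is shown below. The symbols are assumptions of this note: $x_t$ is the audio token at step $t$, $x_{<t}$ the preceding audio tokens, $c$ the text condition (for example, the T5 text-token representations), and $\theta$ the model parameters.

$$
p_\theta(x_1, \dots, x_T \mid c) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, c)
$$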



