Update documentation
Unknown committed Nov 25, 2024
1 parent 075e9be commit 0a85daf
Showing 25 changed files with 52 additions and 296 deletions.
Binary file added _images/conds copy.png
Binary file added _images/diff1 copy.png
Binary file added _images/diff2 copy.png
Binary file added _images/diff3 copy.png
Binary file added _images/dit copy.png
Binary file added _images/encodec.PNG
Binary file added _images/evaluation-fid copy.PNG
Binary file added _images/evaluation-is copy.PNG
Binary file added _images/musicgen_arch.PNG
Binary file added _images/musicgen_l1.PNG
Binary file added _images/musicgen_p1.PNG
Binary file added _images/musicgen_p2.PNG
Binary file added _images/timeline copy.PNG
1 change: 1 addition & 0 deletions _sources/conclusion/intro.md
@@ -15,4 +15,5 @@ We're delighted that you've studied these topics with us. Have you achieved your
As a sweet dessert, we've prepared two exciting future directions in the following pages. Don't miss these delightful treats!

Best wishes,

SeungHeon, Ilaria, Zachary, JongWook, Ke
2 changes: 1 addition & 1 deletion _sources/generation/code.ipynb
@@ -83,7 +83,7 @@
},
"outputs": [],
"source": [
"!pip install torch torchaudio torchvision stable-audio-tools einops"
"# !pip install torch torchaudio torchvision stable-audio-tools einops"
]
},
{
10 changes: 5 additions & 5 deletions _sources/generation/diffusionmodel.md
@@ -4,7 +4,7 @@ While we could dedicate an entire tutorial to discussing how diffusion works in

## Diffusion: Continuous Generation through Iterative Refinement

```{figure} ../img/generation/diff1.png
```{figure} ./img/diff1.png
---
name: Generating Continuous-Valued Data
---
@@ -34,7 +34,7 @@

where $\boldsymbol{w}$ is a standard Wiener process (i.e., additive Gaussian noise), $f(\mathbf{x}, t)$ is the *drift* coefficient of $\mathbf{x}_t$, and $g(t)$ is the *diffusion* coefficient. If SDEs seem hard to parse, the important takeaway is that the above equation defines a process over $0\rightarrow T$ that gradually adds noise to our data until only noise is left. For clarity, we will use $p_t(\mathbf{x})$ to denote the marginal probability density of $\mathbf{x}_t$.
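To make this concrete, here is a minimal sketch of simulating such a forward process with an Euler–Maruyama discretization; the linear noise schedule, step count, and variance-preserving choices of $f$ and $g$ are illustrative assumptions, not any specific paper's settings.

```python
import torch

def forward_diffuse(x0, n_steps=1000, beta_min=0.1, beta_max=20.0):
    """Euler-Maruyama simulation of a VP-style forward SDE:
    dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw  (illustrative choice of f and g)."""
    x = x0.clone()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        beta_t = beta_min + t * (beta_max - beta_min)  # linear noise schedule (assumed)
        drift = -0.5 * beta_t * x                      # f(x, t)
        diffusion = beta_t ** 0.5                      # g(t)
        x = x + drift * dt + diffusion * (dt ** 0.5) * torch.randn_like(x)
    return x  # close to isotropic Gaussian noise after the full process
```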

```{figure} ../img/generation/diff2.png
```{figure} ./img/diff2.png
---
name: Forward Diffusion Process
---
Expand All @@ -52,7 +52,7 @@ where $\bar{\boldsymbol{w}}$ is the reverse-time Weiner proccess and notably, $\
Now that we have a way to define the *process* of converting noise to data, we can see that VAEs/GANs implicitly seek to learn a generator that *integrates* the above reverse-time SDE from $T$ to $0$, and thus learn a direct mapping from noise to data.
The strength of diffusion models, however, comes instead from learning a *score model* $s_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}}\log p_t(\mathbf{x})$. Diffusion models then solve the reverse-time SDE iteratively over multiple steps, in a sense walking through the reverse diffusion process at some fixed step size and checking the derivative at each point to determine where to step next. In this way, diffusion models iteratively refine the model output, gradually removing more and more noise from the starting isotropic Gaussian until our data is clear!
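As a rough sketch (our own illustration, not code from the cited works), sampling by integrating the reverse-time SDE with a learned score model could look like the following, where `score_model` stands in for any network approximating $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$:

```python
import torch

@torch.no_grad()
def reverse_sample(score_model, shape, n_steps=1000, beta_min=0.1, beta_max=20.0):
    """Integrate the reverse-time SDE from t=1 (noise) back to t=0 (data)."""
    x = torch.randn(shape)                    # start from isotropic Gaussian noise
    dt = -1.0 / n_steps                       # negative step: we move backwards in time
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        beta_t = beta_min + t * (beta_max - beta_min)
        score = score_model(x, t)             # s_theta(x, t) ~ grad_x log p_t(x)
        drift = -0.5 * beta_t * x - beta_t * score    # [f(x, t) - g(t)^2 * score]
        x = x + drift * dt + (beta_t ** 0.5) * (abs(dt) ** 0.5) * torch.randn_like(x)
    return x
```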

```{figure} ../img/generation/diff3.png
```{figure} ./img/diff3.png
---
name: Diffusion Models vs. VAEs/GANs
---
@@ -77,7 +77,7 @@ Unlike the autoregressive language model approach, the exact input representati

Concerning architecture design, most diffusion models have followed one of two broad categories: U-Nets and Diffusion Transformers (DiTs). In this work, we focus on DiTs, both because most modern diffusion models are adopting this modeling paradigm {cite}`stableaudio,evans2024open,Novack2024PrestoDS` and because DiTs are *much* simpler in terms of code design. A TTM DiT, in general, looks something like this:

```{figure} ../img/generation/dit.png
```{figure} ./img/dit.png
---
name: DiT Architecture
---
@@ -98,7 +98,7 @@ There are now multiple ways $\mathbf{e}_{\textrm{text}}$ can hit the main diffu
4. **Adaptive Layer-Norm (AdaLN)**: Here, the layer-norms in each DiT block are augmented with shift, scale, and gate parameters (one for each index of the hidden dimension) that are learned from $\mathbf{e}_{\textrm{text}}$ through a small MLP: $\gamma_{\textrm{shift}}, \gamma_{\textrm{scale}}, \gamma_{\textrm{gate}} = \textrm{MLP}(\mathbf{e}_{\textrm{text}})$. This adds the least computation to the model, and is what the original DiT work uses {cite}`peebles2023scalable`. Note that in spirit, this is practically identical to the Feature-wise Linear Modulation (FiLM) layers used in MusicLDM {cite}`chen2023musicldm`. These shift, scale, and gate parameters can also be zero-initialized such that each block is essentially initialized as the identity function, which is what "AdaLN-Zero" refers to (see the sketch below).
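Below is a minimal, illustrative sketch of AdaLN-Zero-style modulation around one DiT sub-layer; the module and argument names are our own assumptions, not code from the original DiT paper.

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """Predict per-channel shift/scale/gate from the text embedding (zero-initialized)."""
    def __init__(self, cond_dim, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 3 * hidden_dim))
        nn.init.zeros_(self.mlp[1].weight)    # zero-init so each block starts as identity
        nn.init.zeros_(self.mlp[1].bias)

    def forward(self, e_text):                # e_text: (batch, cond_dim)
        return self.mlp(e_text).chunk(3, dim=-1)

def adaln_sublayer(x, norm, adaln, e_text, sublayer):
    """x: (batch, seq, hidden); sublayer: the block's attention or MLP."""
    shift, scale, gate = adaln(e_text)
    h = norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
    return x + gate.unsqueeze(1) * sublayer(h)   # gated residual update
```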


```{figure} ../img/generation/conds.png
```{figure} ./img/conds.png
---
name: Types of DiT conditioning mechanisms
---
4 changes: 2 additions & 2 deletions _sources/generation/evaluation.md
@@ -32,7 +32,7 @@ The Inception Score (IS) is designed to evaluate the diversity and distinctivene
2. Calculate the average embedding of all generated outputs;
3. Compute the Inception Score using the Kullback-Leibler (KL) Divergence between the embeddings:

![evaluation_is](../img/generation/evaluation-is.PNG)
![evaluation_is](./img/evaluation-is.PNG)

The first term of the KL Divergence represents the entropy of the embedding distribution, serving as an effective indicator of classification results. A high IS indicates that each embedding is distinct, as the representation model can confidently assign a unique label to each generated output.
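As a rough sketch under simplifying assumptions (here `probs` is an (N, n_classes) array of per-example class probabilities from a pretrained classifier), the IS computation could be written as:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, n_classes) classifier probabilities p(y|x) for N generated samples."""
    p_y = probs.mean(axis=0, keepdims=True)                 # marginal p(y) over the set
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))  # KL(p(y|x) || p(y)) per sample
    return float(np.exp(kl.sum(axis=1).mean()))             # IS = exp(E_x[KL])
```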

@@ -46,7 +46,7 @@ The Fréchet Inception Distance (FID) {cite}`DBLP:conf/nips/HeuselRUNH17`, adapt
2. Calculate the average embedding of all generated outputs, the average embedding of the reference data, the covariance matrix of all generated outputs, and the covariance matrix of the reference data;
3. The FID/FAD is then computed using these values:

![evaluation_is](../img/generation/evaluation-fid.PNG)
![evaluation_is](./img/evaluation-fid.PNG)

The key difference between IS and FID/FAD is that while IS evaluates the distribution of generated outputs, FID/FAD compares this distribution against that of real data, providing a more comprehensive measure of generation quality.
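A minimal sketch of the Fréchet distance computation over embeddings, assuming NumPy/SciPy and leaving the choice of embedding model open, might look like:

```python
import numpy as np
from scipy import linalg

def frechet_distance(gen_emb, ref_emb):
    """gen_emb, ref_emb: (N, d) embeddings of generated and reference audio."""
    mu_g, mu_r = gen_emb.mean(axis=0), ref_emb.mean(axis=0)
    cov_g = np.cov(gen_emb, rowvar=False)
    cov_r = np.cov(ref_emb, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_g @ cov_r, disp=False)   # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                              # drop tiny imaginary parts
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g + cov_r - 2.0 * covmean))
```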

4 changes: 2 additions & 2 deletions _sources/generation/intro.md
@@ -5,7 +5,7 @@

The history of music generation dates back to the 1970s, originating as algorithmic composition {cite}`hiller1979emc`. By the 1990s, researchers began applying neural networks to symbolic music generation {cite}`todd1989aca`. Simultaneously, real-time interactive art creation started incorporating music accompaniment, blending generative music with dynamic artistic expression {cite}`rovan1997igms`.

![music_generation_timeline](../img/generation/timeline.PNG)
![music_generation_timeline](./img/timeline.PNG)

Since 2015, the exploration of deep-learning models in symbolic and audio-domain music generation has grown rapidly, as shown in the timeline above.
Researchers at Google applied recurrent neural networks (RNNs) to melody generation, encoding melodic notes as distinct states of pitch and duration to enable predictive modeling {cite}`performancernn2017`.
@@ -25,7 +25,7 @@ In this tutorial, we focus on the audio-domain music generation task, specifical

## Problem Definition

![music_generation_definition](../img/generation/definition.PNG)
![music_generation_definition](./img/definition.PNG)

The concept of text-to-music generation is illustrated in the figure above, where the model is trained to learn a probability function that maps a given textual input to a music output. The figure includes examples of possible text descriptions: a simple description might consist of keywords like genre, emotion, instrument, or intended purpose. More complex inputs may be full sentences that convey detailed musical information, such as instrument assignments (pink), key and time signature (blue and green), and "clichés" (yellow). The model aims to accurately encode these textual cues and reflect them in the generated music output.

10 changes: 5 additions & 5 deletions _sources/generation/lmmodel.md
@@ -6,7 +6,7 @@ In this section, we take [MusicGEN](https://musicgen.com/) {cite}`copet2024simpl

MusicGEN is an auto-regressive text-to-music generative model. For its input, MusicGEN leverages Encodec {cite}`DBLP:journals/tmlr/DefossezCSA23` to process music time-domain signals into discrete neural audio codec tokens.

<center><img alt='generation_encodec' src='../_images/generation/encodec.PNG' width='50%' ></center>
<center><img alt='generation_encodec' src='../_images/encodec.PNG' width='50%' ></center>

As illustrated in the figure above, the Encodec architecture consists of 1D convolutional and 1D deconvolutional blocks in its encoder and decoder networks. The bottleneck block features a multi-step residual vector quantization (RVQ) mechanism, which converts the continuous latent music embeddings from the encoder into discrete audio tokens. The objective of the decoder is to reconstruct the input time-domain signals from these audio tokens. This reconstruction is trained using a combination of objectives: L1 loss on the time-domain signal, L1 and L2 losses on the mel-spectrogram, and adversarial training with multi-resolution STFT discriminators.
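To give a feel for the RVQ bottleneck, here is a highly simplified sketch of multi-stage residual quantization with nearest-neighbour codebook lookup; it is an illustration under our own assumptions, not Encodec's actual implementation (which also learns and updates the codebooks):

```python
import torch

def residual_vector_quantize(z, codebooks):
    """z: (batch, frames, dim) continuous latents; codebooks: list of (K, dim) tensors.
    Returns per-stage token indices and the quantized latents."""
    residual = z
    quantized = torch.zeros_like(z)
    all_indices = []
    for codebook in codebooks:                          # one stage per RVQ level
        dists = torch.cdist(residual, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
        indices = dists.argmin(dim=-1)                  # nearest code per frame
        codes = codebook[indices]                       # (batch, frames, dim)
        quantized = quantized + codes
        residual = residual - codes                     # next stage quantizes the leftover
        all_indices.append(indices)
    return torch.stack(all_indices, dim=-1), quantized  # indices: (batch, frames, n_stages)
```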

@@ -16,11 +16,11 @@ The pretrained Encodec model preprocesses the music time-domain signals into aud

MusicGEN utilizes a transformer decoder architecture to predict the next audio token based on the preceding audio tokens, as illustrated by the following probability function:

<center><img alt='generation_musicgen_p1' src='../_images/generation/musicgen_p1.PNG' width='50%' ></center>
<center><img alt='generation_musicgen_p1' src='../_images/musicgen_p1.PNG' width='50%' ></center>
The cross-entropy loss is the training objective:
<center><img alt='generation_musicgen_l1' src='../_images/generation/musicgen_l1.PNG' width='50%' ></center>
<center><img alt='generation_musicgen_l1' src='../_images/musicgen_l1.PNG' width='50%' ></center>
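A schematic of this next-token objective over codec tokens (not MusicGEN's actual code; `model` is assumed to return per-position logits over the codebook vocabulary) might look like:

```python
import torch.nn.functional as F

def next_token_loss(model, audio_tokens):
    """audio_tokens: (batch, seq) discrete codec tokens from the pretrained Encodec."""
    inputs, targets = audio_tokens[:, :-1], audio_tokens[:, 1:]
    logits = model(inputs)                              # (batch, seq - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```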

<center><img alt='generation_musicgen_arch' src='../_images/generation/musicgen_arch.PNG' width='50%' ></center>
<center><img alt='generation_musicgen_arch' src='../_images/musicgen_arch.PNG' width='50%' ></center>

When incorporating text into the music generation task, MusicGEN employs two methods to condition generation on the text, as illustrated in the figure above:

@@ -29,7 +29,7 @@
2. Cross Attention: forwarding the Keys and Values (K, V) of the text tokens and the Queries (Q) of the audio tokens into the cross-attention module of MusicGEN.

The new probability function becomes:
<center><img alt='generation_musicgen_p2' src='../_images/generation/musicgen_p2.PNG' width='50%' ></center>
<center><img alt='generation_musicgen_p2' src='../_images/musicgen_p2.PNG' width='50%' ></center>
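Both conditioning paths can be sketched roughly as follows; the module names are hypothetical, and real implementations add projections, masking, and per-codebook handling:

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Audio hidden states attend to text embeddings (K, V from text; Q from audio)."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, audio_h, text_emb):
        out, _ = self.attn(query=audio_h, key=text_emb, value=text_emb)
        return audio_h + out                            # residual connection

def prepend_condition(text_emb, audio_emb):
    """The alternative path: prepend text embeddings to the audio token embeddings."""
    return torch.cat([text_emb, audio_emb], dim=1)      # (batch, text_len + seq, dim)
```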



4 changes: 2 additions & 2 deletions conclusion/intro.html
@@ -433,8 +433,8 @@ <h1>Conclusion<a class="headerlink" href="#conclusion" title="Link to this headi
<p>In Chapter 5, we reviewed two prominent text-to-music generation methods: discrete token-based language models and diffusion-based generative models operating in continuous space. We also conducted an in-depth discussion about the importance of evaluation and current challenges in evaluation methodologies.</p>
<p>We’re delighted that you’ve studied these topics with us. Have you achieved your learning goals? Were your questions answered? We hope we’ve succeeded in our aims: making these complex topics more accessible to newcomers, providing practical solutions for data challenges, and bridging the gap between academic research and practical applications. Please don’t hesitate to reach out if you have any questions or feedback.</p>
<p>As a sweet dessert, we’ve prepared two exciting future directions in the following pages. Don’t miss these delightful treats!</p>
<p>Best wishes,
SeungHeon, Ilaria, Zachary, JongWook, Ke</p>
<p>Best wishes,</p>
<p>SeungHeon, Ilaria, Zachary, JongWook, Ke</p>
</section>

<script type="text/x-thebe-config">
2 changes: 1 addition & 1 deletion description/code.html
Large diffs are not rendered by default.

