diff --git a/_sources/generation/beyondtext.md b/_sources/generation/beyondtext.md index 990222a..4025ddd 100644 --- a/_sources/generation/beyondtext.md +++ b/_sources/generation/beyondtext.md @@ -1 +1,11 @@ -# Beyond Text-Based Interactions \ No newline at end of file +# Beyond Text-Based Interactions + +To wrap up this tutorial, there's a sort of elephant in the room that has been cropping up more and more in TTM generation discussions: do we *even want text* as a control method for music generation? This is often paired with a well-circulated and unattributed quote: + +> *"Writing about Music is like Dancing about Architecture"* + +To put this more formally, text's main failure as a control medium stems from its lack of specificity and its inability to address fine-grained, musically salient details: + +1. **Low Specificity**: As most text caption datasets are generally bootstrapped from metadata, which itself contains only high-level information like genre, mood, function, and at best instrument-level tags, the overall mapping from text to music becomes a strongly one-to-many function (as an exercise, one can imagine the number of songs that might fit the caption "exciting and upbeat rock music with drums and guitar"). This means that TTMs rarely learn to follow text for anything more than genre-level correlations, as within a given high-level mode all the text captions are similar in terms of musical content. +2. **Inability to Address Fine-Grained Details**: While text is great at describing high-level information and even instrument classes, its overall resolution is quite coarse, and it fails at describing fine-grained features that change at high temporal resolutions. This is particularly bad for music, as many musical controls (volume, melody, chords, rhythm) require fine temporal resolution to describe and control accurately. +3. **Mismatch with Musically Salient Use-Cases**: \ No newline at end of file diff --git a/_sources/generation/diffusionmodel.md b/_sources/generation/diffusionmodel.md index 51a24a3..00069aa 100644 --- a/_sources/generation/diffusionmodel.md +++ b/_sources/generation/diffusionmodel.md @@ -1 +1,64 @@ -# StableAudio - Diffusion-based Model \ No newline at end of file +# Diffusion Model-based Text-to-Music Generation + +While we could dedicate an entire tutorial to discussing how diffusion works in the context of generative audio (and in fact, others have done so this year at ISMIR!), here we present a condensed review of diffusion before jumping into how text conditioning can be built into these models, using [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) as a case study. + +## Diffusion: Continuous Generation through Iterative Refinement +
generation_diff1
+ + Unlike in the LM-based case where we wish to generate *discrete* tokens $x \in \mathbb{N}$, the goal of diffusion is to generate some *continuous*-valued data $x \in \mathbb{R}$ (which is identical to the goal of more classical models like VAEs and GANs). + +Formally, if our data comes from some distribution $\mathbf{x} \sim p(\mathbf{x})$, then the goal is to learn some model that allows us to sample from this distribution, $p_\theta(\mathbf{x}) \approx p(\mathbf{x})$. Practically speaking, in order to sample from the data distribution, we parameterize our model as some generator $G_\theta$ such that: +$$\mathbf{x} = G_\theta(z), \quad z \sim \mathcal{N}(0, \boldsymbol{I}),$$ +i.e. we learn some model that transforms isotropic Gaussian noise into our target data. + +One of the **main** reasons why diffusion models have been so successful at many generative media tasks over these classical models {cite}`dhariwal2021diffusion` (and why they are more controllable) is their capacity for **iterative refinement**. In the above equation, the entire generation process occurs in a single model call. While this is certainly efficient (and many diffusion models have been conceptually reinventing GANs to capitalize on their efficiency), this is *a lot* of work to do in a single pass of the model, especially for high-dimensional data! What would be useful is if we had a way to generate *part* of $x$ in a given model call, and then call the model multiple times to fully generate $x$ (and if you're paying attention, this sounds eerily similar to autoregression). + +In order to build our "multi-step generator", we have to introduce the concept of first *corrupting* our data into noise (note: while this step doesn't fit as cleanly into our condensed diffusion intro, we encourage readers to check out more complete diffusion writeups that motivate the paradigm through a wider lens {cite}`Song2020ScoreBasedGM`). Formally, we'll first adapt our notation to model a *diffusion* process from clean data to noise, indexed by the *timestep* $0\rightarrow T$, where $\mathbf{x}_0 \sim p_0(\mathbf{x}_0)$ is our clean data (i.e. $\mathbf{x} \sim p(\mathbf{x})$ previously) and $\mathbf{x}_T \sim p_T(\mathbf{x}_T)$ is pure Gaussian noise (i.e. $z$ previously). Then, we can define a diffusion process that gradually turns our clean data $\mathbf{x}_0$ into Gaussian noise $\mathbf{x}_T$ through the stochastic differential equation (SDE): +$$\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\mathrm{d}t + g(t)\mathrm{d}\boldsymbol{w},$$ +where $\boldsymbol{w}$ is a standard Wiener process (i.e. additive Gaussian noise), $f(\mathbf{x}, t)$ is the *drift* coefficient of $\mathbf{x}_t$, and $g(t)$ is the *diffusion* coefficient. For clarity, we will use $p_t(\mathbf{x})$ to denote the probability density of $\mathbf{x}_t$. +
generation_diff2
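To make the forward (corruption) process concrete, here is a minimal sketch of the standard discrete-time, VP-style noising step $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$, which is one common discretization of the SDE above; the shapes, schedule, and step count below are invented for illustration and are not Stable Audio Open's actual configuration.

```python
import torch

# Hypothetical shapes: a batch of 4 one-dimensional latents (256 frames x 64 channels).
# These numbers are illustrative only, not any real model's configuration.
x0 = torch.randn(4, 256, 64)   # "clean" data x_0
T = 1000                       # number of discrete diffusion steps

# Simple linear beta schedule; alpha_bar_t tracks how much signal survives at step t.
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0): blend the clean data with Gaussian noise."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, 1, 1)            # broadcast over frames and channels
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

t = torch.randint(0, T, (x0.shape[0],))          # a random timestep per example
x_t = q_sample(x0, t)                            # noisier and noisier as t -> T
```

Training a denoiser/score network then amounts to predicting the added noise (or, equivalently, the score) from $(\mathbf{x}_t, t)$.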
+ + The reason this is relevant at all is that a clever result from {cite}`anderson1982reverse` allows us to define a *reverse* diffusion process that transforms Gaussian noise back into data, given by: +$$\mathrm{d}\mathbf{x} = [f(\mathbf{x}, t) - g(t)^2\nabla_{\mathbf{x}}\log p_t(\mathbf{x})]\mathrm{d}t + g(t)\mathrm{d}\bar{\boldsymbol{w}},$$ +where $\bar{\boldsymbol{w}}$ is the reverse-time Wiener process and, notably, $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$ is the *score function* of the marginal probability distribution of $\mathbf{x}_t$. In words, the score function defines a direction pointing towards higher-density regions of the data distribution, which you can imagine as something like taking the derivative of a 1-D curved path, but in high-dimensional space. + +Now that we have a way to define the *process* of converting noise to data, we can see that VAEs/GANs implicitly seek to learn a generator that *integrates* the above reverse-time SDE from $T$ to $0$ in one shot, and thus learn a direct mapping from noise to data. +The strength of diffusion models, however, comes instead from learning a *score model* $s_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}}\log p_t(\mathbf{x})$ and iteratively solving the reverse-time SDE in multiple steps, in a sense walking through the reverse diffusion process at some fixed step size and checking the derivative at each point to determine where we should step next. In this way, diffusion models are able to iteratively refine the model output, gradually removing more and more noise from the starting isotropic Gaussian until only clean data remains! +
generation_diff3
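As a sketch of this iterative refinement, the loop below runs a simple Euler–Maruyama discretization of the reverse-time SDE for the VP case, where $f(\mathbf{x},t) = -\tfrac{1}{2}\beta(t)\mathbf{x}$ and $g(t) = \sqrt{\beta(t)}$. Everything here is an illustrative placeholder: the schedule and step count are made up, and `score_model` is just the analytic score of a standard Gaussian standing in for a trained $s_\theta$, so this is not the sampler any real TTM system ships with.

```python
import torch

def beta(t):
    """Illustrative linear noise schedule beta(t) for t in [0, 1]."""
    return 0.1 + (20.0 - 0.1) * t

def score_model(x, t):
    """Placeholder for a trained score network s_theta(x, t) ~ grad_x log p_t(x)."""
    return -x   # exact score of a standard Gaussian, used as a stand-in

@torch.no_grad()
def reverse_sde_sample(shape, n_steps=200):
    x = torch.randn(shape)                              # start from x_T ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = torch.tensor(i / n_steps)
        drift = -0.5 * beta(t) * x                      # f(x, t) for the VP SDE
        diffusion = beta(t).sqrt()                      # g(t)
        # Reverse-time Euler-Maruyama step: dx = [f - g^2 * score] dt + g dw_bar
        x = x - (drift - diffusion**2 * score_model(x, t)) * dt
        if i > 1:                                       # no noise injected on the final step
            x = x + diffusion * (dt ** 0.5) * torch.randn_like(x)
    return x

sample = reverse_sde_sample((4, 256, 64))   # same illustrative latent shape as before
```

Swapping in a trained $s_\theta$ (and usually a smarter solver) gives a practical sampler, but the structure (evaluate the score, take a small step toward the data, repeat) is the whole idea.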
+ + If this all sounds like some weird version of how LMs perform autoregression, you'd be about right! Sander Dieleman has a fantastic [blog post](https://sander.ai/2024/09/02/spectral-autoregression.html) on this conceptual similarity, and how one can imagine diffusion as being autoregression, but in the *spectral* domain. + +This concludes our intro on diffusion models, and while there's a lot of math here, as long as you understand the core idea that diffusion models approximate the ***gradient*** of the path from noise to data (rather than learning the path itself), you should be fine proceeding through this tutorial! + + +## Representation + +Unlike the autoregressive language model approach, the exact input representation for diffusion-based TTM generation has varied a great deal since diffusion hit the scene in 2021. Below we list them, in *rough* chronological order: + +1. Direct waveform modeling: $\mathbf{x}_0 \in \mathbb{R}^{f_s T \times 1}$, where $f_s$ is the audio's sampling rate and $T$ is the overall time in seconds. In words, we directly perform the diffusion process on the raw audio signal. This input representation is generally ***not*** used, both because raw audio signals can get quite large (just 30 seconds of 44.1 kHz audio is over 1M floats!) and because diffusion just doesn't work as well on raw audio signals (and there's [good reason](https://sander.ai/2024/09/02/spectral-autoregression.html) for this). +2. Direct (mel)-Spectrogram modeling {cite}`zhu2023edmsound, Novack2024Ditto, Novack2024DITTO2DD, wu2023music`: $\mathbf{x}_0 \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the height and width of the audio (mel)-Spectrogram and $C$ is the number of channels (normally this is just 1, but if using complex spectrograms this can be 2). In this way, TTM diffusion proceeds almost identically to non-latent image diffusion, as we simply treat the audio spectrograms as "images" and run diffusion on these now-2D signals. As we cannot directly convert mel spectrograms back to audio, these models generally train their own {cite}`Zhu2024MusicHiFiFH` or use an off-the-shelf {cite}`wu2023music` *vocoder* $V(\mathbf{x}_0) : \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{f_s T \times 1}$ to translate the generated mel spectrogram back to audio. +3. Latent (mel)-Spectrogram modeling {cite}`liu2023audioldm, liu2023audioldm2, chen2023musicldm, forsgren2022riffusion`: $\mathbf{x}_0 \in \mathbb{R}^{D_h \times D_w \times D_c}$, where $D_h, D_w, D_c$ are the ***latent*** height, width, and number of channels after passing the spectrogram through a 2D **autoencoder** (and in general, $D_h \ll H, D_w \ll W$ for efficiency while $D_c > C$). This is perhaps the first design to really break onto the TTM generation scene, with {cite}`forsgren2022riffusion` using the existing Stable Diffusion autoencoder and finetuning SD on spectrograms. This approach thus requires training a separate VAE $\mathcal{E}, \mathcal{D}$ before training the TTM diffusion model. Once trained, sampling from the model involves generating the latent representation with diffusion, passing it through the decoder $\mathcal{D}(\mathbf{x}_0): \mathbb{R}^{D_h \times D_w \times D_c} \rightarrow \mathbb{R}^{H \times W \times C}$ and *then* passing that output through the vocoder $V$. +4. 
DAC-Style Latent Audio modeling {cite}`stableaudio, evans2024open,Novack2024PrestoDS`: $\mathbf{x}_0 \in \mathbb{R}^{D_T \times 1 \times D_c}$, where $D_T$ is the length of the compressed audio signal. Here we circumvent the vocoder and spectrogram VAE and instead use a **raw-audio VAE** to directly compress the audio into a 1D (but multi-channel, as $D_c$ is normally 32/64/96) latent sequence. Practically, this ends up being nearly identical to the discrete LM codecs like Encodec {cite}`defossez2022highfi` or DAC {cite}`kumar2023high`, with the only difference being that the discrete vector quantization is replaced with a standard VAE KL regularization, thus giving us **continuous-valued latents** rather than discrete tokens. In fact, much of the rest of the training process and architecture remains the same (i.e. fully convolutional 1D encoder/decoder with snake activations, multi-resolution STFT discriminators, etc.). Thus, to sample from the model, we generate the latent representation and directly pass it through the decoder $\mathcal{D}$ to get the audio output. For the rest of the tutorial, we'll focus on this representation, as it is what Stable Audio Open uses. + +## Architecture + +Concerning architecture design, most diffusion models have followed one of two broad designs: U-Nets and Diffusion Transformers (DiTs). In this work, we focus on DiTs, both because most modern diffusion models are adopting this modeling paradigm {cite}`stableaudio,evans2024open,Novack2024PrestoDS` and because DiTs are *much* simpler in terms of code design. A TTM DiT, in general, looks something like this: + +
generation_dit
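Before walking through the figure in words, a quick shape-level sketch may help ground it; every size below (latent length, channel count, patch size, hidden width, text-token count) is invented for illustration and does not correspond to Stable Audio Open's actual configuration.

```python
import torch
import torch.nn as nn

B, D_T, D_c = 4, 256, 64     # batch, latent length, latent channels (DAC-style 1D latent)
d_model, patch = 768, 2      # DiT hidden width and temporal patch size (invented values)

x = torch.randn(B, D_T, D_c)                  # continuous latent from the raw-audio VAE encoder

# "Patchify": merge `patch` adjacent latent frames, then project to the DiT width.
patchify = nn.Linear(D_c * patch, d_model)
x_tokens = patchify(x.reshape(B, D_T // patch, D_c * patch))    # (B, 128, 768)

# Timestep / noise-level embedding via a small MLP (sinusoidal featurizers are also common).
t = torch.rand(B, 1)                                            # noise level in [0, 1]
t_emb = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                      nn.Linear(d_model, d_model))(t)           # (B, 768)

# Token-level text embedding from a frozen encoder (e.g. T5/CLAP), projected to the DiT width.
e_text = nn.Linear(512, d_model)(torch.randn(B, 77, 512))       # (B, 77, 768)
```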
+ + In words, after the input latent representation is converted to "patches" (i.e. downsampling it further) and the input conditions (i.e. text, more on that later) and timestep/noise level are converted to their corresponding embeddings, a DiT is simply a stack of bidirectional transformer encoder blocks (i.e. like BERT) operating on this latent representation (with the conditioning providing some form of modulation), followed by a final linear and de-patchifying layer to get our prediction. DiTs have a number of nice scaling properties over U-Nets, are able to handle variable-length sequences a bit better, and notably allow for much cleaner code given the lack of manual residual down/up-sampling blocks. + +## Conditioning + +The big question now is: how does the text conditioning actually do anything in the model? Before hitting the model, the text prompt $\mathbf{c}_{\textrm{text}}$ first has to be converted from a string to some numerical embedding, which we'll call $\mathbf{e}_{\textrm{text}} = \textrm{Emb}(\mathbf{c}_{\textrm{text}})$, where $\textrm{Emb}$ is some embedding extraction function. In many cases, $\textrm{Emb}$ uses a pre-trained text backbone (such as CLAP or T5), followed by 1 or more linear layers to project the embedding to the correct size. After embedding, $\mathbf{e}_{\textrm{text}}$ is either a global embedding $\mathbb{R}^{d}$ or a sequence-level embedding $\mathbb{R}^{d \times \ell}$, where $d$ is the hidden dimension of the DiT and $\ell$ denotes the token length of the text embedding (i.e. the text embedding can be extracted per-token, as is the case with T5). + +There are now multiple ways $\mathbf{e}_{\textrm{text}}$ can hit the main diffusion latent inside the model (which can all be combined), so we'll go over a few of them (with small code sketches after the list): + +1. **Time-Domain Concatenation** (aka In-Context Conditioning or Prefix Conditioning): Here, we simply append the text condition to the diffusion latent sequence to get some new latent $\hat{\mathbf{x}} = [\mathbf{x}, \mathbf{e}_{\textrm{text}}] \in \mathbb{R}^{(D_T + \ell) \times 1 \times d}$, where $[\cdot]$ is the concatenation operation along the *time* axis (i.e. the sequence gets longer), and remove the extra token(s) after all the DiT blocks. In this way, the text operates on the diffusion latent through the DiT's self-attention blocks only, and causes minimal to moderate slowdowns depending on the length of the text embedding. +2. **Channel-Wise Concatenation**: Here, $\mathbf{e}_{\textrm{text}}$ is first projected to have the same *sequence* length as the main diffusion latent, and is then concatenated along the *channel dimension*, $\hat{\mathbf{x}} = [\mathbf{x}, \textrm{Proj}(\mathbf{e}_{\textrm{text}})] \in \mathbb{R}^{D_T \times 1 \times 2d}$ (i.e. the sequence gets *deeper* in a sense). This is generally not used that much for text conditioning (but is great for other conditions), as it imbues the text with a sort of temporality that does not exist for global captions. +3. **Cross-Attention**: Here, we add additional cross-attention layers interleaved with the self-attention layers inside each DiT block, where the diffusion latent directly attends to $\mathbf{e}_{\textrm{text}}$. This perhaps offers the best control (and is what Stable Audio Open uses), at the cost of the most added compute given the quadratic cost of each cross-attention layer. +4. 
**Adaptive Layer-Norm (AdaLN)**: Here, the layer-norms in each DiT block are augmented with shift, scale, and gate parameters (one for each index of the hidden dimension) that are learned from $\mathbf{e}_{\textrm{text}}$ through a small MLP: $\gamma_{\textrm{shift}}, \gamma_{\textrm{scale}}, \gamma_{\textrm{gate}} = \textrm{MLP}(\mathbf{e}_{\textrm{text}})$. This adds the least computation to the model, and is what the original DiT work uses {cite}`peebles2023scalable`. Note that in spirit, this is practically identical to the Feature-wise Linear Modulation (FiLM) layers used in MusicLDM {cite}`chen2023musicldm`. + +
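As a quick shape check on the first two mechanisms, the snippet below spells out both concatenations; the dimensions are invented for illustration and do not correspond to any particular released model.

```python
import torch
import torch.nn as nn

B, D_T, d, L = 4, 128, 768, 77                 # invented sizes for illustration
x = torch.randn(B, D_T, d)                     # patchified diffusion latent
e_text = torch.randn(B, L, d)                  # token-level text embedding

# 1. Time-domain (prefix) concatenation: the sequence gets longer.
x_prefix = torch.cat([x, e_text], dim=1)       # (B, D_T + L, d)
# ...run the DiT blocks on x_prefix, then drop the final L positions afterwards.

# 2. Channel-wise concatenation: project the text onto the latent's sequence length,
#    then stack along the channel axis so the sequence gets "deeper".
to_seq = nn.Linear(L, D_T)                                  # illustrative projection over tokens
e_seq = to_seq(e_text.transpose(1, 2)).transpose(1, 2)      # (B, D_T, d)
x_channel = torch.cat([x, e_seq], dim=-1)                   # (B, D_T, 2d)
# (a following linear layer would typically map 2d back to d before the blocks)
```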
generation_conds
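Finally, to tie the conditioning options back to the architecture, here is a skeletal DiT block that combines AdaLN modulation from a global (timestep plus pooled text) embedding with cross-attention to the token-level text embedding. Combining both mechanisms in a single block, as well as every layer size here, is an illustrative choice for this sketch rather than Stable Audio Open's exact implementation.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Skeletal DiT block: AdaLN modulation + self-attention + cross-attention + MLP."""

    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        # AdaLN: shift/scale/gate for the self-attention and MLP branches (3 x 2 = 6 chunks).
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(d_model, 6 * d_model))

    def forward(self, x, cond_global, cond_tokens):
        # x: (B, T, d) diffusion latent tokens; cond_global: (B, d) timestep (+ pooled text)
        # embedding; cond_tokens: (B, L, d) per-token text embeddings.
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(cond_global).chunk(6, dim=-1)

        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)   # AdaLN
        x = x + gate1.unsqueeze(1) * self.self_attn(h, h, h)[0]               # gated self-attn

        h = self.norm2(x)
        x = x + self.cross_attn(h, cond_tokens, cond_tokens)[0]               # cross-attn to text

        h = self.norm3(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)   # AdaLN
        x = x + gate2.unsqueeze(1) * self.mlp(h)                              # gated MLP
        return x

block = DiTBlock()
out = block(torch.randn(2, 128, 768), torch.randn(2, 768), torch.randn(2, 77, 768))
```

A full DiT would stack many such blocks between the patchify layer and the final linear/de-patchify layer sketched earlier.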
\ No newline at end of file diff --git a/_sources/generation/intro.md b/_sources/generation/intro.md index d65d031..256c775 100644 --- a/_sources/generation/intro.md +++ b/_sources/generation/intro.md @@ -14,12 +14,12 @@ Style transfer for specific composers was achieved in DeepBach {cite}`musicgener Breakthroughs in deep generative models soon led to three notable symbolic music generation models, namely MuseGAN {cite}`musicgenerationtemplate`, Music Transformer {cite}`musicgenerationtemplate`, and MusicVAE {cite}`musicgenerationtemplate`, emerging almost simultaneously between 2018 and 2020. These architectures paved the way for subsequent models focused on higher quality, efficiency, and greater control, such as REMI {cite}`musicgenerationtemplate`, SketchNet {cite}`musicgenerationtemplate`, PianotreeVAE {cite}`musicgenerationtemplate`, Multitrack Music Transformer {cite}`musicgenerationtemplate` and others. -Recently, the development of diffusion model {cite}`musicgenerationtemplate` and the masked generative model {cite}`musicgenerationtemplate` have introduced new paradigms for symbolic music generation. Models such as VampNet {cite}`musicgenerationtemplate` and Polyfussion {cite}`musicgenerationtemplate` have expanded the possibilities and inspired further innovation in this field. Additionally, the Anticipatory Music Transformer {cite}`musicgenerationtemplate` leverages language model architectures to achieve impressive performance across a broad spectrum of symbolic music generation tasks. +Recently, the development of diffusion model {cite}`musicgenerationtemplate` and the masked generative model {cite}`musicgenerationtemplate` have introduced new paradigms for symbolic music generation. Models such as Polyfussion {cite}`musicgenerationtemplate` have expanded the possibilities and inspired further innovation in this field. Additionally, the Anticipatory Music Transformer {cite}`musicgenerationtemplate` leverages language model architectures to achieve impressive performance across a broad spectrum of symbolic music generation tasks. Compared to the symbolic music domain, music generation in the audio domain, which focuses on directly generating musical signals, initially faced challenges in generation quality due to data limitations, model architecture constraints, and computational bottlenecks. Early audio generation research primarily focused on speech, exemplified by models like WaveNet {cite}`musicgenerationtemplate` and SampleRNN {cite}`musicgenerationtemplate`. Nsynth {cite}`musicgenerationtemplate`, developed by Google Magenta, marked the first project to synthesize musical signals, which later evolved into DDSP {cite}`musicgenerationtemplate`. OpenAI introduced JukeBox {cite}`musicgenerationtemplate` to generate music directly from the model without relying on synthesis tools from symbolic music notes. SaShiMi {cite}`musicgenerationtemplate` applied the structured state-space model (S4) on music generation. -Recently, latent diffusion models have been adapted for audio generation, with models like AudioLDM {cite}`musicgenerationtemplate`, MusicLDM {cite}`musicgenerationtemplate`, Riffusion {cite}`musicgenerationtemplate`, and StableAudio {cite}`musicgenerationtemplate` leading the way. Language model architectures are also advancing this field, with developments in models such as AudioGen {cite}`musicgenerationtemplate`, MusicLM {cite}`musicgenerationtemplate`, and MusicGen {cite}`musicgenerationtemplate`. 
Text-to-music generation has become a trending topic, particularly in generative and multi-modal learning tasks, with contributions from startups like Suno {cite}`musicgenerationtemplate` and Udio {cite}`musicgenerationtemplate` also driving this area forward. +Recently, latent diffusion models have been adapted for audio generation, with models like AudioLDM {cite}`musicgenerationtemplate`, MusicLDM {cite}`musicgenerationtemplate`, Riffusion {cite}`musicgenerationtemplate`, and StableAudio {cite}`musicgenerationtemplate` leading the way. Language model architectures are also advancing this field, with developments in models such as AudioGen {cite}`musicgenerationtemplate`, MusicLM {cite}`musicgenerationtemplate`, VampNet {cite}`musicgenerationtemplate`, and MusicGen {cite}`musicgenerationtemplate`. Text-to-music generation has become a trending topic, particularly in generative and multi-modal learning tasks, with contributions from startups like Suno {cite}`musicgenerationtemplate` and Udio {cite}`musicgenerationtemplate` also driving this area forward. In this tutorial, we focus on the audio-domain music generation task, specifically on text-to-music generation. This approach aligns closely with traditional signal-based music understanding, music retrieval tasks, and integrates naturally with language processing, bridging music with natural language inputs. diff --git a/bibliography.html b/bibliography.html index 2be8b06..46dc2ef 100644 --- a/bibliography.html +++ b/bibliography.html @@ -207,7 +207,7 @@
  • @@ -491,6 +491,10 @@

    Bibliography[DMP18]

    Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.

    +
    +[DCSA22] +

    Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv:2210.13438, 2022.

    +
    [ELBMG07]

    Douglas Eck, Paul Lamere, Thierry Bertin-Mahieux, and Stephen Green. Automatic generation of social tags for music recommendation. Advances in neural information processing systems, 2007.

    diff --git a/description/datasets.html b/description/datasets.html index c4ee061..b09e379 100644 --- a/description/datasets.html +++ b/description/datasets.html @@ -208,7 +208,7 @@
  • diff --git a/description/evaluation.html b/description/evaluation.html index d92edd7..5b577cb 100644 --- a/description/evaluation.html +++ b/description/evaluation.html @@ -210,7 +210,7 @@
  • diff --git a/description/models.html b/description/models.html index f17bea2..60cd225 100644 --- a/description/models.html +++ b/description/models.html @@ -210,7 +210,7 @@
  • diff --git a/description/tasks.html b/description/tasks.html index 5f2c7ef..0dcef15 100644 --- a/description/tasks.html +++ b/description/tasks.html @@ -210,7 +210,7 @@
  • diff --git a/generation/beyondtext.html b/generation/beyondtext.html index 6e80467..51d0fbb 100644 --- a/generation/beyondtext.html +++ b/generation/beyondtext.html @@ -61,7 +61,7 @@ - + @@ -208,7 +208,7 @@