Stable Audio - Diffusion-based Model#
Diffusion Model-based Text-to-Music Generation#
While we could dedicate an entire tutorial to discussing how diffusion works in the context of generative audio (and in fact, others have this year at ISMIR!), here we present a condensed review of how diffusion works before jumping into how text conditioning can be built into these models, using Stable Audio Open as a case study.
Diffusion: Continuous Generation through Iterative Refinement#
Unlike the LM-based case, where we wish to generate discrete tokens \(x \in \mathbb{N}\), the goal of diffusion is to generate continuous-valued data \(x \in \mathbb{R}\) (the same goal shared by the more classical VAEs and GANs).
Formally, if our data comes from some distribution \(\mathbf{x} \sim p(\mathbf{x})\), then the goal is to learn a model that allows us to sample from this distribution, \(p_\theta(\mathbf{x}) \approx p(\mathbf{x})\). Practically speaking, in order to sample from the data distribution, we parameterize our model as some generator \(G_\theta\) such that:

\[
\mathbf{x} = G_\theta(z), \quad z \sim \mathcal{N}(0, \boldsymbol{I}),
\]

i.e. we learn a model that transforms isotropic Gaussian noise into our target data.
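As a rough illustration (not any particular model), the single-call view looks like this in PyTorch, where the `generator` stack is just a stand-in for a trained \(G_\theta\):

```python
import torch

# Stand-in for a trained single-call generator G_theta (e.g. a GAN or VAE decoder):
# one forward pass maps isotropic Gaussian noise straight to a sample.
generator = torch.nn.Sequential(
    torch.nn.Linear(128, 512), torch.nn.ReLU(), torch.nn.Linear(512, 1024)
)

z = torch.randn(4, 128)   # z ~ N(0, I), a batch of 4 noise vectors
x = generator(z)          # x = G_theta(z), generated in a single model call
```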
One of the main reasons why diffusion models have been so successful at many generative media tasks compared to these classical models [] (and why they are more controllable) is their capacity for iterative refinement. In the above equation, the entire generation process occurs in a single model call. While this is certainly efficient (and many recent diffusion models have been conceptually reinventing GANs to capitalize on this efficiency), it is a lot of work to do in a single pass of the model, especially for high-dimensional data! What would be useful is a way to generate part of \(x\) in a given model call, and then call the model multiple times to fully generate \(x\) (and if you’re paying attention, this sounds eerily similar to autoregression).
In order to build our “multi-step generator”, we have to introduce the concept of first corrupting our data into noise (note: while this step doesn’t fit as cleanly in our condensed diffusion intro, we encourage readers to check out more complete diffusion writeups that motivate the paradigm through a wider lens []). Formally, we’ll first adapt our notation to model a diffusion process from clean data to noise indexed by the timestep \(0 \rightarrow T\), where \(\mathbf{x}_0 \sim p_0(\mathbf{x}_0)\) is our clean data (i.e. \(\mathbf{x} \sim p(\mathbf{x})\) previously) and \(\mathbf{x}_T \sim p_T(\mathbf{x}_T)\) is pure Gaussian noise (i.e. \(z\) previously). Then, we can define a diffusion process that gradually turns our clean data \(\mathbf{x}_0\) into Gaussian noise \(\mathbf{x}_T\) through the stochastic differential equation (SDE):

\[
\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\boldsymbol{w},
\]

where \(\boldsymbol{w}\) is a standard Wiener process (i.e. additive Gaussian noise), \(f(\mathbf{x}, t)\) is the *drift* coefficient of \(\mathbf{x}_t\), and \(g(t)\) is the *diffusion* coefficient. For clarity, we will use \(p_t(\mathbf{x})\) to denote the probability density of \(\mathbf{x}_t\).
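To make the forward process concrete, here is a minimal Euler-Maruyama simulation of the SDE above, with `f` and `g` left as user-supplied callables; this is purely illustrative (in practice, training noise is drawn from a closed-form perturbation kernel rather than simulated step by step):

```python
import torch

def forward_diffuse(x0, f, g, n_steps=1000, T=1.0):
    """Euler-Maruyama simulation of dx = f(x, t) dt + g(t) dw from t=0 to t=T."""
    dt = T / n_steps
    x = x0.clone()
    for i in range(n_steps):
        t = i * dt
        dw = torch.randn_like(x) * dt ** 0.5      # Wiener increment ~ N(0, dt)
        x = x + f(x, t) * dt + g(t) * dw          # one forward (noising) step
    return x

# Toy example: zero drift and a constant diffusion coefficient.
x0 = torch.randn(2, 64)                           # pretend "clean" data
xT = forward_diffuse(x0, f=lambda x, t: torch.zeros_like(x), g=lambda t: 1.0)
```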
The reason this is relevant at all is that a clever result from [] allows us to define a reverse diffusion process that transforms Gaussian noise back into data, given by:

\[
\mathrm{d}\mathbf{x} = \left[f(\mathbf{x}, t) - g(t)^2\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\boldsymbol{w}},
\]

where \(\bar{\boldsymbol{w}}\) is the reverse-time Wiener process and, notably, \(\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\) is the *score function* of the marginal probability distribution of \(\mathbf{x}_t\). In words, the score function defines a direction pointing towards higher-density regions of the data distribution; you can imagine it as something like taking the derivative along a 1-D curved path, but in high-dimensional space.
As we now have a way to define the process of converting noise to data, we can see that VAEs/GANs implicitly seek to learn a generator that integrates the above reverse-time SDE from \(T\) to \(0\), and thus learn a direct mapping from noise to data. The strength of diffusion models, however, comes instead from learning a score model \(s_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}}\log p_t(\mathbf{x})\) and iteratively solving the reverse-time SDE in multiple steps, in a sense walking through the reverse diffusion process at some fixed step size and checking the derivative at each point to determine where we should step next. In this way, diffusion models are able to iteratively refine the model output, gradually removing more and more noise from the starting isotropic Gaussian until our data is clear!
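As a sketch of what “iteratively solving the reverse-time SDE” means in code, here is a basic Euler-Maruyama reverse sampler; `score_model`, `f`, and `g` are assumed to be given, and practical samplers (DDPM/DDIM, DPM-Solver, etc.) refine this considerably:

```python
import torch

@torch.no_grad()
def sample_reverse_sde(score_model, f, g, shape, n_steps=500, T=1.0):
    """Euler-Maruyama solver for
    dx = [f(x, t) - g(t)^2 * score(x, t)] dt + g(t) dw_bar,
    integrated from t=T (pure noise) down to t=0 (data).
    score_model(x, t) is assumed to approximate grad_x log p_t(x)."""
    dt = T / n_steps
    x = torch.randn(shape)                            # x_T ~ N(0, I)
    for i in reversed(range(n_steps)):
        t = (i + 1) * dt
        drift = f(x, t) - g(t) ** 2 * score_model(x, t)
        noise = torch.randn_like(x) * dt ** 0.5
        x = x - drift * dt + g(t) * noise             # one refinement step backwards in time
    return x
```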
If this all sounds like some weird version of how LMs perform autoregression, you’d be about right! Sander Dieleman has a fantastic blog post on this conceptual similarity, and how one can view diffusion as autoregression, but in the spectral domain.
This concludes our intro on diffusion models. While there’s a lot of math here, as long as you understand the core idea that diffusion models approximate the gradient of the path from noise to data (rather than learning the path itself), you should be fine proceeding through this tutorial!
Representation#
Unlike the autoregressive language model approach, the exact input representation for diffusion-based TTM generation has varied a great deal since diffusion hit the scene in 2021. Below we list them, in rough chronological order:
- **Direct waveform modeling**: \(\mathbf{x}_0 \in \mathbb{R}^{f_s T \times 1}\), where \(f_s\) is the audio’s sampling rate and \(T\) is the overall duration in seconds. In words, we directly perform the diffusion process on the raw audio signal. This input representation is generally not used, both because the size of raw audio signals can get quite large (just 30 seconds of 44.1 kHz audio is over 1M floats!) and because diffusion just doesn’t work as well on raw audio signals (and there’s good reason for this).
- **Direct (mel)-spectrogram modeling** [WDWB23]: \(\mathbf{x}_0 \in \mathbb{R}^{H \times W \times C}\), where \(H\) and \(W\) are the height and width of the audio (mel)-spectrogram and \(C\) is the number of channels (normally this is just 1, but if using complex spectrograms it can be 2). In this way, TTM diffusion proceeds almost identically to non-latent image diffusion, as we simply treat the audio spectrograms as “images” and run diffusion on these now-2D signals. As we cannot directly convert mel spectrograms back to audio, these models generally train [] or use an off-the-shelf [WDWB23] vocoder \(V(\mathbf{x}_0) : \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{f_s T \times 1}\) to translate the generated mel spectrogram back to audio.
- **Latent (mel)-spectrogram modeling** [CWL+23]: \(\mathbf{x}_0 \in \mathbb{R}^{D_h \times D_w \times D_c}\), where \(D_h, D_w, D_c\) are the latent height, width, and number of channels after passing the spectrogram through a 2D autoencoder (and in general, \(D_h \ll H, D_w \ll W\) for efficiency while \(D_c > C\)). This is perhaps the first design to really break onto the TTM generation scene, with [] using the existing Stable Diffusion autoencoder and finetuning SD on spectrograms. This thus requires training a separate VAE \(\mathcal{D}, \mathcal{E}\) before training the TTM diffusion model. Once trained, sampling from the model involves generating the latent representation with diffusion, passing it through the decoder \(\mathcal{D}(\mathbf{x}_0): \mathbb{R}^{D_h \times D_w \times D_c} \rightarrow \mathbb{R}^{H \times W \times C}\), and then passing that output through the vocoder \(V\).
- **DAC-style latent audio modeling** []: \(\mathbf{x}_0 \in \mathbb{R}^{D_T \times 1 \times D_c}\), where \(D_T\) is the length of the compressed audio signal; here we circumvent the vocoder and spectrogram VAE and instead use a raw-audio VAE to directly compress the audio into a latent 1D (but multi-channel, as \(D_c\) is normally 32/64/96) sequence. Practically, this ends up being nearly identical to the discrete LM codecs like Encodec [DCSA22] or DAC [], with the only difference being that the discrete vector quantization is replaced with a standard VAE KL regularization, giving us continuous-valued latents rather than discrete tokens. In fact, much of the rest of the training process and architecture remains the same (i.e. fully convolutional 1D encoder/decoder with snake activations, multi-resolution STFT discriminators, etc.). Thus, to sample from the model, we generate the latent representation and directly pass it through the decoder \(\mathcal{D}\) to get the audio output. For the rest of the tutorial, we’ll focus on this one, as it is what Stable Audio Open uses (see the sketch after this list).
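To make the shapes concrete, here is a toy sketch of a DAC-style (continuous) audio VAE interface; the class, compression ratio, and latent size are illustrative stand-ins, not the actual Stable Audio Open autoencoder:

```python
import torch

class AudioVAE(torch.nn.Module):
    """Toy 1D audio autoencoder in the DAC/Stable Audio Open style (illustrative only)."""
    def __init__(self, latent_dim=64, downsample=2048):
        super().__init__()
        # Stand-ins for the real fully-convolutional encoder/decoder stacks.
        self.encoder = torch.nn.Conv1d(2, latent_dim, kernel_size=downsample, stride=downsample)
        self.decoder = torch.nn.ConvTranspose1d(latent_dim, 2, kernel_size=downsample, stride=downsample)

    def encode(self, audio):        # (B, 2, f_s * T) -> (B, D_c, D_T), continuous latents
        return self.encoder(audio)  # (KL-regularized in practice, rather than vector-quantized)

    def decode(self, latents):      # (B, D_c, D_T) -> (B, 2, ~f_s * T)
        return self.decoder(latents)

vae = AudioVAE()
audio = torch.randn(1, 2, 44100 * 10)   # 10 s of stereo 44.1 kHz audio
z = vae.encode(audio)                   # e.g. (1, 64, 215): this is the x_0 we diffuse
recon = vae.decode(z)                   # back to (approximately) the original waveform length
```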
Architecture#
Concerning architecture design, most diffusion models have fallen into one of two broad categories: U-Nets and Diffusion Transformers (DiTs). In this work, we focus on DiTs, both because most modern diffusion models are adopting this modeling paradigm [] and because DiTs are much simpler in terms of code design. A TTM DiT, in general, looks something like this:
In words, after the input latent representation is converted to “patches” (i.e. downsampling it further) and the input conditions (i.e. text, more on that later) and timestep/noise level are converted to their corresponding embeddings, a DiT is simply a stack of bidirectional transformer encoder blocks (i.e. like BERT) operating on this latent representation (with the conditioning providing some form of modulation), followed by a final linear and de-patchifying layer to get our prediction. DiTs have a number of nice scaling properties over U-Nets, are able to handle variable-length sequences a bit better, and notably allow for much cleaner code given the lack of manual residual down/up-sampling blocks.
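Below is a toy sketch of this skeleton (patchify, a stack of bidirectional transformer blocks, a linear head, and de-patchify); the sizes are arbitrary and the conditioning pathway is omitted until the next section:

```python
import torch
import torch.nn as nn

class TinyTTMDiT(nn.Module):
    """Toy DiT skeleton for 1D audio latents: patchify -> transformer blocks -> un-patchify."""
    def __init__(self, latent_dim=64, hidden=256, patch=2, depth=4, heads=4):
        super().__init__()
        self.patch = patch
        self.patchify = nn.Linear(latent_dim * patch, hidden)
        block = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.head = nn.Linear(hidden, latent_dim * patch)

    def forward(self, x):                                  # x: (B, D_T, D_c)
        B, T, C = x.shape
        x = x.reshape(B, T // self.patch, C * self.patch)  # group frames into patches
        h = self.blocks(self.patchify(x))                  # bidirectional self-attention (BERT-like)
        out = self.head(h)                                 # per-patch prediction
        return out.reshape(B, T, C)                        # de-patchify back to the latent shape

x = torch.randn(1, 216, 64)      # (batch, latent frames D_T, channels D_c)
pred = TinyTTMDiT()(x)           # same shape as the input latent
```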
Conditioning#
The big question now is: how does the text conditioning actually do anything in the model? Before hitting the model, the text prompt \(\mathbf{c}_{\textrm{text}}\) first has to be converted from a string to some numerical embedding, which we’ll call \(\mathbf{e}_{\textrm{text}} = \textrm{Emb}(\mathbf{c}_{\textrm{text}})\), where \(\textrm{Emb}\) is some embedding extraction function. In many cases, \(\textrm{Emb}\) uses a pre-trained text backbone (such as CLAP or T5), followed by 1 or more linear layers to project the embedding to the correct size. After embedding, \(\mathbf{e}_{\textrm{text}}\) is either a global embedding \(\mathbb{R}^{d}\) or a sequence-level embedding \(\mathbb{R}^{d \times \ell}\), where \(d\) is the hidden dimension of the DiT and \(\ell\) denotes the token length of the text embedding (i.e. the text embedding can be extracted per token, as is the case with T5).
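As an example of what \(\textrm{Emb}\) might look like with a T5 backbone (the checkpoint name and projection width here are illustrative, not necessarily what Stable Audio Open actually uses):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")        # illustrative checkpoint choice
text_encoder = T5EncoderModel.from_pretrained("t5-base")
proj = torch.nn.Linear(text_encoder.config.d_model, 256)    # project to the DiT width d (here 256)

prompt = "warm analog synth arpeggio, 120 bpm"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    hidden = text_encoder(**tokens).last_hidden_state       # (1, ell, d_model): per-token features
e_text = proj(hidden)                                        # (1, ell, d): sequence-level embedding
e_global = e_text.mean(dim=1)                                # (1, d): one simple way to pool a global embedding
```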
There are now multiple ways \(\mathbf{e}_{\textrm{text}}\) can interact with the main diffusion latent inside the model (and they can all be combined), so we’ll go over a few of them:
- **Time-Domain Concatenation** (aka In-Context Conditioning or Prefix Conditioning): Here, we simply append the text condition to the diffusion latent sequence to get some new latent \(\hat{\mathbf{x}} = [\mathbf{x}, \mathbf{e}_{\textrm{text}}] \in \mathbb{R}^{(D_T + \ell) \times 1 \times d}\), where \([\cdot]\) is the concatenation operation along the time axis (i.e. the sequence gets longer), and remove these extra tokens after all the DiT blocks. In this way, the text operates on the diffusion latent through the DiT’s self-attention blocks only, and causes minimal to moderate slowdown depending on the length of the text embedding.
- **Channel-Wise Concatenation**: Here, \(\mathbf{e}_{\textrm{text}}\) is first projected to have the same sequence length as the main diffusion latent, and then concatenated along the channel dimension, \(\hat{\mathbf{x}} = [\mathbf{x}, \textrm{Proj}(\mathbf{e}_{\textrm{text}})] \in \mathbb{R}^{D_T \times 1 \times 2d}\) (i.e. the sequence gets deeper, in a sense). This is generally not used that much for text conditioning (but is great for other conditions), as it imbues the text with a sort of temporality that does not exist for global captions.
- **Cross-Attention**: Here, we add additional cross-attention layers interleaved with the self-attention layers inside each DiT block, where the diffusion latent directly attends to \(\mathbf{e}_{\textrm{text}}\). This perhaps offers the best control (and is what Stable Audio Open uses), at the cost of the most added compute, given the extra attention computation added to every block.
- **Adaptive Layer-Norm (AdaLN)**: Here, the layer-norms in each DiT block are augmented with shift, scale, and gate parameters (one for each index of the hidden dimension) that are learned from \(\mathbf{e}_{\textrm{text}}\) through a small MLP: \(\gamma_{\textrm{shift}}, \gamma_{\textrm{scale}}, \gamma_{\textrm{gate}} = \textrm{MLP}(\mathbf{e}_{\textrm{text}})\). This adds the least computation to the model, and is what the original DiT work uses []. Note that in spirit, this is practically identical to the Feature-wise Linear Modulation (FiLM) layers used in MusicLDM [CWL+23]. A combined sketch of cross-attention and AdaLN conditioning follows this list.
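To tie the last two mechanisms together, here is a toy DiT block combining cross-attention to the text embedding with AdaLN-style modulation; the sizes and exact arrangement are illustrative rather than Stable Audio Open’s actual block:

```python
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    """Toy DiT block: AdaLN-modulated self-attention, cross-attention to text, then an MLP."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(d, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ada = nn.Linear(d, 3 * d)   # predicts shift/scale/gate from the pooled condition

    def forward(self, x, e_text):                    # x: (B, T, d), e_text: (B, ell, d)
        shift, scale, gate = self.ada(e_text.mean(dim=1)).chunk(3, dim=-1)
        h = self.norm1(x) * (1 + scale[:, None]) + shift[:, None]   # AdaLN modulation
        x = x + gate[:, None] * self.self_attn(h, h, h)[0]          # gated self-attention
        h = self.norm2(x)
        x = x + self.cross_attn(h, e_text, e_text)[0]               # latent attends to the text tokens
        return x + self.mlp(self.norm3(x))

x = torch.randn(1, 108, 256)          # patchified diffusion latent
e_text = torch.randn(1, 20, 256)      # per-token text embedding
out = ConditionedDiTBlock()(x, e_text)
```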