diff --git a/_images/encoder_decoder.png b/_images/encoder_decoder.png
new file mode 100644
index 0000000..163d337
Binary files /dev/null and b/_images/encoder_decoder.png differ
diff --git a/_images/muchomusic.png b/_images/muchomusic.png
new file mode 100644
index 0000000..ea107eb
Binary files /dev/null and b/_images/muchomusic.png differ
diff --git a/_sources/description/datasets.ipynb b/_sources/description/datasets.ipynb
index 7218604..23b2055 100644
--- a/_sources/description/datasets.ipynb
+++ b/_sources/description/datasets.ipynb
@@ -9,6 +9,8 @@
"\n",
"Below we review existing music datasets that contain natural language text accompanying music items. These can be music descriptions such as captions, or other types of text such as question-answer pairs and reviews. Note that in many cases the audio isn't directly distributed with the dataset and may be subject to copyright. \n",
"\n",
+ "```{table} Music description datasets.\n",
+ ":name: description_datasets\n",
"| Dataset | Content | Size (# annotations) | Accompanying Audio | Audio Length | Audio License| Text source | Dataset License\n",
"| ------- | ------ | ---- | ---- | ---- | ---- | ---- | ---- | \n",
"| [MusicCaps](https://www.kaggle.com/datasets/googleai/musiccaps) | Captions | 5.5k | ❌
(YT IDs from AudioSet)| 10s | - | Human-written (by musicians) | CC BY-SA 4.0 |\n",
@@ -22,7 +24,8 @@
"|[MUCaps](https://huggingface.co/datasets/M2UGen/MUCaps) | Captions | 22k | ❌ (YT IDs from AudioSet) | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0|\n",
"|[MuEdit](https://huggingface.co/datasets/M2UGen/MUEdit) | Music editing instructions | 11k | ❌
(MusicCaps) | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0|\n",
"|[FUTGA](https://huggingface.co/datasets/JoshuaW1997/FUTGA) | Captions (fine-grained) | 51.8k | ❌
(MusicCaps, Song Describer Dataset) | 2-5min | - | Synthetic (generated from audio via FUTGA) | Apache-2.0 |\n",
- "|[MARD](https://www.upf.edu/web/mtg/mard) | Album reviews | 264k| ❌ | - | - | Human-written (Amazon customers) | MIT |"
+ "|[MARD](https://www.upf.edu/web/mtg/mard) | Album reviews | 264k| ❌ | - | - | Human-written (Amazon customers) | MIT |\n",
+ "```"
]
},
{
diff --git a/_sources/description/evaluation.md b/_sources/description/evaluation.md
index 569bf8e..219caa7 100644
--- a/_sources/description/evaluation.md
+++ b/_sources/description/evaluation.md
@@ -1,6 +1,5 @@
(description_evaluation)=
# Evaluation
-
Reliably evaluating music description systems is a challenging endeavour. Even when we have "ground-truth" captions, it is not always clear how to score generated text, as music description is open-ended and at least partially subjective. The quality of a description is also strongly dependent on the context in which it is used. This issue becomes even more pronounced with dialogue-based tasks like MQA or other forms of instruction-based description.
Comparing outputs to gold-standard references from static datasets can help, but it's only a first step.
@@ -25,11 +24,20 @@ We briefly review each of these metrics below:
* **BERT-Score** also computes the similarity between tokens in a generated sentence and tokens in the ground-truth text, but does so using contextual embeddings obtained from a pre-trained BERT model instead of exact matches, resulting in a higher correlation with human judgements.
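+As an illustration, BERT-Score can be computed with the `bert-score` package. This is a minimal sketch with made-up example captions, not part of any specific benchmark protocol:
+```python
+# Minimal sketch: scoring a generated caption against a reference with BERT-Score.
+# Assumes the `bert-score` package is installed (pip install bert-score).
+from bert_score import score
+
+candidates = ["A slow acoustic guitar ballad with soft male vocals."]
+references = ["A mellow folk song featuring fingerpicked guitar and gentle singing."]
+
+# Returns precision, recall and F1, one value per candidate/reference pair.
+P, R, F1 = score(candidates, references, lang="en", verbose=False)
+print(f"BERT-Score F1: {F1.mean().item():.3f}")
+```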
-## Other types of automatic evaluation
-* Multiple-choice question answering: MuChoMusic {cite}`weck_muchomusic_2024`
-* Other benchmarks: OpenMU {cite}`zhao_openmu_2024`
-* LLM-as-a-judge
-* Non audio: {cite}`li_music_2024`
+### Limitations
+While a useful starting point for evaluating model outputs on more closed-ended tasks, these metrics cannot capture all admissible variations in music description. For example, given a music track, there may be several captions that are equally valid yet share very little syntactic or semantic similarity. Both in the music domain and in others such as general audio description, many studies have highlighted important limitations of these metrics, showing for example that they fail to account for valid variations in captions and correlate poorly with human judgement {cite}`lee2024captioningmetricsreflectmusic`. For this reason, including human evaluation and task-specific benchmarks is necessary for a more well-rounded assessment.
+
+## Benchmarks
+To overcome some of the shortcomings of match-based metrics, a few benchmarks have recently emerged with the goal of assessing music understanding or description via multiple-choice question-answering. These also better suit the conversational format of more recent music description systems, as they focus on assessing responses to specific user prompts (questions). Some benchmarks of this kind are designed for general audio-language evaluation and include music as part of a wider range of domains. Among these are AudioBench {cite}`wang2024audiobench` and AIR-Bench {cite}`yang-etal-2024-air`. Others, including [MuChoMusic](https://mulab-mir.github.io/muchomusic/) {cite}`weck_muchomusic_2024` and [OpenMU](https://mzhaojp22.github.io/open_music_understanding/) {cite}`zhao_openmu_2024`, directly focus on music:
+
+```{figure} ./img/muchomusic.png
+---
+name: muchomusic
+width: 400px
+align: center
+---
+The MuChoMusic benchmark evaluates music understanding in audio-language models via multiple-choice question answering {cite}`weck_muchomusic_2024`.
+```
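+Below is a hedged sketch of how such multiple-choice benchmarks are typically scored, assuming model answers are reduced to a single option letter (the answer-extraction rule is illustrative, not the exact protocol of MuChoMusic or OpenMU):
+```python
+# Illustrative sketch: scoring multiple-choice answers by exact letter match.
+# The answer-extraction rule is an assumption; real benchmarks define their own.
+import re
+
+def extract_choice(response: str):
+    """Return the first standalone option letter (A-D) found in a model response."""
+    match = re.search(r"\b([A-D])\b", response.strip().upper())
+    return match.group(1) if match else None
+
+predictions = ["The answer is B.", "C", "I think (A) fits best."]
+ground_truth = ["B", "D", "A"]
+
+correct = sum(extract_choice(p) == g for p, g in zip(predictions, ground_truth))
+print(f"Accuracy: {correct / len(ground_truth):.2f}")
+```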
## References
diff --git a/_sources/description/models.md b/_sources/description/models.md
index 5602016..e3ba031 100644
--- a/_sources/description/models.md
+++ b/_sources/description/models.md
@@ -3,26 +3,30 @@
Deep learning models for music description via natural language typically fit into one of two designs:
-- Encoder-decoder
-- Multimodal Autoregressive Models
+- [Encoder-decoder](encoder_decoder_models) models
+- [Multimodal AR](multimodal_ar) models, most often in the form of [adapted LLMs](adapted_llms)
-In {numref}`description_models_table` below we give an overview of music description models from 2016 to today. * denotes taks that don't fall under the music description umbrella but are still addressed by the model.
+In {numref}`description_models_table` we give an overview of music description models from 2016 to today. * denotes tasks that don't fall under the music description umbrella but are still addressed by the model.
```{table} Music description models.
:name: description_models_table
| Model | Type | Task(s) | Weights | Training dataset |
| ------- | ------ | ---- | ---- | ---- |
-| Choi *et al.* {cite}`manco2021muscaps` | Encoder-decoder | Playlist captioning | ❌ | Private |
-| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private |
-| LP-MusicCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning | ✅ | |
-| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private |
-| BLAP {cite}`lanzendorfer_blap_2024` | Encoder-decoder | | | |
-| MuLLama {cite}`liu_music_2024` | Adapted LLM | | | |
-| MusiLingo {cite}`deng_musilingo_2024` | Adapted LLM | | | |
-| M2UGen{cite}`hussain2023m` | Adapted LLM | | | |
-| LLark {cite}`gardner2023llark` | Adapted LLM | | | |
+| Choi *et al.* {cite}`manco2021muscaps` | Encoder-decoder | Captioning (playlist) | ❌ | Private data |
+| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private data |
+| PlayNTell {cite}`gabbolini-etal-2022-data` | Encoder-decoder | Captioning (playlist) | ✅ [link]() | PlayNTell |
+| LP-MusicCaps {cite}`doh2023lp` | Encoder-decoder | Captioning | ✅ [link](https://huggingface.co/seungheondoh/lp-music-caps) | LP-MusicCaps |
+| ALCAP {cite}`he2023alcap` | Encoder-decoder | Captioning | ❌ | Song Interpretation Dataset, NetEase Cloud Music Review Dataset |
+| BLAP {cite}`lanzendorfer_blap_2024` | Adapted LLM | Captioning | ✅ [link](https://huggingface.co/Tino3141/blap/tree/main) | Shutterstock (31k clips) |
+| LLark {cite}`gardner2023llark` | Adapted LLM | Captioning, MQA | ❌ | MusicCaps, YouTube8M-MusicTextClips, MusicNet, FMA, MTG-Jamendo, MagnaTagATune|
+| MU-LLaMA {cite}`liu_music_2024` | Adapted LLM | Captioning, MQA | ✅ [link](https://huggingface.co/mu-llama/MU-LLaMA/tree/main) | MusicQA |
+| MusiLingo {cite}`deng_musilingo_2024` | Adapted LLM | Captioning, MQA | ✅ [link](https://github.com/zihaod/MusiLingo?tab=readme-ov-file#model-checkpoints) | MusicInstruct |
+| M2UGen {cite}`hussain2023m` | Adapted LLM | Captioning, MQA, music generation | ✅ [link](https://huggingface.co/M2UGen) | MUCaps, MUEdit |
+| OpenMU {cite}`zhao2024openmu` | Adapted LLM | Captioning, MQA | ✅ [link]() | MusicCaps, YouTube8M-MusicTextClips, MusicNet, FMA, MTG-Jamendo, MagnaTagATune|
+| FUTGA {cite}`wu2024futga` | Adapted LLM | Captioning (fine-grained) | ✅ [link](https://huggingface.co/JoshuaW1997/FUTGA) | FUTGA|
```
+(encoder_decoder_models)=
## Encoder-Decoder Models
This is the modelling framework of the earliest DL music captioning models.
Encoder-decoder models first emerged in the context of sequence-to-sequence tasks such as machine translation. Since many other tasks can be cast as sequence-to-sequence problems, encoder-decoder models soon found wide use in image captioning, and shortly afterwards in audio captioning, including music.
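+To make the framework concrete, below is a toy encoder-decoder captioner in PyTorch. It is a sketch under simplified assumptions (a GRU audio encoder, a GRU decoder trained with teacher forcing, dummy log-mel features and token ids), not a reimplementation of any model in {numref}`description_models_table`:
+```python
+import torch
+import torch.nn as nn
+
+class AudioEncoder(nn.Module):
+    def __init__(self, n_mels=128, hidden=256):
+        super().__init__()
+        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
+
+    def forward(self, mel):                  # mel: (batch, frames, n_mels)
+        _, h = self.rnn(mel)                 # h: (1, batch, hidden)
+        return h.squeeze(0)                  # (batch, hidden) summary of the clip
+
+class CaptionDecoder(nn.Module):
+    def __init__(self, vocab_size=1000, hidden=256):
+        super().__init__()
+        self.embed = nn.Embedding(vocab_size, hidden)
+        self.cell = nn.GRUCell(hidden, hidden)
+        self.out = nn.Linear(hidden, vocab_size)
+
+    def forward(self, audio_state, tokens):  # teacher forcing over caption tokens
+        h, logits = audio_state, []
+        for t in range(tokens.shape[1]):
+            h = self.cell(self.embed(tokens[:, t]), h)
+            logits.append(self.out(h))
+        return torch.stack(logits, dim=1)    # (batch, seq_len, vocab_size)
+
+encoder, decoder = AudioEncoder(), CaptionDecoder()
+mel = torch.randn(2, 400, 128)               # dummy log-mel features for two clips
+tokens = torch.randint(0, 1000, (2, 12))     # dummy caption token ids
+logits = decoder(encoder(mel), tokens)
+print(logits.shape)                          # torch.Size([2, 12, 1000])
+```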
@@ -89,7 +93,7 @@ where $\boldsymbol{w}_{a t t}$ and $\boldsymbol{W}^{a t t}$ are learnable parame
Similar types of attention-based fusion can also be used in Transformer-based architectures {cite}`gabbolini-etal-2022-data` {cite}`doh2023lp`. In this setting, instead of the cross-attention shown above, fusion can also be directly embedded within the Transformer blocks by modifying their self-attention mechanism to depend on both text and audio embeddings, though exact implementations of co-attentional Transformer layers vary between models:
$$
-\boldsymbol{A}\left(\boldsymbol{q}^{\text{text}}_{i}, \boldsymbol{K}^{\text{audio}}, \boldsymbol{V}^{\text{audio}}\right)=\operatorname{softmax}\left(\frac{\boldsymbol{q}^{\text{text}}_{i} K^{\text{audio}}}{\sqrt{d_{k}}}\right) \boldsymbol{V}^{\text{audio}}
+\boldsymbol{A}\left(\boldsymbol{q}^{\text{text}}_{i}, \boldsymbol{K}^{\text{audio}}, \boldsymbol{V}^{\text{audio}}\right)=\operatorname{softmax}\left(\frac{\boldsymbol{q}^{\text{text}}_{i} \left(\boldsymbol{K}^{\text{audio}}\right)^{\top}}{\sqrt{d_{k}}}\right) \boldsymbol{V}^{\text{audio}}.
$$
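+As a sketch, the co-attention above can be written in a few lines of PyTorch, here applied to a whole batch of text queries at once (batch and dimension sizes are illustrative assumptions):
+```python
+# Sketch of cross-attention fusion: text queries attend over audio keys/values.
+import torch
+import torch.nn.functional as F
+
+def cross_attend(q_text, k_audio, v_audio):
+    # q_text: (batch, n_text, d_k); k_audio, v_audio: (batch, n_audio, d_k)
+    d_k = q_text.shape[-1]
+    scores = q_text @ k_audio.transpose(-2, -1) / d_k ** 0.5  # (batch, n_text, n_audio)
+    return F.softmax(scores, dim=-1) @ v_audio                # (batch, n_text, d_k)
+
+fused = cross_attend(torch.randn(1, 16, 64), torch.randn(1, 50, 64), torch.randn(1, 50, 64))
+print(fused.shape)  # torch.Size([1, 16, 64])
+```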
@@ -103,12 +107,13 @@ align: center
In addition to the type of mechanism used, depending on the level at which modalities are combined, it is also common to distinguish between *early* (i.e. at the input level), *intermediate* (at the level of latent representations produced by an intermediate step in the overall processing pipeline) or *late* fusion (i.e. at the output level). We note that the terms *early, intermediate* and *late* fusion do not have an unequivocal definition and are used slightly differently in different works.
+(multimodal_ar)=
+## Multimodal AR Models
+The success of Large Language Models (LLMs) has largely influenced the development of music description in recent years. As a consequence, today's state-of-the-art models rely on LLMs in one form or another. Typically, this means that music description systems closely mimic text-only autoregressive modelling via Transformers, but within this framework there are two main routes we can take. The first, and most common, is to adapt text-only LLMs so that they become multimodal by augmenting them with additional modelling components. We call these *adapted LLMs*. A second option is to instead treat audio and text as sequences of tokens from the start, devising tokenization techniques and training on multiple modalities without additional modality-specific components. The line between these two approaches is not always clear. In the next section, we attempt to better define the salient characteristics of LLMs adapted to music-language inputs, and sketch out the newer trend towards natively multimodal models and its potential in music description.
-## Multimodal Autoregressive Models
-The success of Large Language Models (LLMs) has largely influenced the development of music description in recent years. As a consequence, today's state-of-the-art models rely on LLMs in one form or another. Typically, this means that music description systems closely mimic text-only autoregressive modelling via Transformers, but within this framework there are two main routes we can take. The first, and most common, is to adapt text-only LLMs so that they become multimodal by augmenting them with additional modelling components. A second option is to instead treat audio and text as sequences of tokens from the start, devising tokenization techniques and training on multiple modalities without additional modality-specific components. The line between these two approaches is not always clear. In the next section, we attempt to better define the salient characteristics of LLMs adapted to music-language inputs, and sketch out the newer trend towards natively multimodal models and its potential in music description.
-
-Overall, a common thread in this line of work is the attempt to unify multimodal tasks by reframing all as text generation. When trained on music data, multimodal LLMs can therefore leverage their text-based interface to enable a variety of music understanding tasks by simply allowing users to query via text and obtain information about a given audio input. This is the machanism that enables the conversation-based music description tasks we have seen in the [Tasks](description_tasks) section.
+Overall, a common thread in this line of work is the attempt to unify multimodal tasks by reframing them all as text generation. When trained on music data, multimodal LLMs can therefore leverage their text-based interface to enable a variety of music understanding and description tasks by simply allowing users to query via text and obtain information about a given audio input. This is the mechanism that enables the conversation-based music description tasks we have seen in the [Tasks](description_tasks) section.
+(adapted_llms)=
### Adapted LLMs
One modelling paradigm that has become particularly popular in audio description, including music, is that of adapted (multimodal) LLMs. At the core of this approach is a pre-trained text-only LLM, which is adapted to take in inputs of different modalities
such as audio. This is achieved via an *adapter* module, a light-weight neural network trained to map embeddings produced by an audio feature extractor (usually pre-trained and then frozen) to the input space of the LLM. As a result of this adaptation process, the LLM can then receive audio embeddings alongside text embeddings.
@@ -121,31 +126,23 @@ align: center
---
```
-🚧
-
-Alongside music-specialised multimodal LLMs, a LLM with general-audio understanding capabilities can similarly perform music description tasks such as captioning and MQA. Among these we count:
-* SALMONN {cite}`tang_salmonn_2023`
-* Pengi {cite}`deshmukh_pengi_2023`
-* Qwen-Audio `chu_qwen-audio_2023`
-* LTU
-* [Audio-LLM: Activating the Capabilities of Large Language Models to Comprehend Audio Data](https://link.springer.com/chapter/10.1007/978-981-97-4399-5_13)
-
-We don't discuss these in detail, but their high-level design is similar to the music-specialised models we've seen in this section.
-
-#### Adapter Modules
+The adapter modules employed in adapted LLMs for music are typically lightweight MLPs (with 2 to 3 hidden layers) or Q-Formers. Other architectures used in general-audio adapted LLMs (or similar models in the visual domain) include more complex designs such as gated cross-attention (XATTN) dense layers. [This blog post](https://lilianweng.github.io/posts/2022-06-09-vlm/) about Visual Language Models reviews these in more detail.
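+As a minimal sketch, an MLP adapter of this kind might look as follows (dimensions, layer count and the concatenation with text embeddings are illustrative assumptions, not a specific model's implementation):
+```python
+# Hedged sketch of a lightweight MLP adapter: it maps frozen audio-encoder embeddings
+# into the LLM's input embedding space so they can be prepended to text embeddings.
+import torch
+import torch.nn as nn
+
+class AudioAdapter(nn.Module):
+    def __init__(self, audio_dim=768, llm_dim=4096, hidden_dim=2048):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(audio_dim, hidden_dim),
+            nn.GELU(),
+            nn.Linear(hidden_dim, llm_dim),
+        )
+
+    def forward(self, audio_embeddings):          # (batch, n_frames, audio_dim)
+        return self.net(audio_embeddings)         # (batch, n_frames, llm_dim)
+
+adapter = AudioAdapter()
+audio_tokens = adapter(torch.randn(1, 32, 768))   # pseudo "audio tokens" for the LLM
+text_tokens = torch.randn(1, 10, 4096)            # embedded text prompt (illustrative)
+llm_input = torch.cat([audio_tokens, text_tokens], dim=1)
+print(llm_input.shape)  # torch.Size([1, 42, 4096])
+```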
-#### Training
As in the text-only setting, training adapted LLMs is usually broken into several stages. After pre-training and finetuning of the text-only part, the remaining components undergo a series of multimodal training stages, while the backbone LLM is either kept frozen or further finetuned. These steps are usually a mixture of multi-task pre-training and supervised finetuning, often including instruction tuning, all carried out on pairs of audio and text data.
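+A sketch of this recipe, with stand-in modules to show which parameters are frozen and which are updated (all module definitions below are placeholders, not real model components):
+```python
+# Freeze the pre-trained audio encoder and backbone LLM; train only the adapter.
+import torch
+import torch.nn as nn
+
+audio_encoder = nn.Linear(128, 768)   # stand-in for a pre-trained audio encoder
+llm_backbone = nn.Linear(4096, 4096)  # stand-in for a pre-trained text-only LLM
+adapter = nn.Sequential(nn.Linear(768, 2048), nn.GELU(), nn.Linear(2048, 4096))
+
+for module in (audio_encoder, llm_backbone):
+    for p in module.parameters():
+        p.requires_grad = False        # keep pre-trained weights fixed
+
+optimizer = torch.optim.AdamW(
+    (p for p in adapter.parameters() if p.requires_grad), lr=1e-4
+)
+print(sum(p.numel() for p in adapter.parameters()))  # only these are trained
+```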
-##### Instruction Tuning
+Alongside music-specialised multimodal LLMs such as those in {numref}`description_models_table`, LLMs with general-audio understanding capabilities can similarly perform music description tasks such as captioning and MQA. Among these we count:
+* SALMONN {cite}`tang_salmonn_2023`
+* Pengi {cite}`deshmukh_pengi_2023`
+* Qwen-Audio {cite}`chu_qwen-audio_2023`
+* LTU {cite}`gong2023listen`
+* Audio Flamingo {cite}`kong2024audio_flamingo`
+* Audio-LLM {cite}`zhang2024audio_llm`
### Natively Multimodal AR Models
-Other autoregressive Transformer models for music description share a similar core modelling mechanism to adapted LLM. But one key difference is that, while adapted LLMs require modality-specific encoders, usually pre-trained separately, natively multimodal LLMs forgo this in favour of a unified tokenization scheme that treats audio tokens much like text tokens from the start.
+Adapted LLMs make it possible to turn text-only LLMs into multimodal models relatively efficiently: based on the models discussed in this section, around 20-150k audio-text paired samples are required for the adaptation stage of training, while multimodal pre-training would require orders of magnitude more data. However, this efficiency also limits their performance and often results in a bias towards the language modality and poor audio and music understanding capabilities {cite}`weck_muchomusic_2024`. An alternative that promises to overcome this limitation is to instead adopt a natively multimodal approach to AR modelling. One key difference is that, while adapted LLMs require modality-specific encoders, usually pre-trained separately, natively multimodal LLMs forgo these in favour of a unified tokenization scheme that treats audio tokens much like text tokens from the start.
This paradigm is sometimes referred to as mixed-modal early-fusion modelling.
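+A toy sketch of such a scheme, where audio codec tokens are shifted past the text vocabulary so that a single AR Transformer can model one interleaved sequence (vocabulary sizes and token ids are made-up assumptions, not any specific model's tokenizer):
+```python
+# Illustrative mixed-modal tokenization: one shared id space for text and audio tokens.
+TEXT_VOCAB_SIZE = 32_000
+AUDIO_CODEBOOK_SIZE = 1_024
+
+def audio_to_unified(audio_codes):
+    """Shift audio codec token ids into the shared vocabulary, after the text ids."""
+    return [TEXT_VOCAB_SIZE + c for c in audio_codes]
+
+text_ids = [101, 2054, 2003]                  # tokenized text prompt (dummy ids)
+audio_ids = audio_to_unified([12, 512, 887])  # tokenized audio clip (dummy codec ids)
+
+# One flat sequence; no modality-specific encoder is needed at the input.
+sequence = text_ids + audio_ids
+print(sequence)  # [101, 2054, 2003, 32012, 32512, 32887]
+```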
-It's worth noting that, at this time, this type of model is a promising direction for music description, rather than a fully established paradigm. Currently, no music-specialised multimodal AR Transformers exist, but some general-purpose models include music-domain data in their training and evaluation. This is in line with the overall trend of developing large-scale models that tackle all domains, but it remains to be seen what the impact of this modalling paradigm will be on music description in the years to come. Among current examples of this type of model that include music description we have:
-* AnyGPT {cite}`doh2023lp`
-*
+It's worth noting that, at this time, this type of model is a promising direction for music description rather than a fully established paradigm. Currently, no music-specialised multimodal AR Transformers exist, but some general-purpose models, such as AnyGPT {cite}`zhan-etal-2024-anygpt`, include music-domain data in their training and evaluation. This is in line with the overall trend of developing large-scale models that tackle all domains, but it remains to be seen what the impact of this modelling paradigm will be on music description in the years to come.
## References
diff --git a/_sources/description/tasks.md b/_sources/description/tasks.md
index 1d5bd03..d06c7a5 100644
--- a/_sources/description/tasks.md
+++ b/_sources/description/tasks.md
@@ -87,7 +87,7 @@ align: center
A key difference between dialogue-based description and one-off captioning is that, instead of an `audio --> text` mapping, we are now dealing with an `(audio, text) --> text` mapping. This is reflected in the different model designs typically considered for these tasks (see [Models](description_models)). Differently from simple MQA, in music dialogue generation, responses are expected to be based on the entire dialogue history instead of only considering the current input.
-In terms of real-world applications, the advantages of dialogue-based description are clear: instead of being constrained to a one-shot caption or answer, it allows users to provide text inputs to further instruct the model on what kind of information should be included, or how the text output itself should be structured. In short, these tasks make for a much more flexible approach which better reflects real-world use. One drawback is that they are harder to evaluate (see [Evaluation](description-evaluation))!
+In terms of real-world applications, the advantages of dialogue-based description are clear: instead of being constrained to a one-shot caption or answer, it allows users to provide text inputs to further instruct the model on what kind of information should be included, or how the text output itself should be structured. In short, these tasks make for a much more flexible approach which better reflects real-world use. One drawback is that they are harder to evaluate (see [Evaluation](description_evaluation))!
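+As a sketch of this `(audio, text) --> text` interface, a dialogue wrapper might track the message history alongside the audio input (the class and its placeholder response are purely illustrative, not a specific model's API):
+```python
+# Hedged sketch: responses are conditioned on the audio plus the full dialogue history.
+from dataclasses import dataclass, field
+
+@dataclass
+class MusicDialogue:
+    audio_path: str
+    history: list = field(default_factory=list)
+
+    def ask(self, question: str) -> str:
+        self.history.append({"role": "user", "content": question})
+        # A real system would feed (audio embeddings + full history) to the model here.
+        answer = f"[response conditioned on {self.audio_path} and {len(self.history)} turns]"
+        self.history.append({"role": "assistant", "content": answer})
+        return answer
+
+dialogue = MusicDialogue("track.wav")
+dialogue.ask("What instruments can you hear?")
+dialogue.ask("And how would you describe the mood?")  # answered using both turns
+```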
## References
diff --git a/bibliography.html b/bibliography.html
index 5a68f36..b8cf727 100644
--- a/bibliography.html
+++ b/bibliography.html
@@ -487,6 +487,10 @@
Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. Convolutional recurrent neural networks for music classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 2392–2396. 2017. doi:10.1109/ICASSP.2017.7952585.
+Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models. December 2023. arXiv:2311.07919 [cs, eess]. URL: http://arxiv.org/abs/2311.07919 (visited on 2024-02-26), doi:10.48550/arXiv.2311.07919.
+Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, and others. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
@@ -611,6 +615,10 @@
Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. It's raw! audio generation with state-space models. In International Conference on Machine Learning, ICML, volume 162 of Proceedings of Machine Learning Research, 7616–7633. PMLR, 2022.
Yuan Gong, Hongyin Luo, Alexander H Liu, Leonid Karlinsky, and James Glass. Listen, think, and understand. arXiv preprint arXiv:2305.10790, 2023.
+Daniel W. Griffin and Jae S. Lim. Signal estimation from modified short-time fourier transform. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 804–807. IEEE, 1983.
@@ -667,6 +675,10 @@
Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. Panns: large-scale pretrained audio neural networks for audio pattern recognition. IEEE ACM Trans. Audio Speech Lang. Process., 28:2880–2894, 2020.
Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: a novel audio language model with few-shot learning and dialogue abilities. In ICML. 2024. URL: https://openreview.net/forum?id=WYi3WKZjYe.
+Junghyun Koo, Gordon Wichern, Francois G Germain, Sameer Khurana, and Jonathan Le Roux. Smitin: self-monitored inference-time intervention for generative music transformers. arXiv preprint arXiv:2404.02252, 2024.
@@ -691,6 +703,10 @@
Jin Ha Lee. Analysis of user needs and information features in natural language queries seeking music information. Journal of the American Society for Information Science and Technology, 61(5):1025–1045, 2010.
Jinwoo Lee and Kyogu Lee. Do captioning metrics reflect music semantic alignment? In International Society for Music Information Retrieval (ISMIR) 2024, Late Breaking Demo (LBD). 2024. URL: https://arxiv.org/abs/2411.11692.
+Jongpil Lee and Juhan Nam. Multi-level and multi-scale feature aggregation using pretrained convolutional neural networks for music auto-tagging. IEEE Signal Processing Letters, 24(8):1208–1212, 2017. doi:10.1109/LSP.2017.2713830.
@@ -703,9 +719,9 @@
Mark Levy, Bruno Di Giorgi, Floris Weers, Angelos Katharopoulos, and Tom Nickson. Controllable music production with diffusion models and guidance gradients. In Diffusion Models Workshop at NeurIPS. 2023.
Jiajia Li, Lu Yang, Mingni Tang, Cong Chen, Zuchao Li, Ping Wang, and Hai Zhao. The music maestro or the musically challenged, a massive music evaluation benchmark for large language models. arXiv preprint arXiv:2406.15885, 2024.
+Dongting Li, Chenchong Tang, and Han Liu. Audio-LLM: activating the capabilities of large language models to comprehend audio data. In Xinyi Le and Zhijun Zhang, editors, Advances in Neural Networks – ISNN 2024, 133–142. Singapore, 2024. Springer Nature Singapore.
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems (NeurIPS). 2017.
Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F Chen. Audiobench: a universal benchmark for audio large language models. arXiv preprint arXiv:2406.16020, 2024.
+Ziyu Wang, Yiyi Zhang, Yixiao Zhang, Junyan Jiang, Ruihan Yang, Gus Xia, and Junbo Zhao. PIANOTREE VAE: structured representation learning for polyphonic music. In Proceedings of the 21th International Society for Music Information Retrieval Conference, ISMIR, 368–375. 2020.
@@ -943,16 +963,28 @@
Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. Midinet: A convolutional generative adversarial network for symbolic-domain music generation. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR, 324–331. 2017.
Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, and Jingren Zhou. AIR-bench: benchmarking large audio-language models via generative comprehension. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1979–1998. Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL: https://aclanthology.org/2024.acl-long.109, doi:10.18653/v1/2024.acl-long.109.
+Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu-Gang Jiang, and Xipeng Qiu. AnyGPT: unified multimodal LLM with discrete sequence modeling. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 9637–9662. Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL: https://aclanthology.org/2024.acl-long.521, doi:10.18653/v1/2024.acl-long.521.
+Yixiao Zhang, Yukara Ikemiya, Gus Xia, Naoki Murata, Marco A Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji, and Simon Dixon. Musicmagus: zero-shot text-to-music editing via diffusion models. arXiv preprint arXiv:2402.06178, 2024.
Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, and Yuki Mitsufuji. Openmu: your swiss army knife for music understanding. arXiv preprint arXiv:2410.15573, 2024.
+Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, and Yuki Mitsufuji. OpenMU: Your Swiss Army Knife for Music Understanding. October 2024. arXiv:2410.15573. URL: http://arxiv.org/abs/2410.15573 (visited on 2024-11-09).